Text preprocessing is the first stage of nearly every NLP pipeline. Raw text from the web, PDFs, or user input is messy: it contains HTML tags, inconsistent casing, contractions, punctuation, and irrelevant filler words. Preprocessing transforms this noise into clean, normalized token sequences that downstream models can learn from efficiently. The standard stages are tokenization, case folding, stop-word removal, stemming or lemmatization, and text normalization.
Real-life analogy: The recipe ingredients
Before a chef cooks, they wash vegetables, peel them, and cut them to uniform sizes. A recipe that calls for 'diced onion' cannot use a whole onion with the skin on. Text preprocessing does the same: it washes (removes noise), peels (removes stop words), and dices (tokenises) raw text into a uniform form that machine learning models can use as ingredients.
Stage 1 — Tokenization
Tokenization splits a text stream into discrete units called tokens. Word tokenization splits on whitespace and punctuation. Sentence tokenization splits on sentence boundaries. Sub-word tokenization (BPE, WordPiece) splits words into smaller fragments — used by BERT, GPT, and all modern LLMs.
Word and sentence tokenization with NLTK
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Dr. Smith runs 5km daily. He won't stop until he reaches his goal!"
# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)
# ['Dr. Smith runs 5km daily.', "He won't stop until he reaches his goal!"]
# Word tokenization
tokens = word_tokenize(text)
print(tokens)
# ['Dr.', 'Smith', 'runs', '5km', 'daily', '.', 'He', "wo", "n't", 'stop', ...]
# Sub-word tokenization (WordPiece) with Hugging Face
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("unhappiness")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# ['[CLS]', 'un', '##happiness', '[SEP]'] <- sub-word split
Why sub-word tokenization matters
Word-level tokenization creates huge vocabularies (100k+ words) and cannot handle rare or misspelled words. Sub-word methods like BPE keep vocabulary size manageable (~30-50k tokens) while handling any word. "unhappiness" becomes ["un", "##happiness"] — the model understands the prefix "un" from other words like "unkind".
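A quick way to see this robustness is to tokenize a rare or misspelled word. The sketch below reuses the bert-base-uncased tokenizer from above; the exact splits depend on the model's vocabulary, so treat the results as illustrative.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Even a deliberately misspelled word maps to known sub-word pieces
# rather than a single out-of-vocabulary token
for word in ["tokenization", "hapiness"]:  # "hapiness" is intentionally misspelled
    print(word, "->", tokenizer.tokenize(word))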
Stage 2 — Stop words, stemming and lemmatization
Stop word removal: Common words (the, is, at, which) that appear in almost every document carry little discriminative information for tasks like classification or search. Removing them reduces noise. Caution: do NOT remove stop words for tasks where word order matters (sentiment, translation, QA).
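A minimal stop-word filtering sketch using NLTK's built-in English stop-word list (the example sentence is arbitrary; swap in a different list for other languages or domains):
import nltk
nltk.download(['punkt', 'stopwords'], quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
tokens = word_tokenize("The cat sat on the mat and stared at the door")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # content words remain: ['cat', 'sat', 'mat', 'stared', 'door']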
Stemming chops word endings using heuristic rules: running, runs, runner all become run. Fast but crude — it often creates non-words (studies → studi). Lemmatization uses a vocabulary and morphological analysis to find the dictionary root (lemma): better → good, running → run, geese → goose. Slower but linguistically accurate.
Stemming vs lemmatization comparison
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet', quiet=True)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["running", "studies", "better", "geese", "caring"]
# WordNetLemmatizer needs a part-of-speech hint; without one it treats every word as a noun
pos_hints = {"running": "v", "studies": "v", "better": "a", "geese": "n", "caring": "v"}
print(f"{'Word':<14} {'Stem':<11} Lemma")
print("-" * 40)
for word in words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, pos=pos_hints[word])
    print(f"{word:<14} {stem:<11} {lemma}")
# Output:
# running        run         run
# studies        studi       study   <- stem is not a real word, lemma is correct
# better         better      good    <- adjective POS context maps "better" to "good"
# geese          gees        goose   <- lemma handles the irregular plural
# caring         care        care
| Technique | Output | Speed | Accuracy | Use case |
|---|---|---|---|---|
| Stemming | May not be real word | Fast O(n) | Low-Medium | Search indexing, IR |
| Lemmatization | Always valid word | Slower | High | Text classification, NLU |
Stage 3 — Text normalization
Normalization makes text consistent: case folding (All Caps → all lower), expanding contractions (can't → cannot), removing HTML/noise, handling emojis and slang for social media text, and number normalization (100, one hundred, 1e2 all representing the same value).
Text normalization pipeline with regex
import re
def normalize(text: str) -> str:
    text = text.lower()                                 # Case folding
    text = re.sub(r'<[^>]+>', ' ', text)                # Remove HTML tags
    text = re.sub(r'http\S+|www\.\S+', '[URL]', text)   # Replace URLs
    text = re.sub(r'@\w+', '[USER]', text)              # Replace @mentions
    text = re.sub(r'\d+', '[NUM]', text)                # Normalise numbers
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)                 # Expand contractions
    text = re.sub(r"'re", " are", text)
    text = re.sub(r'\s+', ' ', text).strip()            # Collapse whitespace
    return text
samples = [
    "<b>Breaking:</b> Apple's stock fell 3% today! can't believe it",
    "Visit https://example.com for more @news info 2024",
]
for s in samples:
    print(f"Raw: {s}")
    print(f"Norm: {normalize(s)}")
    print()
Practice questions
- What is the key difference between stemming and lemmatization? (Answer: Stemming uses heuristic rules and may produce non-words. Lemmatization uses vocabulary lookup and always produces a valid dictionary form.)
- Why should stop words NOT be removed for machine translation? (Answer: Stop words like "not", "the", "a" carry grammatical meaning needed to produce correctly structured target-language sentences.)
- A tokeniser splits "New York" into two tokens. What problem does this create for NER? (Answer: Multi-word entities like "New York" get split across tokens, so a model that labels tokens in isolation may not recognize them as a single location. Common fixes: multi-word or n-gram tokenization, or span-based BIO tagging that marks "New" and "York" as parts of one entity.)
- What does BPE stand for and why is it used in LLMs? (Answer: Byte Pair Encoding. It balances vocabulary size with coverage — handles rare/unseen words by splitting them into known sub-words, keeping vocab to ~30-50k tokens instead of millions.)
- Convert "running" to its stem using Porter Stemmer. (Answer: "run" — Porter Stemmer removes the -ning suffix following its suffix-stripping rules.)
On LumiChats
LumiChats preprocesses your input text through tokenization and normalization before sending it to AI models. Understanding this pipeline helps you write better prompts — e.g., very long inputs get truncated at the token limit, not the word limit.
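Token limits and word limits differ because sub-word tokenizers usually produce more tokens than whitespace-separated words. The sketch below reuses the bert-base-uncased tokenizer from earlier purely for illustration; LumiChats' own tokenizer and limits may differ.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Preprocessing transforms raw, unstructured text into normalized token sequences."
print("Words :", len(text.split()))              # whitespace word count
print("Tokens:", len(tokenizer.tokenize(text)))  # sub-word token count, usually larger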
Try it free