Text preprocessing is typically the first stage of an NLP pipeline. Raw text from the web, PDFs, or user input is messy — it contains HTML tags, inconsistent casing, contractions, punctuation, and irrelevant filler words. Preprocessing transforms this noise into clean, normalised token sequences that downstream models can learn from efficiently. The standard stages are: tokenisation, case folding, stop-word removal, stemming or lemmatisation, and text normalisation.
Real-life analogy: The recipe ingredients
Before a chef cooks, they wash vegetables, peel them, and cut them to uniform sizes. A recipe that calls for 'diced onion' cannot use a whole onion with the skin on. Text preprocessing does the same: it washes (removes noise), peels (removes stop words), and dices (tokenises) raw text into a uniform form that machine learning models can use as ingredients.
Stage 1 — Tokenisation
Tokenisation splits a text stream into discrete units called tokens. Word tokenisation splits on whitespace and punctuation. Sentence tokenisation splits on sentence boundaries. Sub-word tokenisation (BPE, WordPiece) splits words into smaller fragments — used by BERT, GPT, and all modern LLMs.
Word and sentence tokenisation with NLTK
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Dr. Smith runs 5km daily. He won't stop until he reaches his goal!"
# Sentence tokenisation
sentences = sent_tokenize(text)
print(sentences)
# ['Dr. Smith runs 5km daily.', "He won't stop until he reaches his goal!"]
# Word tokenisation
tokens = word_tokenize(text)
print(tokens)
# ['Dr.', 'Smith', 'runs', '5km', 'daily', '.', 'He', "wo", "n't", 'stop', ...]
# Sub-word tokenisation (WordPiece) with Hugging Face
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # BERT uses WordPiece
encoded = tokenizer("unhappiness", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids'][0]))
# ['[CLS]', 'un', '##happiness', '[SEP]'] <- sub-word split
Why sub-word tokenisation matters
Word-level tokenisation creates huge vocabularies (100k+ words) and cannot handle rare or misspelled words. Sub-word methods like BPE keep vocabulary size manageable (~30-50k tokens) while handling any word. "unhappiness" becomes ["un", "##happiness"] — the model understands the prefix "un" from other words like "unkind".
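The core of BPE training is a simple loop: count every adjacent symbol pair in the corpus, merge the most frequent pair into a new symbol, repeat. The sketch below is a toy illustration of that loop in plain Python — not the implementation any real tokeniser library uses; the corpus and helper names are made up for this example.

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across a {word-as-tuple: frequency} vocab."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(vocab, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word split into characters, with a corpus frequency
vocab = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}
merges = []
for _ in range(3):                      # learn 3 merge rules
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(vocab, pair)
    merges.append(pair)
print(merges)
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The learned merge rules are exactly what a trained BPE tokeniser replays at inference time: a new word is split into characters, then the merges are applied in order, so frequent fragments like "low" survive as single tokens while rare suffixes stay split.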
Stage 2 — Stop words, stemming and lemmatisation
Stop word removal: Common words (the, is, at, which) that appear in almost every document carry little discriminative information for tasks like classification or search. Removing them reduces noise. Caution: do NOT remove stop words for tasks where word order matters (sentiment, translation, QA).
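Stop-word filtering itself is a one-line set lookup. The sketch below uses a tiny hand-picked set to stay self-contained; in practice you would load a full list such as NLTK's `stopwords.words('english')` (about 180 words, after downloading the `stopwords` corpus).

```python
# Tiny illustrative stop-word set; real lists (NLTK, spaCy) are much larger.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an", "and", "of", "in"}

def remove_stop_words(tokens):
    """Keep only tokens that are not stop words (case-insensitive check)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "sat", "on", "the", "mat"]))
# ['cat', 'sat', 'mat']
```

Note that "not" is deliberately absent from the set above — dropping it would flip the meaning of sentences, which is exactly the caution raised for sentiment and translation tasks.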
Stemming chops word endings using heuristic rules: running, runs, runner all become run. Fast but crude — it often creates non-words (studies → studi). Lemmatisation uses a vocabulary and morphological analysis to find the dictionary root (lemma): better → good, running → run, geese → goose. Slower but linguistically accurate.
Stemming vs lemmatisation comparison
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet', quiet=True)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# WordNet needs the right part-of-speech tag: v = verb, n = noun, a = adjective
words = [("running", "v"), ("studies", "v"), ("better", "a"), ("geese", "n"), ("caring", "v")]
print(f"{'Word':<14} {'Stem':<11} Lemma")
print("-" * 40)
for word, pos in words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, pos=pos)
    print(f"{word:<14} {stem:<11} {lemma}")
# Output:
# running        run         run
# studies        studi       study   <- stem is a non-word, lemma correct
# better         better      good    <- adjective POS context maps to "good"
# geese          gees        goose   <- lemma handles the irregular plural
# caring         care        care
| Technique | Output | Speed | Accuracy | Use case |
|---|---|---|---|---|
| Stemming | May not be real word | Fast O(n) | Low-Medium | Search indexing, IR |
| Lemmatisation | Always valid word | Slower | High | Text classification, NLU |
Stage 3 — Text normalisation
Normalisation makes text consistent: case folding (All Caps → all lower), expanding contractions (can't → cannot), removing HTML/noise, handling emojis and slang for social media text, and number normalisation (100, one hundred, 1e2 all representing the same value).
Text normalisation pipeline with regex
import re
def normalise(text: str) -> str:
    text = text.lower()                                     # Case folding
    text = re.sub(r'<[^>]+>', ' ', text)                    # Remove HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', '[URL]', text)  # Replace URLs
    text = re.sub(r'@\w+', '[USER]', text)                  # Replace @mentions
    text = re.sub(r'\d+', '[NUM]', text)                    # Normalise numbers
    text = re.sub(r"can't", "cannot", text)                 # Expand contractions
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r'\s+', ' ', text).strip()                # Collapse whitespace
    return text
samples = [
    "<b>Breaking:</b> Apple's stock fell 3% today! can't believe it",
    "Visit https://example.com for more @news info 2024",
]
for s in samples:
    print(f"Raw:  {s}")
    print(f"Norm: {normalise(s)}")
    print()
Practice questions
- What is the key difference between stemming and lemmatisation? (Answer: Stemming uses heuristic rules and may produce non-words. Lemmatisation uses vocabulary lookup and always produces a valid dictionary form.)
- Why should stop words NOT be removed for machine translation? (Answer: Stop words like "not", "the", "a" carry grammatical meaning needed to produce correctly structured target-language sentences.)
- A tokeniser splits "New York" into two tokens. What problem does this create for NER? (Answer: Multi-word entities like "New York" get split — the model may not recognise them as a single location. Solutions: n-gram tokenisation or BPE sub-word units.)
- What does BPE stand for and why is it used in LLMs? (Answer: Byte Pair Encoding. It balances vocabulary size with coverage — handles rare/unseen words by splitting them into known sub-words, keeping vocab to ~30-50k tokens instead of millions.)
- Convert "running" to its stem using Porter Stemmer. (Answer: "run" — Porter strips the -ing suffix to get "runn", then its cleanup rules reduce the doubled consonant to "run".)
On LumiChats
LumiChats preprocesses your input text through tokenisation and normalisation before sending it to AI models. Understanding this pipeline helps you write better prompts — e.g., very long inputs get truncated at the token limit, not the word limit.
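The word/token distinction is easy to see even with a toy regex tokeniser. This is a deliberate simplification — production services use sub-word tokenisers, so their exact counts will differ — but the gap between whitespace words and tokens is the same phenomenon.

```python
import re

def toy_tokenise(text):
    """Runs of word characters OR single punctuation marks become tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "He won't stop; he can't!"
print(len(text.split()), "whitespace words")   # 5
print(len(toy_tokenise(text)), "tokens")       # 11 -- contractions and punctuation split
```

Five words become eleven tokens, which is why a "4,000-word" input can still overflow a token budget.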