
NLP Text Preprocessing Pipeline

Cleaning and normalising raw text before any model can understand it.


Definition

Text preprocessing is typically the first stage of an NLP pipeline. Raw text from the web, PDFs, or user input is messy — it contains HTML tags, inconsistent casing, contractions, punctuation, and irrelevant filler words. Preprocessing transforms this noise into clean, normalised token sequences that downstream models can learn from efficiently. The standard stages are: tokenisation, case folding, stop-word removal, stemming or lemmatisation, and text normalisation.
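As a minimal sketch, the stages listed above can be chained into a single function. The stop-word list and suffix rules below are toy illustrations for the overall flow, not production-grade components (real pipelines use NLTK or spaCy, covered in the stages below):

```python
import re

STOP_WORDS = {"the", "is", "at", "which", "a", "an", "and", "of", "to"}  # toy list

def preprocess(text: str) -> list[str]:
    """Minimal pipeline: tokenise -> case-fold -> stop-word filter -> crude stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # tokenise + case folding
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    # Toy suffix stripping (real pipelines use Porter stemming or lemmatisation)
    stems = []
    for t in tokens:
        for suffix in ("ing", "es", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

print(preprocess("The cats are running at the park"))
# ['cat', 'are', 'runn', 'park']
```

Note how the toy stemmer produces the non-word "runn" — exactly the kind of artefact that motivates proper stemming or lemmatisation in Stage 2.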

Real-life analogy: The recipe ingredients

Before a chef cooks, they wash vegetables, peel them, and cut them to uniform sizes. A recipe that calls for 'diced onion' cannot use a whole onion with the skin on. Text preprocessing does the same: it washes (removes noise), peels (removes stop words), and dices (tokenises) raw text into a uniform form that machine learning models can use as ingredients.

Stage 1 — Tokenisation

Tokenisation splits a text stream into discrete units called tokens. Word tokenisation splits on whitespace and punctuation. Sentence tokenisation splits on sentence boundaries. Sub-word tokenisation (BPE, WordPiece) splits words into smaller fragments — used by BERT, GPT, and all modern LLMs.

Word and sentence tokenisation with NLTK

import nltk
nltk.download('punkt', quiet=True)   # newer NLTK releases may also need 'punkt_tab'
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Dr. Smith runs 5km daily. He won't stop until he reaches his goal!"

# Sentence tokenisation
sentences = sent_tokenize(text)
print(sentences)
# ['Dr. Smith runs 5km daily.', "He won't stop until he reaches his goal!"]

# Word tokenisation  
tokens = word_tokenize(text)
print(tokens)
# ['Dr.', 'Smith', 'runs', '5km', 'daily', '.', 'He', "wo", "n't", 'stop', ...]

# Sub-word tokenisation (WordPiece) with Hugging Face
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("unhappiness", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids'][0]))
# e.g. ['[CLS]', 'un', '##happiness', '[SEP]']  <- exact split depends on the learned vocabulary

Why sub-word tokenisation matters

Word-level tokenisation creates huge vocabularies (100k+ words) and cannot handle rare or misspelled words. Sub-word methods like BPE keep the vocabulary manageable (~30-50k tokens) while covering any input. A word like "unhappiness" is split into pieces such as "un" + "##happiness" (the exact split depends on the learned vocabulary) — so the model can relate the prefix "un" to other words like "unkind".
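The core of BPE can be sketched in pure Python: repeatedly find the most frequent adjacent symbol pair in the corpus and merge it into a new symbol. This is a toy illustration of the algorithm (the corpus below is the classic low/lower/newest/widest example), not the exact GPT or BERT implementation:

```python
from collections import Counter

def bpe_merges(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a word-frequency dict (toy example)."""
    # Represent each word as a tuple of symbols, starting from single characters
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the vocabulary
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(bpe_merges(corpus, 3))
# [('e', 's'), ('es', 't'), ('l', 'o')]
```

Frequent character sequences ("es", then "est") get merged first, which is why common words end up as single tokens while rare words stay split into reusable fragments.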

Stage 2 — Stop words, stemming and lemmatisation

Stop word removal: Common words (the, is, at, which) that appear in almost every document carry little discriminative information for tasks like classification or search. Removing them reduces noise. Caution: do NOT remove stop words for tasks where word order matters (sentiment, translation, QA).
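Stop-word filtering is just a set-membership check. The hand-picked list below is illustrative only — in practice you would use a fuller list such as NLTK's `nltk.corpus.stopwords.words('english')` (~180 entries):

```python
# Toy stop-word list; real pipelines use nltk.corpus.stopwords or spaCy's list
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an", "and", "in", "of"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens found in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The cat sat on the mat".split()
print(remove_stop_words(tokens))
# ['cat', 'sat', 'mat']
```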

Stemming chops word endings using heuristic rules: running, runs, runner all become run. Fast but crude — it often creates non-words (studies → studi). Lemmatisation uses a vocabulary and morphological analysis to find the dictionary root (lemma): better → good, running → run, geese → goose. Slower but linguistically accurate.

Stemming vs lemmatisation comparison

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download(['wordnet', 'omw-1.4'], quiet=True)

stemmer    = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Each word is paired with its WordNet POS tag: 'v' verb, 'a' adjective, 'n' noun
words = [("running", "v"), ("studies", "v"), ("better", "a"),
         ("geese", "n"), ("caring", "v")]

print("Word          Stem       Lemma")
print("-" * 40)
for word, pos in words:
    stem  = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, pos=pos)
    print(f"{word:<14} {stem:<11} {lemma}")

# Output:
# running        run         run
# studies        studi       study        <- stem is a non-word, lemma correct
# better         better      good         <- adjective POS context yields "good"
# geese          gees        goose        <- lemma handles irregular plural
# caring         care        care
Technique       Output                   Speed       Accuracy     Use case
Stemming        May not be a real word   Fast O(n)   Low-Medium   Search indexing, IR
Lemmatisation   Always a valid word      Slower      High         Text classification, NLU

Stage 3 — Text normalisation

Normalisation makes text consistent: case folding (All Caps → all lower), expanding contractions (can't → cannot), removing HTML/noise, handling emojis and slang for social media text, and number normalisation (100, one hundred, 1e2 all representing the same value).

Text normalisation pipeline with regex

import re

def normalise(text: str) -> str:
    text = text.lower()                             # Case folding
    text = re.sub(r'<[^>]+>', ' ', text)            # Remove HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', '[URL]', text)  # Replace URLs
    text = re.sub(r'@\w+', '[USER]', text)          # Replace @mentions
    text = re.sub(r'\d+', '[NUM]', text)            # Normalise numbers
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)             # Expand contractions
    text = re.sub(r"'re", " are", text)
    text = re.sub(r'\s+', ' ', text).strip()        # Collapse whitespace
    return text

samples = [
    "<b>Breaking:</b> Apple's stock fell 3% today! can't believe it",
    "Visit https://example.com for more @news info 2024",
]
for s in samples:
    print(f"Raw:  {s}")
    print(f"Norm: {normalise(s)}")
    print()

Practice questions

  1. What is the key difference between stemming and lemmatisation? (Answer: Stemming uses heuristic rules and may produce non-words. Lemmatisation uses vocabulary lookup and always produces a valid dictionary form.)
  2. Why should stop words NOT be removed for machine translation? (Answer: Stop words like "not", "the", "a" carry grammatical meaning needed to produce correctly structured target-language sentences.)
  3. A tokeniser splits "New York" into two tokens. What problem does this create for NER? (Answer: Multi-word entities like "New York" get split — the model may not recognise them as a single location. Solutions: n-gram tokenisation or BPE sub-word units.)
  4. What does BPE stand for and why is it used in LLMs? (Answer: Byte Pair Encoding. It balances vocabulary size with coverage — handles rare/unseen words by splitting them into known sub-words, keeping vocab to ~30-50k tokens instead of millions.)
  5. Convert "running" to its stem using Porter Stemmer. (Answer: "run" — Porter strips the -ing suffix, then its double-consonant rule drops one "n".)

On LumiChats

LumiChats preprocesses your input text through tokenisation and normalisation before sending it to AI models. Understanding this pipeline helps you write better prompts — e.g., very long inputs get truncated at the token limit, not the word limit.

