
NLP Text Preprocessing Pipeline

Cleaning and normalizing raw text before any model can understand it.


Definition

Text preprocessing is typically the first stage of an NLP pipeline. Raw text from the web, PDFs, or user input is messy: it contains HTML tags, inconsistent casing, contractions, punctuation, and irrelevant filler words. Preprocessing transforms this noise into clean, normalized token sequences that downstream models can learn from efficiently. The standard stages are tokenization, case folding, stop-word removal, stemming or lemmatization, and text normalization.

Real-life analogy: The recipe ingredients

Before a chef cooks, they wash vegetables, peel them, and cut them to uniform sizes. A recipe that calls for 'diced onion' cannot use a whole onion with the skin on. Text preprocessing does the same: it washes (removes noise), peels (removes stop words), and dices (tokenizes) raw text into a uniform form that machine learning models can use as ingredients.

Stage 1 — Tokenization

Tokenization splits a text stream into discrete units called tokens. Word tokenization splits on whitespace and punctuation. Sentence tokenization splits on sentence boundaries. Sub-word tokenization (BPE, WordPiece) splits words into smaller fragments — used by BERT, GPT, and all modern LLMs.

Word and sentence tokenization with NLTK

import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Dr. Smith runs 5km daily. He won't stop until he reaches his goal!"

# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)
# ['Dr. Smith runs 5km daily.', "He won't stop until he reaches his goal!"]

# Word tokenization  
tokens = word_tokenize(text)
print(tokens)
# ['Dr.', 'Smith', 'runs', '5km', 'daily', '.', 'He', "wo", "n't", 'stop', ...]

# Sub-word tokenization with Hugging Face (BERT uses WordPiece, a close cousin of BPE)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("unhappiness", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids'][0]))
# e.g. ['[CLS]', 'un', '##happiness', '[SEP]']  <- exact split depends on the vocabulary

Why sub-word tokenization matters

Word-level tokenization creates huge vocabularies (100k+ words) and cannot handle rare or misspelled words. Sub-word methods such as BPE and WordPiece keep the vocabulary manageable (~30-50k tokens) while covering any word. A word like "unhappiness" can split into sub-words such as ["un", "##happiness"], so the model can relate the prefix "un" to other words like "unkind".
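To make the idea concrete, here is a toy sketch of the BPE training loop (hypothetical corpus and merge count chosen purely for illustration): repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol.

```python
def best_pair(corpus):
    """Find the most frequent adjacent symbol pair across the corpus."""
    pairs = {}
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] = pairs.get(pair, 0) + freq
    return max(pairs, key=pairs.get)

def merge(corpus, pair):
    """Replace each occurrence of `pair` with one merged symbol."""
    out = {}
    for symbols, freq in corpus.items():
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        out[tuple(merged)] = freq
    return out

# Toy corpus: each word starts as a tuple of characters, mapped to its frequency
corpus = {tuple("unhappy"): 5, tuple("unkind"): 4, tuple("happy"): 6}
for _ in range(5):
    corpus = merge(corpus, best_pair(corpus))

print(list(corpus)[0])  # ('un', 'happy') -- "un" learned as a reusable sub-word
```

After five merges, "happy" has collapsed into a single token and "unhappy" is segmented as "un" + "happy", mirroring how real BPE vocabularies share sub-words across related words.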

Stage 2 — Stop words, stemming and lemmatization

Stop word removal: Common words (the, is, at, which) that appear in almost every document carry little discriminative information for tasks like classification or search. Removing them reduces noise. Caution: do NOT remove stop words for tasks where word order matters (sentiment, translation, QA).
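A minimal sketch of stop-word filtering, using a small illustrative stop list (a real pipeline would use a fuller list such as NLTK's stopwords.words('english'), which requires a one-time download):

```python
# Small illustrative stop list -- not exhaustive
STOP_WORDS = {"the", "is", "at", "on", "a", "an", "and", "of", "which", "in"}

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form appears in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The cat sat on the mat at noon".split()
print(remove_stop_words(tokens))   # ['cat', 'sat', 'mat', 'noon']
```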

Stemming chops word endings using heuristic rules: running, runs, runner all become run. Fast but crude — it often creates non-words (studies → studi). Lemmatization uses a vocabulary and morphological analysis to find the dictionary root (lemma): better → good, running → run, geese → goose. Slower but linguistically accurate.

Stemming vs lemmatization comparison

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download(['wordnet', 'omw-1.4'], quiet=True)

stemmer    = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Each word is paired with its WordNet POS tag: 'v' = verb, 'a' = adjective, 'n' = noun
words = [("running", "v"), ("studies", "v"), ("better", "a"),
         ("geese", "n"), ("caring", "v")]

print("Word          Stem       Lemma")
print("-" * 40)
for word, pos in words:
    stem  = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, pos=pos)
    print(f"{word:<14} {stem:<11} {lemma}")

# Output:
# running        run         run
# studies        studi       study        <- stem is a non-word, lemma correct
# better         better      good         <- POS context (adjective) yields "good"
# geese          gees        goose        <- lemma handles the irregular plural
# caring         care        care
Technique       Output                   Speed        Accuracy     Use case
Stemming        May not be a real word   Fast, O(n)   Low-Medium   Search indexing, IR
Lemmatization   Always a valid word      Slower       High         Text classification, NLU

Stage 3 — Text normalization

Normalization makes text consistent: case folding (All Caps → all lower), expanding contractions (can't → cannot), removing HTML/noise, handling emojis and slang for social media text, and number normalization (100, one hundred, 1e2 all representing the same value).

Text normalization pipeline with regex

import re

def normalize(text: str) -> str:
    text = text.lower()                                     # Case folding
    text = re.sub(r'<[^>]+>', ' ', text)                    # Remove HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', '[URL]', text)  # Replace URLs
    text = re.sub(r'@\w+', '[USER]', text)                  # Replace @mentions
    text = re.sub(r'\d+', '[NUM]', text)                    # Normalise numbers
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)                     # Expand contractions
    text = re.sub(r"'re", " are", text)
    text = re.sub(r'\s+', ' ', text).strip()                # Collapse whitespace
    return text

samples = [
    "<b>Breaking:</b> Apple's stock fell 3% today! can't believe it",
    "Visit https://example.com for more @news info 2024",
]
for s in samples:
    print(f"Raw:  {s}")
    print(f"Norm: {normalize(s)}")
    print()
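Putting the stages together, here is a minimal end-to-end sketch in which whitespace tokenization and a tiny illustrative stop list stand in for the NLTK tools shown earlier:

```python
import re

# Tiny illustrative stop list; a real pipeline would use a fuller one
STOP_WORDS = {"the", "is", "for", "a", "an", "and", "to", "it"}

def preprocess(text: str) -> list:
    """Minimal pipeline: normalize -> tokenize -> remove stop words."""
    text = text.lower()                       # case folding
    text = re.sub(r'<[^>]+>', ' ', text)      # strip HTML tags
    text = re.sub(r'[^a-z0-9\s]', ' ', text)  # drop punctuation
    tokens = text.split()                     # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>The model IS ready for deployment!</p>"))
# ['model', 'ready', 'deployment']
```

Each step maps onto a stage above; in practice you would swap in a proper tokenizer and add stemming or lemmatization where the task calls for it.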

Practice questions

  1. What is the key difference between stemming and lemmatization? (Answer: Stemming uses heuristic rules and may produce non-words. Lemmatization uses vocabulary lookup and always produces a valid dictionary form.)
  2. Why should stop words NOT be removed for machine translation? (Answer: Stop words like "not", "the", "a" carry grammatical meaning needed to produce correctly structured target-language sentences.)
  3. A tokenizer splits "New York" into two tokens. What problem does this create for NER? (Answer: Multi-word entities like "New York" get split — the model may not recognize them as a single location. Solutions: n-gram tokenization or BPE sub-word units.)
  4. What does BPE stand for and why is it used in LLMs? (Answer: Byte Pair Encoding. It balances vocabulary size with coverage — handles rare/unseen words by splitting them into known sub-words, keeping vocab to ~30-50k tokens instead of millions.)
  5. Convert "running" to its stem using Porter Stemmer. (Answer: "run". The Porter Stemmer strips the -ing suffix and then removes the doubled consonant: runn → run.)

On LumiChats

LumiChats preprocesses your input text through tokenization and normalization before sending it to AI models. Understanding this pipeline helps you write better prompts — e.g., very long inputs get truncated at the token limit, not the word limit.
