Text preprocessing is the first stage of nearly every NLP pipeline. Raw text from the web, PDFs, or user input is messy: it contains HTML tags, inconsistent casing, contractions, punctuation, and irrelevant filler words. Preprocessing transforms this noise into clean, normalized token sequences that downstream models can learn from efficiently. The standard stages are tokenization, case folding, stop-word removal, stemming or lemmatization, and text normalization.
Real-life analogy: The recipe ingredients
Before a chef cooks, they wash vegetables, peel them, and cut them to uniform sizes. A recipe that calls for 'diced onion' cannot use a whole onion with the skin on. Text preprocessing does the same: it washes (removes noise), peels (removes stop words), and dices (tokenises) raw text into a uniform form that machine learning models can use as ingredients.
Stage 1 — Tokenization
Tokenization splits a text stream into discrete units called tokens. Word tokenization splits on whitespace and punctuation. Sentence tokenization splits on sentence boundaries. Sub-word tokenization (BPE, WordPiece) splits words into smaller fragments — used by BERT, GPT, and all modern LLMs.
Word and sentence tokenization with NLTK
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Dr. Smith runs 5km daily. He won't stop until he reaches his goal!"
# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)
# ['Dr. Smith runs 5km daily.', "He won't stop until he reaches his goal!"]
# Word tokenization
tokens = word_tokenize(text)
print(tokens)
# ['Dr.', 'Smith', 'runs', '5km', 'daily', '.', 'He', "wo", "n't", 'stop', ...]
# Sub-word tokenization (WordPiece) with Hugging Face
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("unhappiness")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# ['[CLS]', 'un', '##happiness', '[SEP]'] <- sub-word split
Why sub-word tokenization matters
Word-level tokenization creates huge vocabularies (100k+ words) and cannot handle rare or misspelled words. Sub-word methods like BPE keep vocabulary size manageable (~30-50k tokens) while handling any word. "unhappiness" becomes ["un", "##happiness"] — the model understands the prefix "un" from other words like "unkind".
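A quick way to see this robustness is to tokenize a rare or misspelled word. The sketch below reuses the bert-base-uncased tokenizer from above; the exact splits depend on the model's vocabulary, so treat the results as illustrative.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Even a deliberately misspelled word maps to known sub-word pieces
# rather than a single out-of-vocabulary token
for word in ["tokenization", "hapiness"]:  # "hapiness" is intentionally misspelled
    print(word, "->", tokenizer.tokenize(word))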
Stage 2 — Stop words, stemming and lemmatization
Stop word removal: Common words (the, is, at, which) that appear in almost every document carry little discriminative information for tasks like classification or search. Removing them reduces noise. Caution: do NOT remove stop words for tasks where word order matters (sentiment, translation, QA).
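A minimal stop-word filtering sketch using NLTK's built-in English stop-word list (the example sentence is arbitrary; swap in a different list for other languages or domains):
import nltk
nltk.download(['punkt', 'stopwords'], quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
tokens = word_tokenize("The cat sat on the mat and stared at the door")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # content words remain: ['cat', 'sat', 'mat', 'stared', 'door']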
Stemming chops word endings using heuristic rules: running, runs, runner all become run. Fast but crude — it often creates non-words (studies → studi). Lemmatization uses a vocabulary and morphological analysis to find the dictionary root (lemma): better → good, running → run, geese → goose. Slower but linguistically accurate.
Stemming vs lemmatization comparison
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet', quiet=True)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ["running", "studies", "better", "geese", "caring"]
# WordNetLemmatizer needs a part-of-speech hint; without one it treats every word as a noun
pos_hints = {"running": "v", "studies": "v", "better": "a", "geese": "n", "caring": "v"}
print(f"{'Word':<14} {'Stem':<11} Lemma")
print("-" * 40)
for word in words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, pos=pos_hints[word])
    print(f"{word:<14} {stem:<11} {lemma}")
# Output:
# running        run         run
# studies        studi       study   <- stem is not a real word, lemma is correct
# better         better      good    <- adjective POS context maps "better" to "good"
# geese          gees        goose   <- lemma handles the irregular plural
# caring         care        care
| Technique | Output | Speed | Accuracy | Use case |
|---|---|---|---|---|
| Stemming | May not be real word | Fast O(n) | Low-Medium | Search indexing, IR |
| Lemmatization | Always valid word | Slower | High | Text classification, NLU |
Stage 3 — Text normalization
Normalization makes text consistent: case folding (All Caps → all lower), expanding contractions (can't → cannot), removing HTML/noise, handling emojis and slang for social media text, and number normalization (100, one hundred, 1e2 all representing the same value).
Text normalization pipeline with regex
import re
def normalize(text: str) -> str:
    text = text.lower()                                 # Case folding
    text = re.sub(r'<[^>]+>', ' ', text)                # Remove HTML tags
    text = re.sub(r'http\S+|www\.\S+', '[URL]', text)   # Replace URLs
    text = re.sub(r'@\w+', '[USER]', text)              # Replace @mentions
    text = re.sub(r'\d+', '[NUM]', text)                # Normalise numbers
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)                 # Expand contractions
    text = re.sub(r"'re", " are", text)
    text = re.sub(r'\s+', ' ', text).strip()            # Collapse whitespace
    return text
samples = [
    "<b>Breaking:</b> Apple's stock fell 3% today! can't believe it",
    "Visit https://example.com for more @news info 2024",
]
for s in samples:
    print(f"Raw: {s}")
    print(f"Norm: {normalize(s)}")
    print()
Practice questions
- What is the key difference between stemming and lemmatization? (Answer: Stemming uses heuristic rules and may produce non-words. Lemmatization uses vocabulary lookup and always produces a valid dictionary form.)
- Why should stop words NOT be removed for machine translation? (Answer: Stop words like "not", "the", "a" carry grammatical meaning needed to produce correctly structured target-language sentences.)
- A tokeniser splits "New York" into two tokens. What problem does this create for NER? (Answer: Multi-word entities like "New York" get split across tokens, so a model that labels tokens in isolation may not recognize them as a single location. Common fixes: multi-word or n-gram tokenization, or span-based BIO tagging that marks "New" and "York" as parts of one entity.)
- What does BPE stand for and why is it used in LLMs? (Answer: Byte Pair Encoding. It balances vocabulary size with coverage — handles rare/unseen words by splitting them into known sub-words, keeping vocab to ~30-50k tokens instead of millions.)
- Convert "running" to its stem using Porter Stemmer. (Answer: "run" — Porter Stemmer removes the -ning suffix following its suffix-stripping rules.)
On LumiChats
LumiChats preprocesses your input text through tokenization and normalization before sending it to AI models. Understanding this pipeline helps you write better prompts — e.g., very long inputs get truncated at the token limit, not the word limit.
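Token limits and word limits differ because sub-word tokenizers usually produce more tokens than whitespace-separated words. The sketch below reuses the bert-base-uncased tokenizer from earlier purely for illustration; LumiChats' own tokenizer and limits may differ.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Preprocessing transforms raw, unstructured text into normalized token sequences."
print("Words :", len(text.split()))              # whitespace word count
print("Tokens:", len(tokenizer.tokenize(text)))  # sub-word token count, usually larger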
Try it free