Text classification assigns a predefined label to a piece of text. Sentiment analysis is the most common special case: classifying text as positive, negative, or neutral. Classification powers spam filters, news categorisation, intent detection in chatbots, hate speech detection, medical triage, and product review analysis. Methods range from traditional Naive Bayes and SVM classifiers to fine-tuned transformer models such as BERT, which reach human-level accuracy on many benchmarks.
Real-life analogy: The email triage clerk
Imagine a company receives thousands of emails daily. A clerk reads each one and sorts it into folders: Sales Inquiry, Technical Support, Billing, Spam, Complaint. This is multi-class text classification. The clerk learns patterns from experience: emails containing 'refund' and 'angry' go to Complaints; emails with 'free money' and 'click here' go to Spam. Machine learning models learn exactly these patterns — automatically, from labelled examples.
Naive Bayes for text classification
Naive Bayes is the classic text classifier. It applies Bayes' theorem with the naive conditional independence assumption: given the class, all words are independent. Despite this unrealistic assumption, it works well for spam filtering and topic classification.
Naive Bayes picks the class with the highest posterior for a document d = (w_1, ..., w_n): c* = argmax_c P(c) · P(w_1|c) · ... · P(w_n|c), where P(c) is the class prior and P(w_i|c) is the likelihood of word w_i under class c. In practice the product is computed as a sum of log-probabilities to avoid floating-point underflow.
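To make the arithmetic concrete, here is a minimal from-scratch sketch; the toy corpus counts are invented for illustration. It scores a document against each class using log-probabilities and add-one (Laplace) smoothing, so no unseen word ever zeroes out the whole product:

```python
import math

# Toy training counts (invented for illustration)
class_docs = {"spam": 3, "ham": 3}  # documents per class
word_counts = {
    "spam": {"free": 3, "win": 2, "meeting": 0},
    "ham":  {"free": 0, "win": 0, "meeting": 2},
}
vocab = {"free", "win", "meeting"}

def log_posterior(doc_words, c):
    total_docs = sum(class_docs.values())
    score = math.log(class_docs[c] / total_docs)  # log P(c), the class prior
    total_words = sum(word_counts[c].values())
    for w in doc_words:
        # Add-one smoothing: every word keeps a non-zero probability
        p = (word_counts[c].get(w, 0) + 1) / (total_words + len(vocab))
        score += math.log(p)  # accumulate log P(w|c)
    return score

doc = ["free", "win"]
best = max(class_docs, key=lambda c: log_posterior(doc, c))
print(best)  # spam
```

The same smoothing idea is what the alpha parameter of scikit-learn's MultinomialNB controls in the pipeline example that follows.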
Spam classifier with Naive Bayes and TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Sample data (in practice: 10k+ examples)
texts = [
    "Win free iPhone now click here",
    "Meeting at 3pm tomorrow in conference room",
    "Claim your prize you have been selected",
    "Can you review the quarterly report",
    "FREE CASH no credit check apply now",
    "Hi please find attached the project update",
]
labels = ["spam", "ham", "spam", "ham", "spam", "ham"]
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42)
# Pipeline: TF-IDF vectorisation + Naive Bayes
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
    ("nb", MultinomialNB(alpha=0.1)),  # additive (Lidstone) smoothing
])
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
# Predict new email
print(clf.predict(["Congratulations! You won $1000 click to claim"]))  # ['spam']
Sentiment analysis — beyond binary
Binary sentiment: positive / negative. Fine-grained sentiment: 1-5 star rating prediction. Aspect-based sentiment analysis (ABSA): 'The food was great but the service was terrible' — positive on food aspect, negative on service aspect. ABSA requires identifying both the aspect term and its polarity.
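Production ABSA uses aspect-specific fine-tuned models or LLM prompting, but the core idea (find the aspect term, then judge polarity from the words around it) can be sketched with a deliberately tiny polarity lexicon; every word weight and the context window below are invented for illustration:

```python
# Toy aspect-based sentiment: judge each aspect from words near it.
# The lexicon and window size are invented for illustration only.
POLARITY = {"great": 1, "gorgeous": 1, "terrible": -1, "poor": -1}

def aspect_sentiment(text, aspects, window=2):
    tokens = text.lower().replace(".", "").split()
    results = {}
    for aspect in aspects:
        if aspect not in tokens:
            continue
        i = tokens.index(aspect)
        # Look at a small window of tokens around the aspect term
        nearby = tokens[max(0, i - window): i + window + 1]
        score = sum(POLARITY.get(t, 0) for t in nearby)
        results[aspect] = ("POSITIVE" if score > 0
                           else "NEGATIVE" if score < 0
                           else "NEUTRAL")
    return results

print(aspect_sentiment("The food was great but the service was terrible",
                       aspects=["food", "service"]))
# {'food': 'POSITIVE', 'service': 'NEGATIVE'}
```

Real systems replace the lexicon and fixed window with a model that learns which opinion words attach to which aspect, which is exactly what the fine-tuned approaches below provide.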
Sentiment analysis with Hugging Face transformers
from transformers import pipeline
# Off-the-shelf sentiment model (already fine-tuned on SST-2)
sentiment = pipeline("sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english")
texts = [
"This product is absolutely amazing, I love it!",
"Worst purchase I have ever made. Do not buy.",
"It was okay, nothing special but does the job.",
]
for text in texts:
    result = sentiment(text)[0]
    print(f"{result['label']:<10} ({result['score']:.2%}) {text[:50]}")
# Output:
# POSITIVE (99.98%) This product is absolutely amazing...
# NEGATIVE (99.91%) Worst purchase I have ever made...
# NEGATIVE (52.38%) It was okay, nothing special...
# For ABSA: use aspect-specific models or prompt an LLM, e.g.
# "Classify the sentiment toward [food] in: 'food great service terrible'"

| Method | Data needed | Accuracy | Training time | Best for |
|---|---|---|---|---|
| Naive Bayes + TF-IDF | Small (100s) | Medium (~85%) | Seconds | Fast prototyping, spam |
| SVM + TF-IDF | Medium (1k+) | Good (~88%) | Minutes | Short text, interpretable |
| LSTM/RNN | Large (10k+) | Better (~91%) | Hours | Sequential patterns |
| BERT fine-tuned | Medium (often 1k+) | SOTA (~95%) | Hours (GPU) | Production NLP tasks |
| GPT-4 zero-shot | None | Very good (~92%) | Instant | When no labelled data |
Evaluation metrics for classification
Precision: of all predicted positives, how many are actually positive. Recall: of all actual positives, how many were found. F1: the harmonic mean of precision and recall — useful when the class distribution is imbalanced. Use macro-F1 (the unweighted mean of per-class F1 scores) for multi-class problems when classes are equally important.
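These definitions can be checked by hand. Below is a small pure-Python sketch (the label lists are invented) that computes per-class precision, recall, and F1 from raw counts, then macro-F1 as the unweighted mean of the per-class F1 scores:

```python
def prf1(y_true, y_pred, cls):
    # True positives, false positives, false negatives for one class
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented predictions over three classes
y_true = ["spam", "ham", "ham", "spam", "ham", "promo"]
y_pred = ["spam", "ham", "spam", "spam", "ham", "ham"]

classes = sorted(set(y_true))
scores = {c: prf1(y_true, y_pred, c) for c in classes}
macro_f1 = sum(f for _, _, f in scores.values()) / len(classes)
for c, (p, r, f) in scores.items():
    print(f"{c:<6} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro-F1 = {macro_f1:.2f}")
```

Note how "promo" is never predicted, so its F1 of 0 drags macro-F1 down; a weighted average would largely hide that failure.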
Accuracy is misleading for imbalanced data
If 99% of emails are ham and 1% are spam, a classifier that always predicts ham gets 99% accuracy but 0% recall on spam. Always report precision, recall, and F1 per class. For medical diagnosis, recall (sensitivity) is critical — missing a disease (false negative) is worse than a false alarm.
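The trap is easy to reproduce. The following sketch uses a synthetic 99:1 ham/spam split (the counts are invented to match the example) and a degenerate classifier that always predicts ham:

```python
# 990 ham, 10 spam: always predicting "ham" looks great on accuracy
y_true = ["ham"] * 990 + ["spam"] * 10
y_pred = ["ham"] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the spam class: how many real spam emails were caught?
spam_caught = sum(t == "spam" and p == "spam" for t, p in zip(y_true, y_pred))
spam_total = sum(t == "spam" for t in y_true)
spam_recall = spam_caught / spam_total

print(f"accuracy    = {accuracy:.1%}")    # 99.0%
print(f"spam recall = {spam_recall:.1%}")  # 0.0%
```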
Practice questions
- Naive Bayes classifies "free money" as spam. The word "free" never appeared in ham training data. What problem occurs? (Answer: Zero probability — P("free"|ham) = 0, making the entire product 0. Solution: Laplace (add-1) smoothing adds 1 to all word counts.)
- A spam filter has precision 0.95, recall 0.70. What does this mean? (Answer: Of predicted spam, 95% is actually spam (few false positives), but 30% of actual spam slips through (low recall). The right trade-off depends on costs: letting spam through is annoying, but sending legitimate mail to the spam folder is usually worse, so spam filters often deliberately favour precision.)
- Why is BERT better than TF-IDF + SVM for sentiment on "I am not unhappy"? (Answer: TF-IDF treats the text as an unordered bag of words, so "not" and "unhappy" become independent features and the negation is lost. BERT reads the full sequence bidirectionally in context and resolves "not unhappy" ≈ mildly positive.)
- What is aspect-based sentiment analysis? Give an example. (Answer: Identifying sentiment at the entity level. "The battery life is poor but the screen is gorgeous" → battery: NEGATIVE, screen: POSITIVE.)
- Name two evaluation metrics besides accuracy for text classification. (Answer: F1 score (harmonic mean of precision and recall) and AUROC (area under the ROC curve — measures discrimination ability at all thresholds).)
On LumiChats
LumiChats uses sentiment-aware context when generating responses — it detects frustration or confusion in your messages and adjusts its tone accordingly. The same classification models power intent detection: are you asking a question, giving a command, or providing feedback?