
BERT & Encoder-Only Models

Understanding language by reading in both directions.


Definition

BERT (Bidirectional Encoder Representations from Transformers) is a landmark NLP model released by Google in 2018. Unlike GPT's left-to-right generation, BERT's encoder-only architecture reads text bidirectionally — each token sees the full context in both directions. BERT was pretrained on masked language modeling and next sentence prediction, then fine-tuned on downstream NLP tasks, setting new state of the art on 11 NLP tasks.

BERT's pretraining objectives

BERT was the first model to demonstrate that bidirectional pretraining dramatically outperforms left-to-right language models on understanding tasks. It uses two self-supervised objectives:

| Objective | What it does | Example | Why it matters |
|---|---|---|---|
| Masked Language Modeling (MLM) | Randomly mask 15% of tokens; predict them from context | "The [MASK] sat on the mat" → "cat" | Forces the model to use both left AND right context — bidirectional understanding |
| Next Sentence Prediction (NSP) | Predict whether sentence B follows sentence A | IsNextSentence or NotNextSentence | Intended for sentence-pair tasks (QA, entailment) — later shown less useful |

BERT masked language modeling inference

from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM
import torch

# High-level: use fill-mask pipeline
mlm = pipeline("fill-mask", model="bert-base-uncased")
results = mlm("The [MASK] sat on the mat.")
for r in results[:3]:
    print(f"{r['token_str']:12s} ({r['score']:.3f})")
# cat        (0.812)
# dog        (0.043)
# man        (0.018)

# Lower-level: get BERT hidden states (for downstream use)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Last hidden state: (batch=1, seq_len, hidden=768)
hidden_states = outputs.hidden_states[-1]
cls_embedding = hidden_states[:, 0, :]   # [CLS] token for classification tasks
token_embeddings = hidden_states[:, 1:-1, :]  # word tokens (no special tokens)

MLM masking strategy

Of the 15% selected tokens: 80% are replaced with [MASK], 10% replaced with a random token, 10% kept unchanged. The 10% random and 10% unchanged prevent the model from learning to only predict masked positions — it must maintain good representations for all tokens.
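The 80/10/10 split can be sketched in plain Python. This is an illustrative toy, not the Hugging Face data collator; `mlm_corrupt` and the tiny vocabulary are made up for the example:

```python
import random

def mlm_corrupt(tokens, vocab, mask_prob=0.15, seed=0):
    """Toy BERT-style MLM corruption: select ~15% of positions as targets,
    then 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # ~85% of positions untouched
        targets[i] = tok                      # model must recover the original
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"           # 80% of targets: mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10% of targets: random token
        # remaining 10% of targets: token kept unchanged
    return corrupted, targets

vocab = ["cat", "dog", "mat", "sat", "the", "on"]
tokens = "the cat sat on the mat".split()
corrupted, targets = mlm_corrupt(tokens, vocab)
print(corrupted, targets)
```

Because some targets are left unchanged, the model cannot tell which positions it will be graded on, so it must keep good representations for every token.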

BERT architecture and variants

| Model | Layers | Heads | Hidden dim | Params | Key improvement |
|---|---|---|---|---|---|
| BERT-Base | 12 | 12 | 768 | 110M | Original — bidirectional encoder baseline |
| BERT-Large | 24 | 16 | 1024 | 340M | Larger, better — expensive to fine-tune |
| RoBERTa | 12–24 | 12–16 | 768–1024 | 125M–355M | Better training: more data, no NSP, dynamic masking |
| DistilBERT | 6 | 12 | 768 | 66M | 40% smaller, 60% faster, 97% of BERT quality via distillation |
| ALBERT | 12 | 12 | 768 | 12M | Cross-layer weight sharing + factorized embeddings — tiny but effective |
| DeBERTa v3 | 12–24 | 12–16 | 768–1024 | 86M–900M | Disentangled attention (content + position separate) — SOTA encoder |

Which BERT variant to use in 2025

For most classification/NER tasks: DeBERTa-v3-base (86M params) outperforms the original BERT-Large at a fraction of the size. For sentence embeddings and RAG: all-MiniLM-L6-v2 or all-mpnet-base-v2 (Sentence-BERT). For multilingual: mDeBERTa-v3-base covers 100 languages.

Fine-tuning BERT for downstream tasks

BERT's pretrained representations serve as a universal starting point for NLP tasks — just add a small task head and fine-tune for 2–5 epochs:

Fine-tuning BERT for text classification (sentiment analysis)

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np

model_name = "distilbert-base-uncased"   # 66M params — fast, good quality
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2  # positive / negative
)

dataset = load_dataset("imdb")
tokenized = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    batched=True
)

args = TrainingArguments(
    output_dir="./bert-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,          # Small LR — pretrained weights are delicate
    eval_strategy="epoch",
    fp16=True,
)

trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(5000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(1000)),  # IMDb test split is sorted by label: shuffle before subsampling
    compute_metrics=lambda p: {
        "accuracy": (np.argmax(p.predictions, axis=1) == p.label_ids).mean()
    }
)
trainer.train()
# Typical result: ~93% accuracy on IMDb after 3 epochs, ~10 minutes on a T4 GPU
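Under the hood, the task head that AutoModelForSequenceClassification attaches is small: dropout plus a linear layer over the [CLS] hidden state. A minimal sketch with dummy tensors (shapes assumed for illustration, not the exact Hugging Face module):

```python
import torch
import torch.nn as nn

# Toy stand-in for the encoder output: (batch=4, seq_len=16, hidden=768)
last_hidden = torch.randn(4, 16, 768)

# The "task head": dropout + linear, mapping the 768-dim [CLS] vector to 2 logits
head = nn.Sequential(nn.Dropout(0.1), nn.Linear(768, 2))

cls_vec = last_hidden[:, 0, :]   # [CLS] is always the first token
logits = head(cls_vec)           # (4, 2) — fed to cross-entropy during fine-tuning
print(logits.shape)              # torch.Size([4, 2])
```

Fine-tuning updates both this head (from scratch) and the pretrained encoder weights (gently, hence the small learning rate above).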

Encoder vs decoder models

| Architecture | Examples | Attention pattern | Best for | Can generate text? |
|---|---|---|---|---|
| Encoder-only (BERT family) | BERT, RoBERTa, DeBERTa, ALBERT | Fully bidirectional — each token sees all others | Classification, NER, extractive QA, embeddings | ❌ No |
| Decoder-only (GPT family) | GPT-4, LLaMA 3, Mistral, Claude | Causal — each token sees only past tokens | Text generation, chat, code, instruction following | ✅ Yes |
| Encoder-decoder (T5/BART family) | T5, BART, FLAN-T5, mT5 | Bidirectional encoder + causal decoder | Summarization, translation, abstractive QA | ✅ Yes |
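The attention-pattern column is the core architectural difference. A small sketch of the two masks in plain torch (illustrative only, not tied to any particular model's implementation):

```python
import torch

seq_len = 5

# Encoder-only (BERT): every token may attend to every other token
bidirectional = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Decoder-only (GPT): token i may attend only to positions j <= i
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# In attention, disallowed positions are set to -inf before the softmax,
# so they receive exactly zero probability
scores = torch.randn(seq_len, seq_len)
probs = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)
# each row of probs sums to 1, with zero mass on "future" positions
```

The bidirectional mask is why BERT can use right context for understanding, and also why it cannot generate text autoregressively: every token already saw the whole sequence.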

Why decoder-only models won

GPT-style decoder-only models scaled better. With enough data and compute, they can handle understanding tasks too (via prompting), making separate encoder models less necessary. By 2023, instruction-tuned decoder-only LLMs (GPT-4, Claude, LLaMA) outperformed BERT on most NLU tasks, despite BERT's architectural advantage for bidirectional understanding.

Sentence embeddings from BERT

Vanilla BERT produces poor sentence embeddings — averaging token vectors results in anisotropic representations (all embeddings clustered in a narrow cone). Sentence-BERT (SBERT) fixes this:

SBERT sentence embeddings for semantic similarity and RAG

from sentence_transformers import SentenceTransformer, util
import torch

# Pretrained SBERT model fine-tuned specifically for semantic similarity
model = SentenceTransformer("all-MiniLM-L6-v2")  # 22M params, fast, great quality

# Encode sentences to fixed 384-dim vectors
sentences = [
    "What is the capital of France?",
    "Paris is the capital city of France.",
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)  # (4, 384)

# Semantic similarity (cosine)
sims = util.cos_sim(embeddings, embeddings)
print("Q vs Answer 1:", sims[0][1].item())   # → ~0.85 (high — semantically related)
print("Q vs Answer 2:", sims[0][2].item())   # → ~0.62 (medium)
print("Q vs Code:", sims[0][3].item())        # → ~0.10 (low — unrelated)

# For RAG: encode your document chunks once, store in vector DB
# At query time: encode query, find top-k by cosine similarity
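The query-time step described in the comments above can be sketched with plain tensor ops. Random unit vectors stand in for real `model.encode` outputs here, so the retrieved indices are meaningless; only the mechanics are the point:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for real embeddings (in practice: model.encode on chunks / query)
doc_embeddings = F.normalize(torch.randn(1000, 384), dim=-1)  # 1000 chunks, 384-dim
query = F.normalize(torch.randn(1, 384), dim=-1)

# On unit-normalized vectors, cosine similarity is just a dot product
scores = query @ doc_embeddings.T          # (1, 1000) similarity scores
top = torch.topk(scores, k=5, dim=-1)      # 5 most similar chunks
print(top.indices[0].tolist())             # chunk indices to feed to the LLM
```

A vector database (FAISS, Qdrant, etc.) does the same computation with an approximate index so it stays fast at millions of chunks.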

Best embedding models in 2025

For English: text-embedding-3-large (OpenAI, 3072 dim), voyage-3 (Voyage AI), or all-mpnet-base-v2 (open source, 768 dim). For multilingual: multilingual-e5-large or e5-mistral-7b-instruct (open source). For code: CodeBERT, UniXcoder. For production RAG: voyage-3 and text-embedding-3-large consistently top the MTEB leaderboard.

Practice questions

  1. What is masked language modeling (MLM) and how does it pre-train BERT? (Answer: MLM: randomly mask 15% of input tokens ([MASK]), train the model to predict the original token from context. Unlike causal LM (GPT), which only attends to previous tokens, MLM gives BERT bidirectional context — each token attends to all tokens. This enables BERT to learn deep bidirectional representations: predicting [MASK] in 'The [MASK] sat on the mat' uses both 'The' and 'on the mat'. The 15% masking: 80% replaced with [MASK], 10% random word, 10% unchanged — prevents the model from only learning [MASK] tokens.)
  2. What is BERT's next sentence prediction (NSP) task and why was it later found to be unhelpful? (Answer: NSP: train BERT to classify whether sentence B follows sentence A in the original text (50% true pairs, 50% random). Intended to improve sentence-pair tasks (NLI, QA). RoBERTa (2019) ablated NSP and found removing it IMPROVED performance on most benchmarks. NSP was too easy — the model learned topic mismatch rather than real discourse understanding. Random sentence pairs often have different topics, making classification trivial. Modern BERT variants (RoBERTa, DeBERTa) drop NSP, training longer on MLM only.)
  3. What is the difference between BERT-base and BERT-large, and when would you choose each? (Answer: BERT-base: 12 layers, 12 heads, 768 d_model, 110M parameters. BERT-large: 24 layers, 16 heads, 1024 d_model, 340M parameters. BERT-large scores ~2–4 points higher on GLUE benchmarks. BERT-base: preferred for production (2-3× faster inference, 3× less memory). BERT-large: preferred for research or when maximum accuracy matters and you have GPU budget. For most production NLP tasks (NER, classification, sentence similarity): BERT-base fine-tuned on domain data outperforms BERT-large on general data.)
  4. What is the [CLS] token in BERT and how is it used for classification? (Answer: [CLS] (classification token): prepended to every input sequence. During pre-training, the hidden state of [CLS] at the final layer is used to predict NSP — so BERT is trained to aggregate sequence-level information into [CLS]. For downstream classification fine-tuning: take the [CLS] final hidden state (768-dim vector), add a linear classifier head on top, and fine-tune on labeled data. [CLS] acts as a sequence summary vector. Alternative: mean-pool all token embeddings — often performs similarly or better for sentence similarity tasks (Sentence-BERT).)
  5. What is DeBERTa and what two innovations improved on BERT's attention? (Answer: DeBERTa (He et al., Microsoft 2020): (1) Disentangled attention: content and position embeddings are separate — attention between tokens computed using 4 terms: content-to-content, content-to-position, position-to-content, position-to-position. Better position-aware representations. (2) Enhanced mask decoder (EMD): uses absolute position information in the final decoding layers while using relative position in attention. DeBERTa-v3 (186M params) outperforms BERT-large (340M) and RoBERTa-large (355M) on GLUE/SuperGLUE while being smaller and faster.)
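The mean-pooling alternative mentioned in question 4 has one subtlety: padding tokens must be excluded from the average. A sketch with dummy tensors standing in for real BERT outputs (shapes assumed):

```python
import torch

# Dummy encoder output: (batch=2, seq_len=6, hidden=768)
last_hidden = torch.randn(2, 6, 768)
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0],   # sequence 1 has 2 pad tokens
                               [1, 1, 1, 1, 1, 1]])  # sequence 2 has none

mask = attention_mask.unsqueeze(-1).float()          # (2, 6, 1) for broadcasting
summed = (last_hidden * mask).sum(dim=1)             # zero out padding, sum tokens
counts = mask.sum(dim=1).clamp(min=1e-9)             # number of real tokens per row
sentence_embeddings = summed / counts                # (2, 768) masked mean
print(sentence_embeddings.shape)                     # torch.Size([2, 768])
```

This masked mean is what sentence-transformers models typically use as their pooling layer, in place of the [CLS] vector.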

