
BERT & Encoder-Only Models

Understanding language by reading in both directions.


Definition

BERT (Bidirectional Encoder Representations from Transformers) is a landmark NLP model released by Google in 2018. Unlike GPT's left-to-right generation, BERT's encoder-only architecture reads text bidirectionally — each token sees the full context in both directions. BERT was pretrained on masked language modeling and next sentence prediction, then fine-tuned on downstream NLP tasks, setting new state of the art on 11 NLP tasks.

BERT's pretraining objectives

BERT was the first model to demonstrate that bidirectional pretraining dramatically outperforms left-to-right language models on understanding tasks. It uses two self-supervised objectives:

| Objective | What it does | Example | Why it matters |
|---|---|---|---|
| Masked Language Modeling (MLM) | Randomly mask 15% of tokens; predict them from context | "The [MASK] sat on the mat" → "cat" | Forces the model to use both left AND right context — bidirectional understanding |
| Next Sentence Prediction (NSP) | Predict whether sentence B follows sentence A | IsNextSentence or NotNextSentence | Intended for sentence-pair tasks (QA, entailment) — later shown less useful |

BERT masked language modeling inference

from transformers import pipeline, AutoTokenizer, AutoModelForMaskedLM
import torch

# High-level: use fill-mask pipeline
mlm = pipeline("fill-mask", model="bert-base-uncased")
results = mlm("The [MASK] sat on the mat.")
for r in results[:3]:
    print(f"{r['token_str']:12s} ({r['score']:.3f})")
# cat        (0.812)
# dog        (0.043)
# man        (0.018)

# Lower-level: get BERT hidden states (for downstream use)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Last hidden state: (batch=1, seq_len, hidden=768)
hidden_states = outputs.hidden_states[-1]
cls_embedding = hidden_states[:, 0, :]   # [CLS] token for classification tasks
token_embeddings = hidden_states[:, 1:-1, :]  # word tokens (no special tokens)

MLM masking strategy

Of the 15% selected tokens: 80% are replaced with [MASK], 10% replaced with a random token, 10% kept unchanged. The 10% random and 10% unchanged prevent the model from learning to only predict masked positions — it must maintain good representations for all tokens.
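The 80/10/10 split can be sketched in plain Python. This is an illustrative toy, not the Hugging Face data collator; `mlm_corrupt` and the tiny vocabulary are made up for the example:

```python
import random

def mlm_corrupt(tokens, vocab, mask_prob=0.15, seed=0):
    """Toy BERT-style MLM corruption: select ~15% of positions as targets,
    then 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # ~85% of positions untouched
        targets[i] = tok                      # model must recover the original
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"           # 80% of targets: mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10% of targets: random token
        # remaining 10% of targets: token kept unchanged
    return corrupted, targets

vocab = ["cat", "dog", "mat", "sat", "the", "on"]
tokens = "the cat sat on the mat".split()
corrupted, targets = mlm_corrupt(tokens, vocab)
print(corrupted, targets)
```

Because some targets are left unchanged, the model cannot tell which positions it will be graded on, so it must keep good representations for every token.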

BERT architecture and variants

| Model | Layers | Heads | Hidden dim | Params | Key improvement |
|---|---|---|---|---|---|
| BERT-Base | 12 | 12 | 768 | 110M | Original — bidirectional encoder baseline |
| BERT-Large | 24 | 16 | 1024 | 340M | Larger, better — expensive to fine-tune |
| RoBERTa | 12–24 | 12–16 | 768–1024 | 125M–355M | Better training: more data, no NSP, dynamic masking |
| DistilBERT | 6 | 12 | 768 | 66M | 40% smaller, 60% faster, 97% of BERT quality via distillation |
| ALBERT | 12 | 12 | 768 | 12M | Cross-layer weight sharing + factorized embeddings — tiny but effective |
| DeBERTa v3 | 12–24 | 12–16 | 768–1024 | 86M–900M | Disentangled attention (content + position separate) — SOTA encoder |

Which BERT variant to use in 2025

For most classification/NER tasks: DeBERTa-v3-base (86M params) outperforms the original BERT-Large at a fraction of the size. For sentence embeddings and RAG: all-MiniLM-L6-v2 or all-mpnet-base-v2 (Sentence-BERT). For multilingual: mDeBERTa-v3-base covers 100 languages.

Fine-tuning BERT for downstream tasks

BERT's pretrained representations serve as a universal starting point for NLP tasks — just add a small task head and fine-tune for 2–5 epochs:

Fine-tuning BERT for text classification (sentiment analysis)

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np

model_name = "distilbert-base-uncased"   # 66M params — fast, good quality
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2  # positive / negative
)

dataset = load_dataset("imdb")
tokenized = dataset.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=512),
    batched=True
)

args = TrainingArguments(
    output_dir="./bert-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,          # Small LR — pretrained weights are delicate
    eval_strategy="epoch",
    fp16=True,
)

trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(5000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(1000)),  # IMDb test split is sorted by label: shuffle before subsampling
    compute_metrics=lambda p: {
        "accuracy": (np.argmax(p.predictions, axis=1) == p.label_ids).mean()
    }
)
trainer.train()
# Typical result: ~93% accuracy on IMDb after 3 epochs, ~10 minutes on a T4 GPU
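Under the hood, the task head that AutoModelForSequenceClassification attaches is small: dropout plus a linear layer over the [CLS] hidden state. A minimal sketch with dummy tensors (shapes assumed for illustration, not the exact Hugging Face module):

```python
import torch
import torch.nn as nn

# Toy stand-in for the encoder output: (batch=4, seq_len=16, hidden=768)
last_hidden = torch.randn(4, 16, 768)

# The "task head": dropout + linear, mapping the 768-dim [CLS] vector to 2 logits
head = nn.Sequential(nn.Dropout(0.1), nn.Linear(768, 2))

cls_vec = last_hidden[:, 0, :]   # [CLS] is always the first token
logits = head(cls_vec)           # (4, 2) — fed to cross-entropy during fine-tuning
print(logits.shape)              # torch.Size([4, 2])
```

Fine-tuning updates both this head (from scratch) and the pretrained encoder weights (gently, hence the small learning rate above).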

Encoder vs decoder models

| Architecture | Examples | Attention pattern | Best for | Can generate text? |
|---|---|---|---|---|
| Encoder-only (BERT family) | BERT, RoBERTa, DeBERTa, ALBERT | Fully bidirectional — each token sees all others | Classification, NER, extractive QA, embeddings | ❌ No |
| Decoder-only (GPT family) | GPT-4, LLaMA 3, Mistral, Claude | Causal — each token sees only past tokens | Text generation, chat, code, instruction following | ✅ Yes |
| Encoder-decoder (T5/BART family) | T5, BART, FLAN-T5, mT5 | Bidirectional encoder + causal decoder | Summarization, translation, abstractive QA | ✅ Yes |
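The attention-pattern column is the core architectural difference. A small sketch of the two masks in plain torch (illustrative only, not tied to any particular model's implementation):

```python
import torch

seq_len = 5

# Encoder-only (BERT): every token may attend to every other token
bidirectional = torch.ones(seq_len, seq_len, dtype=torch.bool)

# Decoder-only (GPT): token i may attend only to positions j <= i
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# In attention, disallowed positions are set to -inf before the softmax,
# so they receive exactly zero probability
scores = torch.randn(seq_len, seq_len)
probs = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)
# each row of probs sums to 1, with zero mass on "future" positions
```

The bidirectional mask is why BERT can use right context for understanding, and also why it cannot generate text autoregressively: every token already saw the whole sequence.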

Why decoder-only models won

GPT-style decoder-only models scaled better. With enough data and compute, they can handle understanding tasks too (via prompting), making separate encoder models less necessary. By 2023, instruction-tuned decoder-only LLMs (GPT-4, Claude, LLaMA) outperformed BERT on most NLU tasks, despite BERT's architectural advantage for bidirectional understanding.

Sentence embeddings from BERT

Vanilla BERT produces poor sentence embeddings — averaging token vectors results in anisotropic representations (all embeddings clustered in a narrow cone). Sentence-BERT (SBERT) fixes this:

SBERT sentence embeddings for semantic similarity and RAG

from sentence_transformers import SentenceTransformer, util
import torch

# Pretrained SBERT model fine-tuned specifically for semantic similarity
model = SentenceTransformer("all-MiniLM-L6-v2")  # 22M params, fast, great quality

# Encode sentences to fixed 384-dim vectors
sentences = [
    "What is the capital of France?",
    "Paris is the capital city of France.",
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)  # (4, 384)

# Semantic similarity (cosine)
sims = util.cos_sim(embeddings, embeddings)
print("Q vs Answer 1:", sims[0][1].item())   # → ~0.85 (high — semantically related)
print("Q vs Answer 2:", sims[0][2].item())   # → ~0.62 (medium)
print("Q vs Code:", sims[0][3].item())        # → ~0.10 (low — unrelated)

# For RAG: encode your document chunks once, store in vector DB
# At query time: encode query, find top-k by cosine similarity
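The query-time step described in the comments above can be sketched with plain tensor ops. Random unit vectors stand in for real `model.encode` outputs here, so the retrieved indices are meaningless; only the mechanics are the point:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for real embeddings (in practice: model.encode on chunks / query)
doc_embeddings = F.normalize(torch.randn(1000, 384), dim=-1)  # 1000 chunks, 384-dim
query = F.normalize(torch.randn(1, 384), dim=-1)

# On unit-normalized vectors, cosine similarity is just a dot product
scores = query @ doc_embeddings.T          # (1, 1000) similarity scores
top = torch.topk(scores, k=5, dim=-1)      # 5 most similar chunks
print(top.indices[0].tolist())             # chunk indices to feed to the LLM
```

A vector database (FAISS, Qdrant, etc.) does the same computation with an approximate index so it stays fast at millions of chunks.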

Best embedding models in 2025

For English: text-embedding-3-large (OpenAI, 3072 dim), voyage-3 (Voyage AI), or all-mpnet-base-v2 (open source, 768 dim). For multilingual: multilingual-e5-large or e5-mistral-7b-instruct (open source). For code: CodeBERT, UniXcoder. For production RAG: voyage-3 and text-embedding-3-large consistently top the MTEB leaderboard.

Practice questions

  1. What is masked language modeling (MLM) and how does it pre-train BERT? (Answer: MLM: randomly mask 15% of input tokens ([MASK]), train the model to predict the original token from context. Unlike causal LM (GPT), which only attends to previous tokens, MLM gives BERT bidirectional context — each token attends to all tokens. This enables BERT to learn deep bidirectional representations: predicting [MASK] in 'The [MASK] sat on the mat' uses both 'The' and 'on the mat'. The 15% masking: 80% replaced with [MASK], 10% random word, 10% unchanged — prevents the model from only learning [MASK] tokens.)
  2. What is BERT's next sentence prediction (NSP) task and why was it later found to be unhelpful? (Answer: NSP: train BERT to classify whether sentence B follows sentence A in the original text (50% true pairs, 50% random). Intended to improve sentence-pair tasks (NLI, QA). RoBERTa (2019) ablated NSP and found removing it IMPROVED performance on most benchmarks. NSP was too easy — the model learned topic mismatch rather than real discourse understanding. Random sentence pairs often have different topics, making classification trivial. Modern BERT variants (RoBERTa, DeBERTa) drop NSP, training longer on MLM only.)
  3. What is the difference between BERT-base and BERT-large, and when would you choose each? (Answer: BERT-base: 12 layers, 12 heads, 768 d_model, 110M parameters. BERT-large: 24 layers, 16 heads, 1024 d_model, 340M parameters. BERT-large scores ~2–4 points higher on GLUE benchmarks. BERT-base: preferred for production (2-3× faster inference, 3× less memory). BERT-large: preferred for research or when maximum accuracy matters and you have GPU budget. For most production NLP tasks (NER, classification, sentence similarity): BERT-base fine-tuned on domain data outperforms BERT-large on general data.)
  4. What is the [CLS] token in BERT and how is it used for classification? (Answer: [CLS] (classification token): prepended to every input sequence. During pre-training, the hidden state of [CLS] at the final layer is used to predict NSP — so BERT is trained to aggregate sequence-level information into [CLS]. For downstream classification fine-tuning: take the [CLS] final hidden state (768-dim vector), add a linear classifier head on top, and fine-tune on labeled data. [CLS] acts as a sequence summary vector. Alternative: mean-pool all token embeddings — often performs similarly or better for sentence similarity tasks (Sentence-BERT).)
  5. What is DeBERTa and what two innovations improved on BERT's attention? (Answer: DeBERTa (He et al., Microsoft 2020): (1) Disentangled attention: content and position embeddings are separate — attention between tokens computed using 4 terms: content-to-content, content-to-position, position-to-content, position-to-position. Better position-aware representations. (2) Enhanced mask decoder (EMD): uses absolute position information in the final decoding layers while using relative position in attention. DeBERTa-v3 (186M params) outperforms BERT-large (340M) and RoBERTa-large (355M) on GLUE/SuperGLUE while being smaller and faster.)
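The mean-pooling alternative mentioned in question 4 has one subtlety: padding tokens must be excluded from the average. A sketch with dummy tensors standing in for real BERT outputs (shapes assumed):

```python
import torch

# Dummy encoder output: (batch=2, seq_len=6, hidden=768)
last_hidden = torch.randn(2, 6, 768)
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0],   # sequence 1 has 2 pad tokens
                               [1, 1, 1, 1, 1, 1]])  # sequence 2 has none

mask = attention_mask.unsqueeze(-1).float()          # (2, 6, 1) for broadcasting
summed = (last_hidden * mask).sum(dim=1)             # zero out padding, sum tokens
counts = mask.sum(dim=1).clamp(min=1e-9)             # number of real tokens per row
sentence_embeddings = summed / counts                # (2, 768) masked mean
print(sentence_embeddings.shape)                     # torch.Size([2, 768])
```

This masked mean is what sentence-transformers models typically use as their pooling layer, in place of the [CLS] vector.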

