The GPT (Generative Pre-trained Transformer) family uses a decoder-only transformer architecture with causal (unidirectional) self-attention. Models are pre-trained on next-token prediction over massive text corpora, then fine-tuned for downstream tasks. GPT-1 (2018) introduced the paradigm; GPT-2 (2019) demonstrated fluent long-form generation; GPT-3 (2020, 175B parameters) showed few-shot in-context learning; GPT-4 (2023) reached near-human performance on many benchmarks. The same architecture powers Claude (Anthropic), Llama (Meta), Gemini (Google), and Mistral; decoder-only transformers are the dominant paradigm for foundation models.
GPT architecture vs BERT architecture
| Property | GPT (Decoder-only) | BERT (Encoder-only) |
|---|---|---|
| Attention pattern | Causal (triangular) — can only see past tokens | Bidirectional — sees all tokens |
| Pre-training objective | Causal LM: predict next token, P(wₜ ∣ w₁..wₜ₋₁) | Masked LM: predict 15% masked tokens |
| Token representation | Each token sees only left context | Each token sees full sentence context |
| Good for | Generation, completion, chat, code | Classification, NER, QA, understanding |
| Output layer | Vocabulary head → next token probabilities | Task-specific head (classifier, span predictor) |
| Examples | GPT-4, Claude, Llama-3, Gemini, Mistral | BERT, RoBERTa, DistilBERT, ELECTRA |
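The causal attention pattern in the table above can be made concrete with a small sketch. This is a plain-Python illustration of the lower-triangular mask a decoder-only model applies (real implementations build it as a tensor, e.g. with `torch.tril`):

```python
def causal_mask(seq_len: int) -> list[list[int]]:
    """1 = position may be attended to, 0 = masked-out future position."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

for row in causal_mask(4):
    print(row)
# Row i has ones only in columns 0..i: token i never attends to tokens after it,
# which is exactly what makes left-to-right generation possible.
```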
GPT-2 text generation with sampling strategies
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
# ── Manual generation: greedy decoding ──
def greedy_generate(prompt: str, max_new: int = 30) -> str:
    inputs = tokenizer(prompt, return_tensors='pt')
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new,
            do_sample=False,  # Greedy: always pick the highest-probability token
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
# ── Sampling strategies ──
def sample_generate(prompt: str, strategy: str = 'top_p') -> str:
    inputs = tokenizer(prompt, return_tensors='pt')
    with torch.no_grad():
        if strategy == 'top_k':
            output_ids = model.generate(
                **inputs, max_new_tokens=50, do_sample=True,
                top_k=50,        # Sample only from the 50 highest-probability tokens
                temperature=0.8,
                pad_token_id=tokenizer.eos_token_id)
        elif strategy == 'top_p':
            output_ids = model.generate(
                **inputs, max_new_tokens=50, do_sample=True,
                top_p=0.92,      # Nucleus sampling: smallest set covering 92% probability mass
                temperature=0.8,
                pad_token_id=tokenizer.eos_token_id)
        elif strategy == 'beam':
            output_ids = model.generate(
                **inputs, max_new_tokens=50,
                num_beams=5,     # Beam search: track the 5 best sequences
                early_stopping=True,
                pad_token_id=tokenizer.eos_token_id)
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
prompt = "The history of artificial intelligence began"
print("Greedy:", greedy_generate(prompt, 20))
print("Top-K: ", sample_generate(prompt, 'top_k'))
print("Top-p: ", sample_generate(prompt, 'top_p'))
print("Beam: ", sample_generate(prompt, 'beam'))
# ── Logits and next-token distribution ──
inputs = tokenizer("The capital of France is", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Probability distribution over the next token
next_token_logits = logits[0, -1, :]  # Prediction at the last position
probs = torch.softmax(next_token_logits, dim=-1)
top5 = probs.topk(5)
print("\nTop-5 next-token probabilities:")
for prob, idx in zip(top5.values, top5.indices):
    token = tokenizer.decode([idx])
    print(f"  {prob:.3f}: '{token}'")
# ' Paris' should have the highest probability
Scaling laws and the GPT evolution
| Model | Year | Parameters | Training data | Key capability |
|---|---|---|---|---|
| GPT-1 | 2018 | 117M | 7GB BooksCorpus | Introduced generative pre-training + fine-tuning |
| GPT-2 | 2019 | 1.5B | 40GB WebText | Coherent long-form text generation |
| GPT-3 | 2020 | 175B | 570GB filtered web | In-context few-shot learning, instruction following |
| InstructGPT | 2022 | 175B | GPT-3 + RLHF | Aligned, helpful assistant behaviour |
| ChatGPT/GPT-4 | 2022/2023 | >1T (est) | Undisclosed, multi-modal, massive scale | Near-human performance across many domains |
Emergent capabilities at scale
GPT-3 and larger models exhibit emergent capabilities — abilities that appeared unpredictably at scale and were not present in smaller models: multi-step arithmetic, code generation, analogical reasoning, and in-context few-shot learning. This is why the field shifted from task-specific models to scaling general-purpose LLMs.
Practice questions
- Why can GPT generate text but BERT cannot? (Answer: GPT uses causal attention — each token only attends to previous tokens, enabling left-to-right generation. At each step, GPT predicts the next token from all previous ones. BERT uses bidirectional attention that requires the full sequence — you cannot generate token-by-token because each token's representation depends on all future tokens.)
- What is top-p (nucleus) sampling and why is it preferred over top-k? (Answer: Top-p samples from the smallest vocabulary subset whose cumulative probability exceeds p (e.g., 0.9). The number of tokens considered varies dynamically — large for uncertain predictions, small for confident ones. Top-k always samples from k tokens regardless of confidence level — can include many low-probability tokens when k is large or be over-restrictive when k is small.)
- GPT-3 demonstrates "in-context few-shot learning." What does this mean? (Answer: You provide a few examples in the prompt (e.g., 2-3 input-output pairs) and GPT-3 generalises the pattern to new inputs — WITHOUT any gradient updates or fine-tuning. The model learns from context at inference time. This is qualitatively different from traditional ML which requires labeled training data.)
- What is temperature in LLM sampling and what happens with temperature=0? (Answer: Temperature scales logits before softmax: logits/T. T=1: standard distribution. T<1: sharper — model becomes more confident, less diverse. T=0: equivalent to greedy decoding (always picks highest probability token). T>1: flatter distribution, more random/creative. T=0 for factual tasks, T=0.7-1.0 for creative writing.)
- What is the difference between beam search and greedy decoding? (Answer: Greedy always picks the single highest-probability next token, a locally optimal choice that can miss a better overall sequence. Beam search tracks the top-k partial sequences simultaneously: at each step it expands all k beams and keeps the k best-scoring sequences overall. It finds higher-probability sequences at the cost of more computation; num_beams=5 is common.)
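The nucleus-sampling answer above can be sketched in a few lines of plain Python. This is a simplified illustration of how the candidate set is selected, not the HuggingFace implementation:

```python
import math

def top_p_filter(logits: list[float], p: float = 0.9) -> list[int]:
    """Return indices of the smallest set of tokens whose cumulative
    softmax probability reaches p (the sampling 'nucleus')."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:                            # accumulate until mass >= p
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept

# Confident distribution -> tiny nucleus; flat distribution -> large nucleus.
print(len(top_p_filter([10.0, 1.0, 1.0, 1.0], p=0.9)))  # 1
print(len(top_p_filter([1.0, 1.0, 1.0, 1.0], p=0.9)))   # 4
```

This dynamic behaviour is exactly why top-p is usually preferred over a fixed top-k.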
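A minimal sketch of the temperature-scaled softmax described in the temperature question above (plain Python, no framework assumed):

```python
import math

def softmax_with_temperature(logits: list[float], T: float) -> list[float]:
    """Softmax over logits/T: T<1 sharpens, T>1 flattens the distribution."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
for T in (0.5, 1.0, 2.0):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
# As T decreases, the probability of the arg-max token grows toward 1
# (T -> 0 recovers greedy decoding); as T grows, the distribution
# approaches uniform.
```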
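To make the beam-search answer concrete, here is a toy sketch over a hard-coded next-token table. The table and tokens are invented for illustration; a real LM would produce these log-probabilities with a forward pass. Note that greedy decoding would commit to "the" (0.6) and end with total probability 0.3, while beam search finds "a cat" (0.4 × 0.9 = 0.36):

```python
import math

# Toy next-token model: log-probabilities conditioned on the previous token.
LOGPROBS = {
    "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
    "the": {"cat": math.log(0.5), "dog": math.log(0.5)},
    "a":   {"cat": math.log(0.9), "dog": math.log(0.1)},
    "cat": {"</s>": 0.0},
    "dog": {"</s>": 0.0},
}

def beam_search(num_beams: int = 2, max_steps: int = 3):
    beams = [(["<s>"], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "</s>":          # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in LOGPROBS[seq[-1]].items():
                candidates.append((seq + [tok], score + lp))
        # Keep only the num_beams best-scoring sequences overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

best_seq, best_score = beam_search()[0]
print(" ".join(best_seq))  # <s> a cat </s>
```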
On LumiChats
Claude is a decoder-only transformer: every response is generated by predicting the next token, then the next, until the response is complete. On LumiChats, the sampling strategy, temperature, and beam width can be configured for different use cases: deterministic (temperature=0) for code, creative (temperature=0.9) for writing.
Try it free