The GPT (Generative Pre-trained Transformer) family uses a decoder-only transformer architecture with causal (unidirectional) self-attention. Models are pre-trained on next-token prediction over massive text corpora, then fine-tuned for downstream tasks. GPT-1 (2018) introduced the paradigm; GPT-2 (2019) demonstrated fluent long-form generation; GPT-3 (2020, 175B parameters) showed few-shot in-context learning; GPT-4 (2023) reached near-human performance on many benchmarks. The same architecture powers Claude (Anthropic), Llama (Meta), Gemini (Google), and Mistral; decoder-only transformers are the dominant paradigm for foundation models.
GPT architecture vs BERT architecture
| Property | GPT (Decoder-only) | BERT (Encoder-only) |
|---|---|---|
| Attention pattern | Causal (triangular) — can only see past tokens | Bidirectional — sees all tokens |
| Pre-training objective | Causal LM: predict next token, P(wₜ ∣ w₁..wₜ₋₁) | Masked LM: predict 15% masked tokens |
| Token representation | Each token sees only left context | Each token sees full sentence context |
| Good for | Generation, completion, chat, code | Classification, NER, QA, understanding |
| Output layer | Vocabulary head → next token probabilities | Task-specific head (classifier, span predictor) |
| Examples | GPT-4, Claude, Llama-3, Gemini, Mistral | BERT, RoBERTa, DistilBERT, ELECTRA |
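The causal attention pattern in the table above can be made concrete with a small sketch. This is a plain-Python illustration of the lower-triangular mask a decoder-only model applies (real implementations build it as a tensor, e.g. with `torch.tril`):

```python
def causal_mask(seq_len: int) -> list[list[int]]:
    """1 = position may be attended to, 0 = masked-out future position."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

for row in causal_mask(4):
    print(row)
# Row i has ones only in columns 0..i: token i never attends to tokens after it,
# which is exactly what makes left-to-right generation possible.
```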
GPT-2 text generation with sampling strategies
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
# ── Manual generation: greedy decoding ──
def greedy_generate(prompt: str, max_new: int = 30) -> str:
    inputs = tokenizer(prompt, return_tensors='pt')
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new,
            do_sample=False,  # Greedy: always pick the highest-probability token
            pad_token_id=tokenizer.eos_token_id
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
# ── Sampling strategies ──
def sample_generate(prompt: str, strategy: str = 'top_p') -> str:
    inputs = tokenizer(prompt, return_tensors='pt')
    with torch.no_grad():
        if strategy == 'top_k':
            output_ids = model.generate(
                **inputs, max_new_tokens=50, do_sample=True,
                top_k=50,        # Sample only from the 50 highest-probability tokens
                temperature=0.8,
                pad_token_id=tokenizer.eos_token_id)
        elif strategy == 'top_p':
            output_ids = model.generate(
                **inputs, max_new_tokens=50, do_sample=True,
                top_p=0.92,      # Nucleus sampling: smallest set covering 92% probability mass
                temperature=0.8,
                pad_token_id=tokenizer.eos_token_id)
        elif strategy == 'beam':
            output_ids = model.generate(
                **inputs, max_new_tokens=50,
                num_beams=5,     # Beam search: track the 5 best sequences
                early_stopping=True,
                pad_token_id=tokenizer.eos_token_id)
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
prompt = "The history of artificial intelligence began"
print("Greedy:", greedy_generate(prompt, 20))
print("Top-K: ", sample_generate(prompt, 'top_k'))
print("Top-p: ", sample_generate(prompt, 'top_p'))
print("Beam: ", sample_generate(prompt, 'beam'))
# ── Logits and next-token distribution ──
inputs = tokenizer("The capital of France is", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Probability distribution over the next token
next_token_logits = logits[0, -1, :]  # Prediction at the last position
probs = torch.softmax(next_token_logits, dim=-1)
top5 = probs.topk(5)
print("\nTop-5 next-token probabilities:")
for prob, idx in zip(top5.values, top5.indices):
    token = tokenizer.decode([idx])
    print(f"  {prob:.3f}: '{token}'")
# ' Paris' should have the highest probability
Scaling laws and the GPT evolution
| Model | Year | Parameters | Training data | Key capability |
|---|---|---|---|---|
| GPT-1 | 2018 | 117M | 7GB BooksCorpus | Introduced generative pre-training + fine-tuning |
| GPT-2 | 2019 | 1.5B | 40GB WebText | Coherent long-form text generation |
| GPT-3 | 2020 | 175B | 570GB filtered web | In-context few-shot learning, instruction following |
| InstructGPT | 2022 | 175B | GPT-3 + RLHF | Aligned, helpful assistant behaviour |
| ChatGPT/GPT-4 | 2022/2023 | >1T (est) | Undisclosed, multi-modal, massive scale | Near-human performance across many domains |
Emergent capabilities at scale
GPT-3 and larger models exhibit emergent capabilities — abilities that appeared unpredictably at scale and were not present in smaller models: multi-step arithmetic, code generation, analogical reasoning, and in-context few-shot learning. This is why the field shifted from task-specific models to scaling general-purpose LLMs.
Practice questions
- Why can GPT generate text but BERT cannot? (Answer: GPT uses causal attention — each token only attends to previous tokens, enabling left-to-right generation. At each step, GPT predicts the next token from all previous ones. BERT uses bidirectional attention that requires the full sequence — you cannot generate token-by-token because each token's representation depends on all future tokens.)
- What is top-p (nucleus) sampling and why is it preferred over top-k? (Answer: Top-p samples from the smallest vocabulary subset whose cumulative probability exceeds p (e.g., 0.9). The number of tokens considered varies dynamically — large for uncertain predictions, small for confident ones. Top-k always samples from k tokens regardless of confidence level — can include many low-probability tokens when k is large or be over-restrictive when k is small.)
- GPT-3 demonstrates "in-context few-shot learning." What does this mean? (Answer: You provide a few examples in the prompt (e.g., 2-3 input-output pairs) and GPT-3 generalises the pattern to new inputs — WITHOUT any gradient updates or fine-tuning. The model learns from context at inference time. This is qualitatively different from traditional ML which requires labeled training data.)
- What is temperature in LLM sampling and what happens with temperature=0? (Answer: Temperature scales logits before softmax: logits/T. T=1: standard distribution. T<1: sharper — model becomes more confident, less diverse. T=0: equivalent to greedy decoding (always picks highest probability token). T>1: flatter distribution, more random/creative. T=0 for factual tasks, T=0.7-1.0 for creative writing.)
- What is the difference between beam search and greedy decoding? (Answer: Greedy always picks the single highest-probability next token, a locally optimal choice that can miss a better overall sequence. Beam search tracks the top-k partial sequences simultaneously: at each step it expands all k beams and keeps the k best-scoring sequences overall. It finds higher-probability sequences at the cost of more computation; num_beams=5 is common.)
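The nucleus-sampling answer above can be sketched in a few lines of plain Python. This is a simplified illustration of how the candidate set is selected, not the HuggingFace implementation:

```python
import math

def top_p_filter(logits: list[float], p: float = 0.9) -> list[int]:
    """Return indices of the smallest set of tokens whose cumulative
    softmax probability reaches p (the sampling 'nucleus')."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:                            # accumulate until mass >= p
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept

# Confident distribution -> tiny nucleus; flat distribution -> large nucleus.
print(len(top_p_filter([10.0, 1.0, 1.0, 1.0], p=0.9)))  # 1
print(len(top_p_filter([1.0, 1.0, 1.0, 1.0], p=0.9)))   # 4
```

This dynamic behaviour is exactly why top-p is usually preferred over a fixed top-k.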
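A minimal sketch of the temperature-scaled softmax described in the temperature question above (plain Python, no framework assumed):

```python
import math

def softmax_with_temperature(logits: list[float], T: float) -> list[float]:
    """Softmax over logits/T: T<1 sharpens, T>1 flattens the distribution."""
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
for T in (0.5, 1.0, 2.0):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
# As T decreases, the probability of the arg-max token grows toward 1
# (T -> 0 recovers greedy decoding); as T grows, the distribution
# approaches uniform.
```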
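To make the beam-search answer concrete, here is a toy sketch over a hard-coded next-token table. The table and tokens are invented for illustration; a real LM would produce these log-probabilities with a forward pass. Note that greedy decoding would commit to "the" (0.6) and end with total probability 0.3, while beam search finds "a cat" (0.4 × 0.9 = 0.36):

```python
import math

# Toy next-token model: log-probabilities conditioned on the previous token.
LOGPROBS = {
    "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
    "the": {"cat": math.log(0.5), "dog": math.log(0.5)},
    "a":   {"cat": math.log(0.9), "dog": math.log(0.1)},
    "cat": {"</s>": 0.0},
    "dog": {"</s>": 0.0},
}

def beam_search(num_beams: int = 2, max_steps: int = 3):
    beams = [(["<s>"], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "</s>":          # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in LOGPROBS[seq[-1]].items():
                candidates.append((seq + [tok], score + lp))
        # Keep only the num_beams best-scoring sequences overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

best_seq, best_score = beam_search()[0]
print(" ".join(best_seq))  # <s> a cat </s>
```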
On LumiChats
Claude is a decoder-only transformer: every response is generated by predicting the next token, then the next, until the response is complete. On LumiChats, the sampling strategy, temperature, and beam width can be configured for different use cases: deterministic (temperature=0) for code, creative (temperature=0.9) for writing.
Try it free