A Large Language Model (LLM) is an AI system trained on massive amounts of text data to understand and generate human language. LLMs power tools like ChatGPT, Claude, Gemini, and LumiChats. They are built on the Transformer architecture and contain billions to trillions of learnable parameters.
How LLMs work
LLMs are trained with a deceptively simple objective: predict the next token. Given a sequence of tokens x₁, x₂, …, xₙ, the model computes the probability of each possible next token:
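Concretely, the model maps the context to a vector of logits z over the vocabulary V and normalizes it with a softmax (standard notation, not tied to any particular model):

```latex
P(x_{n+1} = w \mid x_1, \ldots, x_n) \;=\; \operatorname{softmax}(z)_w \;=\; \frac{e^{z_w}}{\sum_{v \in V} e^{z_v}}
```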
Cross-entropy training loss: the model is penalized for assigning low probability to the correct next token. Minimizing this over trillions of tokens teaches the model language structure, facts, and reasoning.
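Written out, the cross-entropy loss averaged over a corpus of N tokens is the negative log-likelihood the model assigns to each correct next token:

```latex
\mathcal{L}(\theta) \;=\; -\frac{1}{N}\sum_{n=1}^{N} \log P_\theta\!\left(x_{n+1} \mid x_1, \ldots, x_n\right)
```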
Over trillions of such predictions (GPT-3 was trained on ~300 billion tokens; LLaMA 3 on 15 trillion), the model's billions of parameters are adjusted via backpropagation with the Adam optimizer to reduce this loss, implicitly learning grammar, facts, reasoning, and code. At inference time, tokens are generated one at a time, each sampled from the predicted probability distribution.
Autoregressive generation: how LLMs produce text token by token
```python
import torch
import torch.nn.functional as F

def generate(model, tokenizer, prompt: str, max_new_tokens=50, temperature=1.0):
    """Minimal autoregressive generation loop — this is what every LLM inference does."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)
        # Take logits for the LAST token position only
        next_token_logits = logits[:, -1, :] / temperature  # (1, vocab_size)
        # Convert to probabilities and sample
        probs = F.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # (1, 1)
        # Append to sequence and continue
        input_ids = torch.cat([input_ids, next_token], dim=1)
        # Stop at end-of-sequence token
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```

Self-supervised learning
No human labels are needed — the training signal comes from the text itself. Every document on the internet is a self-labeled training example: the model predicts each word from the preceding words. This is why web-scale training is possible.
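A minimal sketch of how the self-labeling works (the integer token IDs are hypothetical): the targets are simply the input tokens shifted one position to the left.

```python
# Any token sequence yields (input, target) pairs for free:
# the target at each position is just the next token in the text.
tokens = [464, 3290, 3332, 319, 262, 2603]  # hypothetical token IDs for a sentence

inputs = tokens[:-1]   # the model sees tokens 0..n-1
targets = tokens[1:]   # and must predict tokens 1..n

for x, y in zip(inputs, targets):
    print(f"given ...{x}, predict {y}")
```

Every position in every document supplies one such training pair, which is what makes web-scale corpora usable without any annotation.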
What 'large' means: scaling laws
The 'large' in LLM refers to parameter count — but the relationship between scale and capability is governed by precise empirical laws. Kaplan et al. (OpenAI, 2020) showed that loss follows a power law in model size N, dataset size D, and compute C:
Kaplan scaling laws: loss decreases as a power law with model size and data size. αN ≈ 0.076, αD ≈ 0.095 from the original paper.
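In the form reported by Kaplan et al., with fitted constants N_c and D_c, the laws read:

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
\alpha_N \approx 0.076,\quad \alpha_D \approx 0.095
```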
Hoffmann et al. (DeepMind, 2022) — the 'Chinchilla' paper — showed the original laws were suboptimal: for a given compute budget, the optimal model is smaller and trained on more data than previously thought. The Chinchilla-optimal ratio: ~20 tokens of training data per parameter.
| Model | Parameters | Training tokens | Organization | Year |
|---|---|---|---|---|
| GPT-2 | 1.5B | 40B | OpenAI | 2019 |
| GPT-3 | 175B | 300B | OpenAI | 2020 |
| Chinchilla | 70B | 1.4T | DeepMind | 2022 |
| GPT-4 (est.) | ~1T (MoE) | ~13T | OpenAI | 2023 |
| LLaMA 3 | 8B / 70B / 405B | 15T | Meta | 2024 |
Chinchilla insight
A 70B model trained on 1.4T tokens (Chinchilla) outperformed a 280B model (Gopher) trained with the same compute budget on far fewer tokens. More data and a smaller model: same cost, better results. This insight drove the design of LLaMA 3 (15T tokens) and Mistral.
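The ~20 tokens-per-parameter rule can be turned into a back-of-the-envelope calculator. This is a rough sketch using the widely cited approximation C ≈ 6·N·D for training FLOPs together with the Chinchilla ratio D ≈ 20·N; the function name is illustrative.

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget into model size N and training tokens D
    using the approximations C ≈ 6·N·D and D ≈ tokens_per_param·N."""
    # Substitute D = tokens_per_param * N into C = 6 * N * D and solve for N
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's budget (~5.8e23 FLOPs) recovers its 70B / 1.4T design point
n, d = chinchilla_optimal(5.8e23)
print(f"optimal params: {n/1e9:.0f}B, tokens: {d/1e12:.1f}T")
```

Plugging in a larger budget shows why frontier labs now favor enormous token counts over ever-larger parameter counts.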
Why models differ
Not all LLMs are equal. Four axes explain most performance differences:
| Dimension | What it affects | Examples |
|---|---|---|
| Training data | Knowledge coverage, language support, code/math ability | Common Crawl, GitHub, arXiv, Wikipedia, books |
| Training compute | Model size, data volume — bounded by GPU budget | GPT-4 training: est. $50–100M. LLaMA 3 405B: est. $10M+ |
| Architecture choices | Context length, efficiency, specialization | MoE vs dense; RoPE vs ALiBi; GQA vs MHA |
| Post-training | Alignment, safety, instruction-following | RLHF, DPO, Constitutional AI, RLAIF |
These differences explain observed specializations: Claude excels at nuanced reasoning, instruction-following, and safety (Constitutional AI + RLHF). DeepSeek and Qwen lead on code and math (heavy GitHub + math dataset training). Gemini leads on multimodal and long-context tasks (1M+ token context, native video training). Mistral leads on efficiency (GQA, sliding window attention for smaller models).
Model card
Every serious LLM release includes a model card specifying training data sources, known limitations, intended use, and safety evaluations. Always check the model card before deploying in a production application.
The pretraining → post-training pipeline
Modern LLMs are built in two major phases with distinct objectives and costs:
- Pretraining: Self-supervised next-token prediction on a massive corpus (trillions of tokens). Trains for weeks to months on thousands of GPUs. Cost: $1M–$100M+. Produces a base model that can complete text but doesn't follow instructions or refuse harmful requests.
- Post-training: (a) Supervised Fine-Tuning (SFT) on (instruction, response) demonstration pairs — teaches instruction-following. (b) RLHF or DPO on human preference data — teaches helpfulness, safety, and alignment. Total cost: $1K–$10M. Produces the conversational model you interact with.
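To make the SFT step concrete, here is a sketch of a single (instruction, response) demonstration pair being formatted into one training string. The chat template shown is purely illustrative; every model family defines its own special tokens and template.

```python
# A single SFT demonstration pair (illustrative content)
example = {
    "instruction": "Explain what a context window is in one sentence.",
    "response": "The context window is the maximum number of tokens an LLM "
                "can attend to in a single forward pass.",
}

# Format into one training string with an illustrative chat template.
# The model is then trained with the same next-token objective as in
# pretraining, typically with the loss masked to the response tokens only.
CHAT_TEMPLATE = "<|user|>\n{instruction}\n<|assistant|>\n{response}<|eos|>"

training_text = CHAT_TEMPLATE.format(**example)
print(training_text)
```

Thousands of such pairs are enough to turn a base model into an instruction follower; the preference-tuning phase (RLHF/DPO) then refines which of several candidate responses the model prefers to produce.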
Pretraining data mix — how training corpora are assembled (LLaMA 3 style)
```python
# LLaMA 3 training data composition (approximate, based on Meta's paper)
TRAINING_MIX = {
    "general_web": {"weight": 0.50, "source": "Common Crawl (cleaned)"},
    "code": {"weight": 0.17, "source": "GitHub, Stack Overflow"},
    "books": {"weight": 0.10, "source": "Books, long-form text"},
    "scientific": {"weight": 0.08, "source": "arXiv, PubMed, academic papers"},
    "wikipedia": {"weight": 0.05, "source": "Wikipedia (multilingual)"},
    "math": {"weight": 0.05, "source": "Math Stack Exchange, proofs"},
    "multilingual": {"weight": 0.05, "source": "CC news in 30+ languages"},
}

total_tokens = 15e12  # 15 trillion tokens
for name, info in TRAINING_MIX.items():
    tokens = total_tokens * info["weight"]
    print(f"{name:20s}: {tokens/1e12:.2f}T tokens ({info['source']})")
```

Why data quality dominates
The biggest gains in recent models (Phi-3, Mistral, Qwen) come not from scale but from data curation. Phi-3-mini (3.8B parameters) matches GPT-3.5 quality by training on high-quality "textbook-style" synthetic data rather than raw web dumps.
Limitations
Understanding LLM limitations is essential for safe deployment:
| Limitation | Root cause | Mitigation |
|---|---|---|
| Knowledge cutoff | Training data has a fixed end date | RAG, web search tools, retrieval augmentation |
| Hallucination | Generates plausible text, not verified facts | RAG, citations, tool use, Constitutional AI |
| No persistent memory | Stateless — each conversation starts fresh | External memory systems, database-backed retrieval |
| Context window limits | Attention is O(n²) in compute/memory | Larger context models, RAG, summarization |
| Inference cost | Billions of FLOPs per token generated | Quantization, speculative decoding, caching |
| No real-time action | Cannot browse web, run code natively | Tool use, code interpreters, agentic frameworks |
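One of the table's mitigations for context window limits, sliding-window chunking, fits in a few lines. A minimal sketch, with a plain integer list standing in for a real tokenizer's output:

```python
def chunk_with_overlap(tokens: list, chunk_size: int, overlap: int) -> list:
    """Split a long token sequence into overlapping windows so that
    content spanning a chunk boundary still appears whole in one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

# Windows of 4 tokens, each sharing 2 tokens with its predecessor;
# the final chunk may be shorter than chunk_size.
chunks = chunk_with_overlap(list(range(10)), chunk_size=4, overlap=2)
print(chunks)
```

Each chunk is then processed (or embedded for RAG retrieval) independently, with the overlap preventing sentences at chunk boundaries from being split away from their context.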
The hallucination trap
LLMs cannot distinguish between "I know this" and "this sounds plausible." They generate fluent, confident text even when wrong. Always verify factual claims — especially statistics, citations, and specific dates — from primary sources.
Practice questions
- What is an LLM's context window and what happens when content exceeds it? (Answer: Context window = the maximum number of tokens the model can process in one forward pass (input + output combined). Claude 3.5: 200K tokens. GPT-4o: 128K. LLaMA 3.1: 128K. When content exceeds the limit: the API throws a context_length_exceeded error (you must chunk or summarize). Within the window: all tokens receive full attention — no information is lost. Common workarounds: sliding-window chunking with overlap, RAG (retrieve only relevant portions), or summarization of earlier context.)
- Why do LLMs sometimes confidently state false information (hallucinate)? (Answer: LLMs are trained to predict the next most probable token given context — not to verify factual accuracy. They have no internal fact-checking mechanism. Hallucination occurs when: (1) the fact was rare or absent in training data, (2) the model interpolates between related facts incorrectly, (3) the model confabulates plausible-sounding completions when uncertain. Mitigation: Retrieval-Augmented Generation (RAG) grounds responses in verified sources; Constitutional AI training reduces confident false statements.)
- What is the difference between a base model and an instruction-tuned model? (Answer: Base model (pretrained): trained to predict next tokens on massive text corpora — continues text in the style of its training data. Chatting with a base model: it completes your message as if continuing a document, not responding helpfully. Instruction-tuned (chat model): fine-tuned on (instruction, response) pairs + RLHF — trained to be a helpful assistant that follows directions. All deployed products (ChatGPT, Claude, Gemini) are instruction-tuned. Base models are used for research and further fine-tuning.)
- How do LLMs achieve few-shot learning from examples in the prompt without weight updates? (Answer: In-context learning: the model uses the examples in the prompt as implicit task specification. The transformer's attention mechanism can identify the pattern across examples and apply it to the new input. This works because during pretraining, the model saw countless documents that implicitly demonstrated tasks through examples. Few-shot ICL is thus a form of pattern matching using the model's pretrained world knowledge rather than gradient-based adaptation.)
- What is the difference between LLM fine-tuning and prompting for a specific task, and when should you use each? (Answer: Prompting: craft instructions/examples in the prompt to guide the model. Zero cost, instant to change, works well for tasks the model already has capability for. Fine-tuning: train model weights on task-specific data. Higher upfront cost, produces consistent behaviour, better for tasks requiring style/format consistency, domain-specific knowledge, or reliably following complex instructions. Rule of thumb: try prompting first. If you hit consistent failure modes despite good prompting, and you have 100+ quality examples, fine-tune.)
On LumiChats
LumiChats gives you access to 39+ LLMs — including GPT-4o, Claude Sonnet, Gemini Pro, DeepSeek V3, Qwen, and Mistral — in a single platform. Switching between models lets you use the best LLM for each specific task.
Try it free