Neural scaling laws are empirical power-law relationships between the performance of neural networks and the scale of their training — measured by model parameters (N), training tokens (D), and compute budget (C). Discovered by Kaplan et al. at OpenAI in 2020 and refined by Hoffmann et al. at DeepMind (the 'Chinchilla' paper) in 2022, scaling laws allow AI labs to predict model performance at larger scales without training full models — guiding decisions about how to allocate compute between model size and data volume. Scaling laws are the primary reason AI labs confidently invest hundreds of millions of dollars in training runs: they can predict the outcome before starting.
The Kaplan and Chinchilla scaling laws
Unified scaling law (Kaplan et al., 2020): loss L falls as a power law in model parameters N and training tokens D, roughly L(N) ∝ N^(−αN) and L(D) ∝ D^(−αD), with fitted exponents αN ≈ 0.076 and αD ≈ 0.095. Later work (Hoffmann et al., 2022) writes the combined law as L(N, D) = L∞ + A/N^α + B/D^β, where L∞ is the irreducible loss: the entropy floor of the text distribution itself, which no model can go below.
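As a sanity check, the parametric form can be evaluated directly. The constants below (L∞ ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28) are the approximate fits reported in Hoffmann et al. (2022), not exact values:

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    L_inf: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Parametric scaling law L(N, D) = L_inf + A/N^alpha + B/D^beta.

    Constants are the approximate fits reported by Hoffmann et al. (2022);
    the additive L_inf term is the irreducible loss floor.
    """
    return L_inf + A / n_params ** alpha + B / n_tokens ** beta

# Chinchilla itself (70B params, 1.4T tokens) lands near its reported loss:
print(chinchilla_loss(70e9, 1.4e12))   # roughly 1.94
```

Note how the two power-law terms shrink toward zero as N and D grow, so the loss approaches L∞ but never crosses it.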
The Kaplan laws suggested that at a fixed compute budget you should scale model size faster than data. The 2022 Chinchilla paper (Hoffmann et al.) showed this was wrong: the optimal allocation scales parameters and training tokens equally. The Chinchilla-optimal recipe: since training compute C ≈ 6ND, each 10× increase in compute should increase both parameters and tokens by √10 ≈ 3.16×. For the same total compute, a Chinchilla-optimally trained model achieves lower loss than a larger model trained on fewer tokens.
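A minimal sketch of that allocation rule, assuming the standard C ≈ 6·N·D compute approximation and the ~20-tokens-per-parameter Chinchilla heuristic (both rules of thumb, not exact constants):

```python
def chinchilla_allocation(compute_flop: float) -> tuple[float, float]:
    """Split a training budget C ~ 6*N*D so parameters and tokens
    scale together (D ~ 20*N, the Chinchilla heuristic)."""
    # C = 6 * N * (20 * N) = 120 * N**2  =>  N = sqrt(C / 120)
    n_params = (compute_flop / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's own budget of ~5.8e23 FLOP recovers its actual shape:
n, d = chinchilla_allocation(5.8e23)     # ~70B params, ~1.4T tokens
# 10x more compute scales both N and D by sqrt(10) ~ 3.16x:
n10, d10 = chinchilla_allocation(5.8e24)
```

Because C grows as N², doubling compute multiplies both N and D by √2, which is exactly the allocation asked about in the practice questions.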
| Model | Parameters | Training tokens | Compute (FLOP) | Chinchilla optimal? |
|---|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | 3.1×10²³ | No — undertrained (Kaplan recipe) |
| Chinchilla (2022) | 70B | 1.4T | 5.8×10²³ | Yes — set the standard |
| LLaMA 2 (2023) | 70B | 2T | ~8×10²³ | Over-trained (more tokens than Chinchilla-optimal) |
| LLaMA 3 (2024) | 70B / 405B | 15T / 15T | ~6×10²⁴ / ~4×10²⁵ | Intentionally over-trained for inference efficiency |
| GPT-4 (2023, estimated) | ~1T (MoE) | ~13T | ~2×10²⁵ | Approximately optimal for its compute budget |
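The dense-model rows in the compute column can be reproduced with the standard 6·N·D FLOP approximation (a rule of thumb that ignores attention FLOPs and does not apply directly to the sparse MoE GPT-4 row):

```python
def train_flop(n_params: float, n_tokens: float) -> float:
    # Rule of thumb: ~6 FLOPs per parameter per training token
    # (~2N for the forward pass, ~4N for the backward pass).
    return 6 * n_params * n_tokens

print(f"{train_flop(175e9, 300e9):.2e}")   # GPT-3:      3.15e+23
print(f"{train_flop(70e9, 1.4e12):.2e}")   # Chinchilla: 5.88e+23
```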
Why LLaMA 3 intentionally over-trains
Training efficiency and inference efficiency have different optima. Chinchilla-optimal means minimum loss for a given training compute, but the resulting model is large (many parameters) relative to its training data. For deployment, a smaller model trained on more data is cheaper per query: LLaMA 3 8B trained on 15T tokens (far more than Chinchilla-optimal) approaches the quality of a 70B Chinchilla-optimal model at roughly one-eighth the cost per inference request. Meta optimised for inference cost, not training efficiency: a deliberate engineering tradeoff.
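The tradeoff can be made concrete with the rough ~2·N FLOPs-per-generated-token rule for dense decoders (the request size below is an illustrative assumption, not a Meta figure):

```python
def inference_flop(n_params: float, n_generated_tokens: int) -> float:
    # Rough rule for dense decoders: ~2 FLOPs per parameter per
    # generated token (ignores attention and KV-cache overheads).
    return 2 * n_params * n_generated_tokens

# Hypothetical request averaging 1,000 generated tokens:
cost_8b = inference_flop(8e9, 1_000)
cost_70b = inference_flop(70e9, 1_000)
print(cost_70b / cost_8b)   # 8.75: the roughly 8x serving-cost gap above
```

Training cost is paid once; this per-request cost is paid on every query, which is why it dominates for products serving millions of users.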
Beyond simple scaling: 2026 developments
- Scaling plateaus: Multiple papers in 2025–2026 show that simple scaling laws are beginning to hit diminishing returns for pre-training loss on standard benchmarks. The frontier is exploring qualitative improvements (architecture changes, synthetic data, reasoning training) rather than brute-force scale.
- Emergent capability scaling: While average benchmark performance follows smooth power laws, specific capabilities emerge discontinuously (see Emergent Capabilities). This disconnect between the smooth scaling of training loss and discontinuous capability emergence remains poorly understood.
- Inference scaling: A 2024 finding by OpenAI showed that spending more compute at inference time (through extended chain-of-thought reasoning) can substitute for larger model size. This 'inference scaling' is the principle behind o3, o4-mini, and reasoning models — trading inference speed for quality.
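One simple form of inference scaling is self-consistency sampling: draw several candidate answers and take a majority vote, spending k× the inference compute on a single query without changing any weights. A toy sketch, where `sample_fn` is a stand-in for a real model call:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_fn: Callable[[], str], k: int) -> str:
    """Draw k samples from the model and return the most common answer.

    Inference compute scales linearly with k; model weights are unchanged.
    """
    answers = [sample_fn() for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in 'model' whose samples are noisy but right more often than wrong:
canned = iter(["42", "41", "42", "42", "39"])
print(self_consistency(lambda: next(canned), k=5))   # 42
```

Extended chain-of-thought reasoning is a different mechanism (more tokens per sample rather than more samples), but both trade inference compute for answer quality.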
Practice questions
- According to Chinchilla scaling laws, if you double your compute budget, how should you allocate the extra compute? (Answer: Chinchilla: optimal allocation is equal scaling of parameters N and training tokens D. Double compute → multiply both N and D by √2 ≈ 1.41×. For example, going from 70B/1.4T to 98B/2.0T is approximately Chinchilla-optimal. Doubling only parameters (undertrained) or only data (very large data for small model) is suboptimal.)
- GPT-3 used 300B training tokens for a 175B parameter model. According to Chinchilla, was this optimal? (Answer: No, severely undertrained. Chinchilla-optimal for 175B parameters requires approximately 3.5T training tokens (20 tokens per parameter). GPT-3 used only 300B tokens, under 9% of optimal. This is why Chinchilla 70B, trained on 1.4T tokens with less total compute, significantly outperformed GPT-3.)
- Why does LLaMA 3 intentionally train beyond Chinchilla-optimal (15T tokens for 70B params)? (Answer: Chinchilla-optimal minimises loss for a given TRAINING compute budget. But inference efficiency favours smaller, over-trained models: a 70B model trained on 15T tokens is cheaper to serve than a 200B model trained on 3T tokens that achieves the same loss. For products deployed at scale (millions of users), inference cost dominates. Over-training gives the best inference-time quality per parameter.)
- The irreducible loss L∞ in scaling laws represents what conceptually? (Answer: L∞ is the theoretical minimum loss achievable regardless of model size or training data — the Bayes-optimal loss for next-token prediction on the data distribution. It represents genuine unpredictability in language (ambiguity, random events, personal choices). Even a perfect model cannot predict these tokens. In practice, L∞ is estimated by fitting the power law to observed data.)
- Inference scaling laws (test-time compute scaling) suggest that more compute at inference time improves outputs. How does this differ from training scaling laws? (Answer: Training scaling laws: more compute during training (larger model or more data) improves the model permanently. Inference scaling: more compute at test time (more reasoning tokens, more sampled solutions, majority vote) improves outputs for that specific query without changing model weights. Reasoning models (o1, R1) exploit inference scaling — spending 10–100× more tokens per response than standard models to achieve better accuracy.)
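The arithmetic in the answers above checks out in a few lines, using the same ~20-tokens-per-parameter Chinchilla heuristic the answers assume:

```python
# Q1: doubling compute with equal N/D scaling multiplies each by sqrt(2).
assert abs(2 ** 0.5 - 1.414) < 0.001

# Q2: Chinchilla-optimal tokens for GPT-3's 175B params at ~20 tokens/param.
optimal_tokens = 20 * 175e9          # 3.5e12, i.e. 3.5T tokens
actual_tokens = 300e9
print(f"GPT-3 trained on {actual_tokens / optimal_tokens:.0%} of optimal")
# prints: GPT-3 trained on 9% of optimal
```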