Neural scaling laws are empirical power-law relationships between the performance of neural networks and the scale of their training — measured by model parameters (N), training tokens (D), and compute budget (C). Discovered by Kaplan et al. at OpenAI in 2020 and refined by Hoffmann et al. at DeepMind (the 'Chinchilla' paper) in 2022, scaling laws allow AI labs to predict model performance at larger scales without training full models — guiding decisions about how to allocate compute between model size and data volume. Scaling laws are the primary reason AI labs confidently invest hundreds of millions of dollars in training runs: they can predict the outcome before starting.
The Kaplan and Chinchilla scaling laws
Unified scaling law (Kaplan et al., 2020): loss L falls as a power law in model parameters N and training tokens D, roughly L(N) ∝ N^(−αN) and L(D) ∝ D^(−αD), with fitted exponents αN ≈ 0.076 and αD ≈ 0.095. Later work (Hoffmann et al., 2022) writes the combined law as L(N, D) = L∞ + A/N^α + B/D^β, where L∞ is the irreducible loss: the entropy floor of the text distribution itself, which no model can go below.
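As a sanity check, the parametric form can be evaluated directly. The constants below (L∞ ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28) are the approximate fits reported in Hoffmann et al. (2022), not exact values:

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    L_inf: float = 1.69, A: float = 406.4, B: float = 410.7,
                    alpha: float = 0.34, beta: float = 0.28) -> float:
    """Parametric scaling law L(N, D) = L_inf + A/N^alpha + B/D^beta.

    Constants are the approximate fits reported by Hoffmann et al. (2022);
    the additive L_inf term is the irreducible loss floor.
    """
    return L_inf + A / n_params ** alpha + B / n_tokens ** beta

# Chinchilla itself (70B params, 1.4T tokens) lands near its reported loss:
print(chinchilla_loss(70e9, 1.4e12))   # roughly 1.94
```

Note how the two power-law terms shrink toward zero as N and D grow, so the loss approaches L∞ but never crosses it.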
The Kaplan laws suggested that at a fixed compute budget you should scale model size faster than data. The 2022 Chinchilla paper (Hoffmann et al.) showed this was wrong: the optimal allocation scales parameters and training tokens equally. The Chinchilla-optimal recipe: since training compute C ≈ 6ND, each 10× increase in compute should increase both parameters and tokens by √10 ≈ 3.16×. For the same total compute, a Chinchilla-optimally trained model achieves lower loss than a larger model trained on fewer tokens.
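A minimal sketch of that allocation rule, assuming the standard C ≈ 6·N·D compute approximation and the ~20-tokens-per-parameter Chinchilla heuristic (both rules of thumb, not exact constants):

```python
def chinchilla_allocation(compute_flop: float) -> tuple[float, float]:
    """Split a training budget C ~ 6*N*D so parameters and tokens
    scale together (D ~ 20*N, the Chinchilla heuristic)."""
    # C = 6 * N * (20 * N) = 120 * N**2  =>  N = sqrt(C / 120)
    n_params = (compute_flop / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Chinchilla's own budget of ~5.8e23 FLOP recovers its actual shape:
n, d = chinchilla_allocation(5.8e23)     # ~70B params, ~1.4T tokens
# 10x more compute scales both N and D by sqrt(10) ~ 3.16x:
n10, d10 = chinchilla_allocation(5.8e24)
```

Because C grows as N², doubling compute multiplies both N and D by √2, which is exactly the allocation asked about in the practice questions.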
| Model | Parameters | Training tokens | Compute (FLOP) | Chinchilla optimal? |
|---|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | 3.1×10²³ | No — undertrained (Kaplan recipe) |
| Chinchilla (2022) | 70B | 1.4T | 5.8×10²³ | Yes — set the standard |
| LLaMA 2 (2023) | 70B | 2T | ~8×10²³ | Over-trained (more tokens than Chinchilla-optimal) |
| LLaMA 3 (2024) | 70B / 405B | 15T / 15T | ~6×10²⁴ / ~4×10²⁵ | Intentionally over-trained for inference efficiency |
| GPT-4 (2023, estimated) | ~1T (MoE) | ~13T | ~2×10²⁵ | Approximately optimal for its compute budget |
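The dense-model rows in the compute column can be reproduced with the standard 6·N·D FLOP approximation (a rule of thumb that ignores attention FLOPs and does not apply directly to the sparse MoE GPT-4 row):

```python
def train_flop(n_params: float, n_tokens: float) -> float:
    # Rule of thumb: ~6 FLOPs per parameter per training token
    # (~2N for the forward pass, ~4N for the backward pass).
    return 6 * n_params * n_tokens

print(f"{train_flop(175e9, 300e9):.2e}")   # GPT-3:      3.15e+23
print(f"{train_flop(70e9, 1.4e12):.2e}")   # Chinchilla: 5.88e+23
```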
Why LLaMA 3 intentionally over-trains
Training efficiency and inference efficiency have different optima. Chinchilla-optimal means minimum loss for a given training compute, but the resulting model is large (many parameters) relative to its training data. For deployment, a smaller model trained on more data is cheaper per query: LLaMA 3 8B trained on 15T tokens (far more than Chinchilla-optimal) approaches the quality of a 70B Chinchilla-optimal model at roughly one-eighth the cost per inference request. Meta optimised for inference cost, not training efficiency: a deliberate engineering tradeoff.
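The tradeoff can be made concrete with the rough ~2·N FLOPs-per-generated-token rule for dense decoders (the request size below is an illustrative assumption, not a Meta figure):

```python
def inference_flop(n_params: float, n_generated_tokens: int) -> float:
    # Rough rule for dense decoders: ~2 FLOPs per parameter per
    # generated token (ignores attention and KV-cache overheads).
    return 2 * n_params * n_generated_tokens

# Hypothetical request averaging 1,000 generated tokens:
cost_8b = inference_flop(8e9, 1_000)
cost_70b = inference_flop(70e9, 1_000)
print(cost_70b / cost_8b)   # 8.75: the roughly 8x serving-cost gap above
```

Training cost is paid once; this per-request cost is paid on every query, which is why it dominates for products serving millions of users.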
Beyond simple scaling: 2026 developments
- Scaling plateaus: Multiple papers in 2025–2026 show that simple scaling laws are beginning to hit diminishing returns for pre-training loss on standard benchmarks. The frontier is exploring qualitative improvements (architecture changes, synthetic data, reasoning training) rather than brute-force scale.
- Emergent capability scaling: While average benchmark performance follows smooth power laws, specific capabilities emerge discontinuously (see Emergent Capabilities). This disconnect between the smooth scaling of training loss and discontinuous capability emergence remains poorly understood.
- Inference scaling: A 2024 finding by OpenAI showed that spending more compute at inference time (through extended chain-of-thought reasoning) can substitute for larger model size. This 'inference scaling' is the principle behind o3, o4-mini, and reasoning models — trading inference speed for quality.
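One simple form of inference scaling is self-consistency sampling: draw several candidate answers and take a majority vote, spending k× the inference compute on a single query without changing any weights. A toy sketch, where `sample_fn` is a stand-in for a real model call:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_fn: Callable[[], str], k: int) -> str:
    """Draw k samples from the model and return the most common answer.

    Inference compute scales linearly with k; model weights are unchanged.
    """
    answers = [sample_fn() for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Stand-in 'model' whose samples are noisy but right more often than wrong:
canned = iter(["42", "41", "42", "42", "39"])
print(self_consistency(lambda: next(canned), k=5))   # 42
```

Extended chain-of-thought reasoning is a different mechanism (more tokens per sample rather than more samples), but both trade inference compute for answer quality.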
Practice questions
- According to Chinchilla scaling laws, if you double your compute budget, how should you allocate the extra compute? (Answer: Chinchilla: optimal allocation is equal scaling of parameters N and training tokens D. Double compute → multiply both N and D by √2 ≈ 1.41×. For example, going from 70B/1.4T to 98B/2.0T is approximately Chinchilla-optimal. Doubling only parameters (undertrained) or only data (very large data for small model) is suboptimal.)
- GPT-3 used 300B training tokens for a 175B parameter model. According to Chinchilla, was this optimal? (Answer: No, severely undertrained. Chinchilla-optimal for 175B parameters requires approximately 3.5T training tokens (20 tokens per parameter). GPT-3 used only 300B tokens, under 9% of optimal. This is why Chinchilla 70B, trained on 1.4T tokens with less total compute, significantly outperformed GPT-3.)
- Why does LLaMA 3 intentionally train beyond Chinchilla-optimal (15T tokens for 70B params)? (Answer: Chinchilla-optimal minimises loss for a given TRAINING compute budget. But inference efficiency favours smaller, over-trained models: a 70B model trained on 15T tokens is cheaper to serve than a 200B model trained on 3T tokens that achieves the same loss. For products deployed at scale (millions of users), inference cost dominates. Over-training gives the best inference-time quality per parameter.)
- The irreducible loss L∞ in scaling laws represents what conceptually? (Answer: L∞ is the theoretical minimum loss achievable regardless of model size or training data — the Bayes-optimal loss for next-token prediction on the data distribution. It represents genuine unpredictability in language (ambiguity, random events, personal choices). Even a perfect model cannot predict these tokens. In practice, L∞ is estimated by fitting the power law to observed data.)
- Inference scaling laws (test-time compute scaling) suggest that more compute at inference time improves outputs. How does this differ from training scaling laws? (Answer: Training scaling laws: more compute during training (larger model or more data) improves the model permanently. Inference scaling: more compute at test time (more reasoning tokens, more sampled solutions, majority vote) improves outputs for that specific query without changing model weights. Reasoning models (o1, R1) exploit inference scaling — spending 10–100× more tokens per response than standard models to achieve better accuracy.)
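The arithmetic in the answers above checks out in a few lines, using the same ~20-tokens-per-parameter Chinchilla heuristic the answers assume:

```python
# Q1: doubling compute with equal N/D scaling multiplies each by sqrt(2).
assert abs(2 ** 0.5 - 1.414) < 0.001

# Q2: Chinchilla-optimal tokens for GPT-3's 175B params at ~20 tokens/param.
optimal_tokens = 20 * 175e9          # 3.5e12, i.e. 3.5T tokens
actual_tokens = 300e9
print(f"GPT-3 trained on {actual_tokens / optimal_tokens:.0%} of optimal")
# prints: GPT-3 trained on 9% of optimal
```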