Reasoning models are a new class of large language models trained to generate extended internal reasoning chains (often called 'thinking' or 'scratchpad') before producing their final answer. Unlike standard LLMs that respond immediately, reasoning models like OpenAI's o1/o3, Anthropic's Claude 3.7 Sonnet Extended Thinking, and DeepSeek-R1 spend compute at inference time exploring multiple solution paths, self-correcting, and backtracking. They achieve state-of-the-art results on mathematics, coding, and scientific reasoning benchmarks.
How reasoning models differ from standard LLMs
| Property | Standard LLM (e.g. GPT-4o) | Reasoning Model (e.g. o3, R1) |
|---|---|---|
| Response process | Direct autoregressive generation — answer starts immediately | Extended internal "thinking" phase → then answer |
| Compute at inference | Fixed (scales with output length only) | Variable — more thinking budget = better answers on hard problems |
| Strengths | Speed; broad knowledge; instruction following; creative writing | Mathematics; competition programming; multi-step logic; scientific reasoning |
| Weaknesses | Struggle with problems requiring backtracking | Slow; expensive; overkill for most conversational tasks |
| Training method | Standard SFT + RLHF | RL training that rewards correct final answers — the model discovers thinking strategies |
| Visible thinking | No (one-shot output) | Often yes — thinking trace shown as a collapsible block |
Test-time compute scaling
Reasoning models introduced a new scaling axis: spending more compute at inference time improves accuracy on hard problems. This is separate from the usual training-time scaling (bigger model + more data). OpenAI showed that o3 with a high compute budget achieves scores on competition math that were out of reach for any previous model, regardless of size.
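The simplest way to see test-time compute scaling in action is best-of-n sampling with majority voting: draw more samples (more inference compute) and accuracy rises. The sketch below is a toy simulation, not any vendor's API — the per-sample accuracy of 40% and the answer space are made-up numbers purely for illustration.

```python
import random
from collections import Counter

def sample_answer(p_correct: float, rng: random.Random) -> int:
    """Simulate one model sample: return the correct answer (1) with
    probability p_correct, otherwise one of two wrong answers (2 or 3)."""
    if rng.random() < p_correct:
        return 1
    return rng.choice([2, 3])

def majority_vote(p_correct: float, n_samples: int, rng: random.Random) -> int:
    """Best-of-n: spend more inference compute by drawing n samples,
    then return the plurality answer."""
    votes = Counter(sample_answer(p_correct, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def accuracy(p_correct: float, n_samples: int, trials: int = 2000) -> float:
    """Estimate end-to-end accuracy of majority voting over many problems."""
    rng = random.Random(0)
    hits = sum(majority_vote(p_correct, n_samples, rng) == 1 for _ in range(trials))
    return hits / trials

# A model that is right only 40% of the time per sample gets markedly
# more reliable when it votes over 15 samples.
single = accuracy(0.40, 1)
voted = accuracy(0.40, 15)
```

Internal "thinking" is a more sophisticated use of the same budget — serial self-correction instead of parallel sampling — but both illustrate the same axis: more inference compute, higher accuracy.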
Training reasoning models: reinforcement learning on outcomes
Standard LLMs are trained to imitate correct outputs (supervised fine-tuning). Reasoning models are trained to achieve correct outcomes — the model gets a reward for correct final answers regardless of the reasoning path taken. This forces the model to discover effective reasoning strategies on its own, similar to how AlphaGo learned to play Go.
DeepSeek-R1's training paper showed that with pure RL (GRPO — Group Relative Policy Optimization) on math problems with verifiable answers, models spontaneously develop behaviors like self-verification, backtracking, and trying alternative approaches — none of which were explicitly programmed.
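The core of GRPO is computing advantages relative to a *group* of sampled completions for the same problem, rather than via a learned value network. A minimal sketch of that group-relative normalisation (the reward values are illustrative — 1.0 for a verified-correct final answer, 0.0 otherwise):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalise each sampled completion's
    reward against its own group's mean and standard deviation.
    No value network is needed -- the group itself is the baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

# Four completions sampled for one math problem; two reached the
# verified-correct answer. Correct completions get positive advantage
# (their tokens are reinforced), incorrect ones negative.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

The full algorithm then applies a clipped policy-gradient update weighted by these advantages; the point here is only that the training signal is derived from verifiable outcomes, with no supervision of the reasoning steps themselves.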
| Model | Organization | Thinking visibility | Training approach | Key benchmark (AIME 2024) |
|---|---|---|---|---|
| o1 | OpenAI | Hidden (thinking not shown) | RL on outcomes; proprietary method | 74.4% |
| o3 | OpenAI | Hidden | Scaled version of o1 training | 96.7% (high compute) |
| Claude 3.7 Sonnet (Extended Thinking) | Anthropic | Shown (collapsible) | Proprietary RL + RLHF hybrid | ~80% |
| DeepSeek-R1 | DeepSeek | Shown fully | GRPO pure RL + SFT distillation; open weights | 79.8% |
| QwQ-32B | Alibaba Qwen | Shown fully | RL fine-tune on Qwen2.5; open weights | 50.0% |
When to use reasoning models vs standard models
| Use case | Use reasoning model? | Why |
|---|---|---|
| Competition math / AIME / Olympiad problems | Yes — essential | Scores 3–5× higher than standard models |
| Complex multi-file code debugging | Yes | Multi-step logical inference; needs backtracking |
| Scientific paper analysis / PhD-level questions | Yes | GPQA scores ~25% higher than standard GPT-4o |
| Casual conversation / simple Q&A | No — overkill | Reasoning adds latency and cost with no benefit |
| Creative writing / brainstorming | No | Extended thinking doesn't help; standard models are better |
| Real-time chat / customer support | No | Reasoning models have 10–60s latency; unacceptable for live chat |
| Coding interview problems (LeetCode Hard) | Yes | o3-mini matches top human performance on competitive programming |
| Data analysis with complex logic | Maybe | Use reasoning if the logic chains are 5+ steps; otherwise standard is fine |
Hybrid approach
Many production systems use a router: simple requests go to fast standard models (GPT-4o mini, Claude Haiku), complex requests are routed to reasoning models (o3-mini, R1). This balances cost and latency against reasoning quality. Tools like LangChain and LiteLLM support routing based on task complexity.
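A router can be as simple as a heuristic gate in front of two model endpoints. The sketch below is a toy illustration — the keyword signals, length threshold, and model labels are assumptions, not a production recipe (real systems often use a small classifier model instead):

```python
def route(query: str) -> str:
    """Toy complexity router: cheap heuristics decide whether a query
    goes to a fast standard model or a slower reasoning model.
    Signals and thresholds here are illustrative assumptions."""
    reasoning_signals = ("prove", "debug", "step by step", "olympiad", "derive")
    q = query.lower()
    if len(q.split()) > 60 or any(s in q for s in reasoning_signals):
        return "reasoning-model"   # e.g. o3-mini or R1
    return "standard-model"        # e.g. GPT-4o mini or Claude Haiku
```

In practice the router's own cost must stay negligible relative to the calls it saves, which is why cheap heuristics or tiny classifiers are preferred over asking a large model to triage.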
Practice questions
- What is the trade-off when using a reasoning model (o3, R1) vs a standard LLM for a simple factual question? (Answer: Reasoning models are slower and more expensive: they generate 500–2000 thinking tokens before the final answer. For a simple question like 'What is the capital of France?' this is wasteful — a standard model answers correctly in 5 tokens. Reasoning models are valuable for: multi-step math, complex coding, logical puzzles, scientific analysis. Deployment best practice: route simple queries to fast standard models, complex reasoning queries to reasoning models.)
- What is 'extended thinking' in Claude 3.7 Sonnet and how does it differ from standard chain-of-thought prompting? (Answer: Standard CoT: the model generates visible reasoning steps as part of the response — the reasoning appears in the final output. Extended thinking: Claude generates internal reasoning tokens (returned as separate thinking blocks in the API) that are not shown to the user by default but inform the final answer. The thinking is genuinely internal computation — the model can explore dead ends, backtrack, and self-correct in ways that would look odd in a visible response.)
- DeepSeek-R1 showed reasoning emerges from RL on math problems without supervision of reasoning steps. What is the key insight? (Answer: You do not need human demonstrations of reasoning chains (expensive to collect). You only need verifiable final answers (cheap: just check whether the math answer is correct). GRPO training on correct/incorrect signals causes the model to spontaneously develop internal reasoning strategies that maximise the reward. The model discovers that longer, structured thinking leads to more correct answers — without ever being taught what 'reasoning' looks like.)
- What is the accuracy-compute trade-off in reasoning models and how do you optimise it? (Answer: Reasoning models spend more compute (tokens) per question to achieve higher accuracy. The relationship is approximately log-linear: doubling thinking tokens gives diminishing accuracy improvements. Optimise by: (1) Setting a thinking budget (max tokens for reasoning). (2) Routing — use reasoning models only for high-stakes queries. (3) For production: measure accuracy vs compute per query type, find the Pareto-optimal point. OpenAI offers o1-mini for faster/cheaper reasoning than o1-full.)
- Why do reasoning models sometimes 'overthink' simple problems? (Answer: Reasoning models are trained to think before answering — this becomes a strong prior even when unnecessary. For straightforward problems, the extended thinking may introduce errors by considering irrelevant alternative interpretations or second-guessing correct first answers. Studies show reasoning models sometimes perform WORSE than standard models on easy questions. This is an active alignment problem: teaching models to calibrate the amount of thinking to problem complexity.)
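The budget-setting advice in the questions above can be made concrete with a toy model of the accuracy-compute trade-off. Everything numeric here is an assumption for illustration — the log-linear curve shape matches the description above, but the base accuracy, per-doubling gain, token price, and value of a correct answer are invented constants you would replace with measurements per query type:

```python
import math

def accuracy_at(budget: int, base: float = 0.50, gain: float = 0.06,
                cap: float = 0.95) -> float:
    """Illustrative log-linear accuracy curve: each doubling of the
    thinking-token budget (from a 256-token floor) adds `gain` accuracy,
    saturating at `cap`. All constants are assumptions, not measurements."""
    return min(cap, base + gain * math.log2(budget / 256))

def best_budget(budgets: list[int], token_price: float = 1e-5,
                value_of_correct: float = 0.05) -> int:
    """Pick the thinking budget maximising expected value of a correct
    answer minus the cost of the thinking tokens spent."""
    return max(budgets,
               key=lambda b: accuracy_at(b) * value_of_correct - b * token_price)

# Diminishing returns mean the optimum is not the largest budget.
choice = best_budget([256, 512, 1024, 2048, 4096])
```

The same calculation, run per query type with measured accuracy curves and real prices, is how you locate the Pareto-optimal point the answer above refers to.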
On LumiChats
LumiChats gives you access to reasoning models including Claude's Extended Thinking mode for hard problems — toggle it on when you need deep mathematical reasoning or multi-step debugging.
Try it free