Reasoning models are a class of large language models trained to perform extended chain-of-thought reasoning before producing a final answer. OpenAI's o1, first released in preview in September 2024, was the first widely deployed reasoning model: OpenAI reported that it scored 83% on a 2024 qualifying exam for the International Mathematics Olympiad, compared to 13% for GPT-4o. DeepSeek R1 (January 2025) reached comparable performance as an openly released, open-weights model, setting off a wave of reasoning model development across the industry.
How reasoning models are trained: GRPO and process reward models
Standard LLMs are trained to predict the next token. Reasoning models are additionally trained with reinforcement learning to maximize the correctness of final answers, and in the process the model learns to use its context window as a scratchpad. OpenAI uses a proprietary training process; DeepSeek R1 uses Group Relative Policy Optimization (GRPO), which eliminates the need for a separate critic model by using the average reward within a group of responses sampled for the same prompt as the baseline.
GRPO objective: the advantage A_i of response i is computed relative to the average reward of its group rather than from a learned value function. This removes the critic network entirely, which can reduce training memory by roughly 50% compared to standard PPO, where the critic is typically about as large as the policy.
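The group-relative advantage described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's training code: the function name `group_relative_advantages`, the `eps` constant, and the example rewards are all assumptions made for this sketch. Dividing by the group's standard deviation follows the normalization used in the published GRPO formulation; using the population standard deviation here is a choice made for simplicity.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per sampled response.

    The baseline is the mean reward of the group, so no learned
    value function (critic) is needed. `eps` is a hypothetical
    constant that avoids division by zero when all rewards match.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)  # population std of the group (assumption)
    return [(r - baseline) / (spread + eps) for r in rewards]

# Four responses sampled for the same prompt; reward is 1.0 when the
# final answer is correct and 0.0 otherwise (illustrative values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct responses end up with positive advantages and incorrect ones with negative advantages, so the policy update pushes probability toward the reasoning traces that led to right answers.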
| Model | Creator | AIME 2024 | MATH-500 | SWE-bench Verified | Open weights? |
|---|---|---|---|---|---|
| o1 | OpenAI | 74.4% | 96.4% | 48.9% | No |
| o3-mini | OpenAI | 90.0% | 97.9% | 49.3% | No |
| DeepSeek R1 | DeepSeek | 79.8% | 97.3% | 49.2% | Yes |
| Claude 3.7 (thinking) | Anthropic | ~80% | ~97% | 70.3% | No |
| Gemini 2.5 Pro | Google | 92.0% | 97.9% | Unreported | No |
When to use a reasoning model vs a standard model
- Use reasoning models for: math problems, formal proofs, multi-step coding tasks, complex logic puzzles, scientific analysis
- Use standard models for: writing, summarization, simple Q&A, translation, classification (tasks where extended thinking wastes time and money)
- Reasoning models are typically 5–20x more expensive and 5–10x slower than comparable standard models
- The 'thinking' tokens are often hidden from users but are still billed, so they count toward your token bill
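To make the billing point concrete, here is a rough cost sketch. Everything in it is an assumption for illustration: the function name `request_cost_usd`, the token counts, and the per-million-token prices are hypothetical, and it assumes thinking tokens are billed at the output-token rate, which is provider-specific and worth checking against your provider's pricing page.

```python
def request_cost_usd(input_tokens, visible_output_tokens, thinking_tokens,
                     input_price_per_mtok, output_price_per_mtok):
    """Estimate the cost of one request in USD.

    Assumes hidden thinking tokens are billed like output tokens.
    Prices are per one million tokens.
    """
    billed_output = visible_output_tokens + thinking_tokens
    total = (input_tokens * input_price_per_mtok
             + billed_output * output_price_per_mtok)
    return total / 1_000_000

# Hypothetical prices: $1 per million input tokens, $4 per million output tokens.
with_thinking = request_cost_usd(1_000, 500, 4_000, 1.0, 4.0)
without_thinking = request_cost_usd(1_000, 500, 0, 1.0, 4.0)
cost_ratio = with_thinking / without_thinking
```

With these made-up numbers, the 4,000 hidden thinking tokens account for most of the bill even though the user only sees 500 tokens of output.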
Practical rule
If a task could be solved by a smart person in 30 seconds, use a standard model. If it would take a PhD student 30 minutes of focused work, use a reasoning model.