Reasoning models are a new class of large language models trained to generate extended internal reasoning chains (often called 'thinking' or 'scratchpad') before producing their final answer. Unlike standard LLMs that respond immediately, reasoning models like OpenAI's o1/o3, Anthropic's Claude 3.7 Sonnet Extended Thinking, and DeepSeek-R1 spend compute at inference time exploring multiple solution paths, self-correcting, and backtracking. They achieve state-of-the-art results on mathematics, coding, and scientific reasoning benchmarks.
How reasoning models differ from standard LLMs
| Property | Standard LLM (e.g. GPT-4o) | Reasoning Model (e.g. o3, R1) |
|---|---|---|
| Response process | Direct autoregressive generation — answer starts immediately | Extended internal "thinking" phase → then answer |
| Compute at inference | Fixed (scales with output length only) | Variable — more thinking budget = better answers on hard problems |
| Strengths | Speed; broad knowledge; instruction following; creative writing | Mathematics; competition programming; multi-step logic; scientific reasoning |
| Weaknesses | Struggle with problems requiring backtracking | Slow; expensive; overkill for most conversational tasks |
| Training method | Standard SFT + RLHF | RL training that rewards correct final answers — the model discovers thinking strategies |
| Visible thinking | No (one-shot output) | Often yes — thinking trace shown as a collapsible block |
Test-time compute scaling
Reasoning models introduced a new scaling axis: spending more compute at inference time improves accuracy on hard problems. This is separate from the usual training-time scaling (bigger model + more data). OpenAI showed that o3 with a high compute budget achieves scores on competition math that were out of reach for any previous model, regardless of size.
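The simplest way to see test-time compute scaling in action is best-of-n sampling with majority voting: draw more samples (more inference compute) and accuracy rises. The sketch below is a toy simulation, not any vendor's API — the per-sample accuracy of 40% and the answer space are made-up numbers purely for illustration.

```python
import random
from collections import Counter

def sample_answer(p_correct: float, rng: random.Random) -> int:
    """Simulate one model sample: return the correct answer (1) with
    probability p_correct, otherwise one of two wrong answers (2 or 3)."""
    if rng.random() < p_correct:
        return 1
    return rng.choice([2, 3])

def majority_vote(p_correct: float, n_samples: int, rng: random.Random) -> int:
    """Best-of-n: spend more inference compute by drawing n samples,
    then return the plurality answer."""
    votes = Counter(sample_answer(p_correct, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def accuracy(p_correct: float, n_samples: int, trials: int = 2000) -> float:
    """Estimate end-to-end accuracy of majority voting over many problems."""
    rng = random.Random(0)
    hits = sum(majority_vote(p_correct, n_samples, rng) == 1 for _ in range(trials))
    return hits / trials

# A model that is right only 40% of the time per sample gets markedly
# more reliable when it votes over 15 samples.
single = accuracy(0.40, 1)
voted = accuracy(0.40, 15)
```

Internal "thinking" is a more sophisticated use of the same budget — serial self-correction instead of parallel sampling — but both illustrate the same axis: more inference compute, higher accuracy.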
Training reasoning models: reinforcement learning on outcomes
Standard LLMs are trained to imitate correct outputs (supervised fine-tuning). Reasoning models are trained to achieve correct outcomes — the model gets a reward for correct final answers regardless of the reasoning path taken. This forces the model to discover effective reasoning strategies on its own, similar to how AlphaGo learned to play Go.
DeepSeek-R1's training paper showed that with pure RL (GRPO — Group Relative Policy Optimization) on math problems with verifiable answers, models spontaneously develop behaviors like self-verification, backtracking, and trying alternative approaches — none of which were explicitly programmed.
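The core of GRPO is computing advantages relative to a *group* of sampled completions for the same problem, rather than via a learned value network. A minimal sketch of that group-relative normalisation (the reward values are illustrative — 1.0 for a verified-correct final answer, 0.0 otherwise):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: normalise each sampled completion's
    reward against its own group's mean and standard deviation.
    No value network is needed -- the group itself is the baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

# Four completions sampled for one math problem; two reached the
# verified-correct answer. Correct completions get positive advantage
# (their tokens are reinforced), incorrect ones negative.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

The full algorithm then applies a clipped policy-gradient update weighted by these advantages; the point here is only that the training signal is derived from verifiable outcomes, with no supervision of the reasoning steps themselves.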
| Model | Organization | Thinking visibility | Training approach | Key benchmark (AIME 2024) |
|---|---|---|---|---|
| o1 | OpenAI | Hidden (thinking not shown) | RL on outcomes; proprietary method | 74.4% |
| o3 | OpenAI | Hidden | Scaled version of o1 training | 96.7% (high compute) |
| Claude 3.7 Sonnet (Extended Thinking) | Anthropic | Shown (collapsible) | Proprietary RL + RLHF hybrid | ~80% |
| DeepSeek-R1 | DeepSeek | Shown fully | GRPO pure RL + SFT distillation; open weights | 79.8% |
| QwQ-32B | Alibaba Qwen | Shown fully | RL fine-tune on Qwen2.5; open weights | 50.0% |
When to use reasoning models vs standard models
| Use case | Use reasoning model? | Why |
|---|---|---|
| Competition math / AIME / Olympiad problems | Yes — essential | Scores 3–5× higher than standard models |
| Complex multi-file code debugging | Yes | Multi-step logical inference; needs backtracking |
| Scientific paper analysis / PhD-level questions | Yes | GPQA scores ~25% higher than standard GPT-4o |
| Casual conversation / simple Q&A | No — overkill | Reasoning adds latency and cost with no benefit |
| Creative writing / brainstorming | No | Extended thinking doesn't help; standard models are better |
| Real-time chat / customer support | No | Reasoning models have 10–60s latency; unacceptable for live chat |
| Coding interview problems (LeetCode Hard) | Yes | o3-mini matches top human performance on competitive programming |
| Data analysis with complex logic | Maybe | Use reasoning if the logic chains are 5+ steps; otherwise standard is fine |
Hybrid approach
Many production systems use a router: simple requests go to fast standard models (GPT-4o mini, Claude Haiku), complex requests are routed to reasoning models (o3-mini, R1). This balances cost and latency against reasoning quality. Tools like LangChain and LiteLLM support routing based on task complexity.
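A router can be as simple as a heuristic gate in front of two model endpoints. The sketch below is a toy illustration — the keyword signals, length threshold, and model labels are assumptions, not a production recipe (real systems often use a small classifier model instead):

```python
def route(query: str) -> str:
    """Toy complexity router: cheap heuristics decide whether a query
    goes to a fast standard model or a slower reasoning model.
    Signals and thresholds here are illustrative assumptions."""
    reasoning_signals = ("prove", "debug", "step by step", "olympiad", "derive")
    q = query.lower()
    if len(q.split()) > 60 or any(s in q for s in reasoning_signals):
        return "reasoning-model"   # e.g. o3-mini or R1
    return "standard-model"        # e.g. GPT-4o mini or Claude Haiku
```

In practice the router's own cost must stay negligible relative to the calls it saves, which is why cheap heuristics or tiny classifiers are preferred over asking a large model to triage.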
Practice questions
- What is the trade-off when using a reasoning model (o3, R1) vs a standard LLM for a simple factual question? (Answer: Reasoning models are slower and more expensive: they generate 500–2000 thinking tokens before the final answer. For a simple question like 'What is the capital of France?' this is wasteful — a standard model answers correctly in 5 tokens. Reasoning models are valuable for: multi-step math, complex coding, logical puzzles, scientific analysis. Deployment best practice: route simple queries to fast standard models, complex reasoning queries to reasoning models.)
- What is 'extended thinking' in Claude 3.7 Sonnet and how does it differ from standard chain-of-thought prompting? (Answer: Standard CoT: the model generates visible reasoning steps as part of the response — the reasoning appears in the final output. Extended thinking: Claude generates internal reasoning tokens (returned as separate thinking blocks in the API) that are not shown to the user by default but inform the final answer. The thinking is genuinely internal computation — the model can explore dead ends, backtrack, and self-correct in ways that would look odd in a visible response.)
- DeepSeek-R1 showed reasoning emerges from RL on math problems without supervision of reasoning steps. What is the key insight? (Answer: You do not need human demonstrations of reasoning chains (expensive to collect). You only need verifiable final answers (cheap: just check whether the math answer is correct). GRPO training on correct/incorrect signals causes the model to spontaneously develop internal reasoning strategies that maximise the reward. The model discovers that longer, structured thinking leads to more correct answers — without ever being taught what 'reasoning' looks like.)
- What is the accuracy-compute trade-off in reasoning models and how do you optimise it? (Answer: Reasoning models spend more compute (tokens) per question to achieve higher accuracy. The relationship is approximately log-linear: doubling thinking tokens gives diminishing accuracy improvements. Optimise by: (1) Setting a thinking budget (max tokens for reasoning). (2) Routing — use reasoning models only for high-stakes queries. (3) For production: measure accuracy vs compute per query type, find the Pareto-optimal point. OpenAI offers o1-mini for faster/cheaper reasoning than o1-full.)
- Why do reasoning models sometimes 'overthink' simple problems? (Answer: Reasoning models are trained to think before answering — this becomes a strong prior even when unnecessary. For straightforward problems, the extended thinking may introduce errors by considering irrelevant alternative interpretations or second-guessing correct first answers. Studies show reasoning models sometimes perform WORSE than standard models on easy questions. This is an active alignment problem: teaching models to calibrate the amount of thinking to problem complexity.)
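The budget-setting advice in the questions above can be made concrete with a toy model of the accuracy-compute trade-off. Everything numeric here is an assumption for illustration — the log-linear curve shape matches the description above, but the base accuracy, per-doubling gain, token price, and value of a correct answer are invented constants you would replace with measurements per query type:

```python
import math

def accuracy_at(budget: int, base: float = 0.50, gain: float = 0.06,
                cap: float = 0.95) -> float:
    """Illustrative log-linear accuracy curve: each doubling of the
    thinking-token budget (from a 256-token floor) adds `gain` accuracy,
    saturating at `cap`. All constants are assumptions, not measurements."""
    return min(cap, base + gain * math.log2(budget / 256))

def best_budget(budgets: list[int], token_price: float = 1e-5,
                value_of_correct: float = 0.05) -> int:
    """Pick the thinking budget maximising expected value of a correct
    answer minus the cost of the thinking tokens spent."""
    return max(budgets,
               key=lambda b: accuracy_at(b) * value_of_correct - b * token_price)

# Diminishing returns mean the optimum is not the largest budget.
choice = best_budget([256, 512, 1024, 2048, 4096])
```

The same calculation, run per query type with measured accuracy curves and real prices, is how you locate the Pareto-optimal point the answer above refers to.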
On LumiChats
LumiChats gives you access to reasoning models including Claude's Extended Thinking mode for hard problems — toggle it on when you need deep mathematical reasoning or multi-step debugging.
Try it free