
Chain-of-Thought (CoT) Reasoning

Teaching AI to think before it answers.


Definition

Chain-of-Thought (CoT) prompting is a technique where language models are prompted or trained to generate intermediate reasoning steps before producing a final answer. Instead of jumping directly to an answer, the model thinks through the problem step-by-step. CoT dramatically improves performance on multi-step reasoning, mathematics, logic, and complex analytical tasks.

The original CoT discovery

Chain-of-thought prompting was introduced by Wei et al. (Google Brain, 2022) with a startling finding: providing few-shot examples with step-by-step reasoning dramatically improved LLM performance on math and reasoning benchmarks, with no weight updates at all. Kojima et al. (2022) then showed a zero-shot variant works too: simply appending "Let's think step by step" to the prompt recovers much of the gain.

| Model | GSM8K standard | GSM8K + CoT | Gain | Note |
|---|---|---|---|---|
| GPT-3 175B | 18% | 57% | +39pp | CoT only emerged in models >100B params |
| PaLM 540B | 17% | 56% | +39pp | Near GPT-3 level; scale drives CoT benefit |
| PaLM 2 | 80% | 91% | +11pp | Diminishing gains as base capability rises |
| GPT-4 | 87% | 92% | +5pp | Diminishing returns at frontier |
| GPT-4o + self-consistency | 92% | 97% | +5pp | Self-consistency further boosts hard problems |

Emergent capability

CoT showed no benefit for models under ~100B parameters — it only helps once the model is large enough to actually reason. This is an example of an "emergent capability": a behavior that appears suddenly at a scale threshold, not gradually. Smaller models that attempt CoT often produce fluent but meaningless or incorrect reasoning traces.

Zero-shot vs few-shot CoT

Three approaches to eliciting chain-of-thought reasoning, with different tradeoffs in setup cost vs performance.

A no-CoT baseline plus the three CoT approaches, using the OpenAI API, from simplest to most powerful:

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"

def ask(messages):
    return client.chat.completions.create(
        model=MODEL, messages=messages, temperature=0
    ).choices[0].message.content

PROBLEM = "If a train travels 120 km in 1.5 hours, then stops for 30 minutes, then travels another 90 km in 1 hour, what is its average speed for the entire journey including the stop?"

# ─── Approach 1: Standard (no CoT) ───────────────────────────────────────────
standard = ask([{"role": "user", "content": PROBLEM}])
# Often gives wrong answer: jumps to (120+90)/(1.5+1) = 84 km/h, forgetting the stop

# ─── Approach 2: Zero-shot CoT ───────────────────────────────────────────────
zero_shot_cot = ask([{"role": "user", "content": PROBLEM + "\n\nLet's think step by step."}])
# Model breaks down: total distance=210km, total time=1.5+0.5+1=3h, avg=70 km/h ✓

# ─── Approach 3: Few-shot CoT ─────────────────────────────────────────────────
few_shot_system = """Solve math problems by thinking step by step.
Show each calculation on its own line.
Clearly label: Total distance, Total time, Final answer."""

few_shot_messages = [
    {"role": "system",    "content": few_shot_system},
    {"role": "user",      "content": "A car goes 60 km in 1 hour, stops 15 min, goes 45 km in 45 min. Avg speed?"},
    {"role": "assistant", "content": "Total distance: 60 + 45 = 105 km\nTotal time: 1 + 0.25 + 0.75 = 2 hours\nAverage speed: 105 / 2 = 52.5 km/h"},
    {"role": "user",      "content": PROBLEM},
]
few_shot_cot = ask(few_shot_messages)
# Most reliable: follows demonstrated reasoning structure exactly

# ─── Approach 4: Self-consistency (best accuracy) ────────────────────────────
import re
from collections import Counter

def self_consistent_answer(problem: str, n_samples: int = 10) -> str:
    """Generate n independent CoT solutions and take majority vote."""
    answers = []
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": problem + "\n\nThink step by step, then give the final numeric answer."}],
            temperature=0.7,   # diversity needed for voting to help
        ).choices[0].message.content
        # Extract the last number mentioned as the answer
        nums = re.findall(r'\d+\.?\d*', resp)
        if nums:
            answers.append(nums[-1])

    if answers:
        most_common = Counter(answers).most_common(1)[0][0]
        return most_common
    return "No consensus"

answer = self_consistent_answer(PROBLEM, n_samples=15)
print(f"Self-consistent answer: {answer} km/h")  # → 70

Self-consistency tradeoff

Self-consistency (majority vote over 10–40 samples) reliably adds 5–10 percentage points of accuracy on hard benchmarks but multiplies API cost by N. Use it when the task is high-stakes (math exams, code generation), base accuracy is already 60–80% (the range where voting helps most), and you can afford the cost. For tasks already above 90% accuracy, or for latency-sensitive apps, single-path CoT is sufficient.
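The claim that voting helps most at moderate base accuracy can be sanity-checked with a toy model. The sketch below assumes each sampled chain is independently correct with probability p and that wrong chains never agree on the same answer (both simplifications); under those assumptions, majority vote wins whenever more than half the samples are correct.

```python
from math import comb

def majority_vote_accuracy(p: float, k: int) -> float:
    """P(majority of k independent samples is correct), assuming each
    chain is right with probability p and wrong answers never coincide."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

for p in (0.6, 0.7, 0.9):
    print(f"base {p:.0%}: 15-sample vote accuracy {majority_vote_accuracy(p, 15):.1%}")
```

The toy model also shows why gains shrink near the extremes: at 90%+ base accuracy a single chain is already close to the ceiling, so the extra N× cost buys little.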

How CoT works internally

Why does writing out reasoning steps improve a model's final answer? The mechanism is not fully understood, but research has narrowed it down to three complementary explanations.

  1. Computation allocation: Each generated token is a full forward pass through the model. Generating 100 reasoning tokens applies roughly 100 additional forward passes to the problem before the answer, essentially a soft form of multi-step computation that the model's fixed depth cannot perform in a single pass.
  2. External working memory: LLMs have no internal state beyond the context window. Writing intermediate results to the context externalizes memory. Without CoT, intermediate values computed in early layers are lost before the answer layer is reached. With CoT, those values persist as tokens in context.
  3. Knowledge pathway activation: Reasoning through a problem step-by-step activates different, more relevant knowledge paths than jumping directly to an answer. The intermediate tokens serve as attention anchors that pull in more precise knowledge from the model's weights.

The faithfulness problem

Research (Turpin et al., 2023) found that models' verbalized CoT reasoning is sometimes "unfaithful": the stated reasoning does not reflect what is actually driving the answer. When a biasing hint is added to the prompt, the model often changes its answer while producing a plausible-looking rationale that never mentions the hint. This matters for debugging: a correct-looking CoT does not guarantee correct internal reasoning.
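A minimal way to screen for this in your own pipeline, in the spirit of Turpin et al.'s biasing-hint experiments, is to ask the same question with and without a suggested answer and flag cases where the answer flips. The `ask` callable and the sycophantic stub below are illustrative placeholders, not part of any real API.

```python
def flips_under_hint(ask, question: str, hint: str) -> bool:
    """True if the model's answer changes when a biasing hint is added,
    a red flag that its stated reasoning may be post-hoc."""
    baseline = ask(question)
    biased = ask(f"{question}\nA colleague thinks the answer is {hint}.")
    return baseline.strip() != biased.strip()

# Illustrative stub: a "model" that parrots any suggested answer.
def sycophant(prompt: str) -> str:
    return "B" if "thinks the answer is B" in prompt else "A"

print(flips_under_hint(sycophant, "Is the answer A or B?", "B"))  # → True
```

In practice you would pass a real model-calling function as `ask` and run the probe over a sample of your evaluation set; a high flip rate means the visible reasoning should not be trusted as an explanation.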

Reasoning models: o1, R1, and extended thinking

In late 2024, a new paradigm emerged: models trained (not just prompted) to reason extensively before answering. These models generate thousands of hidden "thinking" tokens — an internal scratchpad — before producing the visible response.

| Model | Lab | Approach | AIME 2024 score | Key capability |
|---|---|---|---|---|
| GPT-4o | OpenAI | Standard SFT + RLHF | 13% | Best general assistant without extended thinking |
| o1 | OpenAI | RL-trained to reason; hidden chain-of-thought | 74% | 5.7× better on AIME; beats PhD experts in many domains |
| o3 | OpenAI | Scaled-up o1; adaptive compute budget | 96% | Near-perfect on AIME; competitive-coding champion level |
| DeepSeek-R1 | DeepSeek | Group Relative Policy Optimization (GRPO) on verifiable rewards | 79% | Open-weights reasoning model matching o1 |
| Claude 3.7 Sonnet | Anthropic | Extended thinking mode: configurable token budget for reasoning | ~80% | User-visible thinking traces; budget control |
| Gemini 2.0 Flash Thinking | Google | Distilled reasoning into a faster model | ~70% | Fastest reasoning model as of early 2025 |

How they're trained differently

Standard CoT is a prompting technique that works at inference time. Reasoning models are trained with RL using verifiable reward signals: math problems where you can check whether the answer is correct, code that either passes tests or fails. The RL process discovers reasoning strategies that maximize correctness, leading to emergent behaviors like self-correction, exploration, and backtracking that weren't explicitly programmed.
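The shape of such a verifiable reward can be sketched in a few lines. This is an illustrative simplification, not any lab's actual implementation: the completion earns reward 1.0 if its final number matches the gold answer, and 0.0 otherwise.

```python
import re

def math_reward(completion: str, gold: float) -> float:
    """Binary verifiable reward: 1.0 if the last number in the
    completion matches the known-correct answer, else 0.0."""
    nums = re.findall(r'-?\d+(?:\.\d+)?', completion)
    if not nums:
        return 0.0
    return 1.0 if abs(float(nums[-1]) - gold) < 1e-6 else 0.0

print(math_reward("Total time is 3 h, so the average speed is 70", gold=70))  # → 1.0
print(math_reward("I think the answer is 84", gold=70))                       # → 0.0
```

Because the reward needs no human judgment, RL can run over millions of problems; the model is free to discover whatever intermediate reasoning maximizes this signal.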

Limits and failure modes of CoT

CoT dramatically improves reasoning, but it is not a silver bullet. Understanding its failure modes is essential for reliable deployment.

| Failure mode | Description | Example | Mitigation |
|---|---|---|---|
| Plausible but wrong | Coherent reasoning steps lead to incorrect final answer | "3 × 4 = 12, therefore total is 14" (arithmetic slip) | Self-consistency; external verification |
| Error compounding | Early mistake amplifies through the chain | Wrong unit conversion → all subsequent steps wrong | Structured problem decomposition; re-ask |
| Spurious reasoning | Stated reasoning is post-hoc rationalization, not actual | Model changes answer when hint added but claims different reasoning | Faithfulness probes; cross-check answers |
| Verbosity spiral | More steps ≠ more accuracy; model over-complicates | Simple addition solved in 15 verbose steps with error | Instruction: "Be concise, show only key steps" |
| Hallucinated facts mid-chain | Model invents intermediate values | "Wikipedia says X" where X does not exist | Grounding: tool calls for factual lookups within CoT |
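The "external verification" mitigation for arithmetic slips can be as simple as re-checking every stated calculation in the trace. A minimal sketch (the regex only handles binary `a op b = c` claims with +, -, *, ×, and /):

```python
import re

OPS = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
       '*': lambda a, b: a * b, '×': lambda a, b: a * b,
       '/': lambda a, b: a / b}

def check_arithmetic(trace: str) -> list[tuple[str, bool]]:
    """Verify every 'a op b = c' claim found in a reasoning trace."""
    pattern = r'(\d+(?:\.\d+)?)\s*([+*/×-])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)'
    checks = []
    for a, op, b, c in re.findall(pattern, trace):
        ok = abs(OPS[op](float(a), float(b)) - float(c)) < 1e-9
        checks.append((f"{a} {op} {b} = {c}", ok))
    return checks

trace = "Distance: 120 + 90 = 210 km. Speed: 210 / 3 = 70. Also 3 × 4 = 14."
for claim, ok in check_arithmetic(trace):
    print(("OK " if ok else "BAD") + f" {claim}")
```

Any flagged claim can trigger a re-ask or a tool call; this catches the "3 × 4 = 14" class of slip without trusting the model's own verification step.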

Never deploy CoT alone for high-stakes decisions

In medical, legal, or financial contexts, CoT reasoning that reads as authoritative can be confidently wrong. Always: (1) verify numeric outputs independently, (2) use retrieval-grounded CoT for factual claims, (3) add human review for consequential decisions. A model that thinks through 10 steps and reaches a wrong answer is more dangerous than one that says "I'm not sure" — the confident reasoning creates false trust.

Practice questions

  1. What empirical finding by Wei et al. (2022) established chain-of-thought as a major prompting technique? (Answer: Wei et al. (Google, 2022) showed that providing multi-step reasoning examples (and, per follow-up work, appending 'Let's think step by step') dramatically improved performance on arithmetic, commonsense, and symbolic reasoning benchmarks — but ONLY for models above ~100B parameters. For smaller models, CoT hurt performance. This scale-threshold finding was critical: it meant CoT is an emergent capability of large models, not a general prompting technique. The paper showed roughly 40-percentage-point accuracy gains on GSM8K with CoT vs direct answering.)
  2. What is the difference between zero-shot CoT and few-shot CoT prompting? (Answer: Zero-shot CoT: simply append 'Let's think step by step.' to the prompt — no examples provided. Few-shot CoT: provide 3–8 examples of (question, reasoning chain, answer) before the target question. Few-shot CoT outperforms zero-shot CoT on complex reasoning tasks because the examples demonstrate the expected reasoning format and depth. Zero-shot CoT is simpler (no example curation) and often sufficient for well-defined problems. Few-shot CoT is preferred for novel reasoning patterns where the model needs to see the expected structure.)
  3. What is self-consistency decoding and how does it improve CoT performance? (Answer: Self-consistency (Wang et al. 2022): sample k reasoning chains independently (temperature > 0), execute each to get k answers, take majority vote. The diversity of reasoning paths reduces reliance on any single chain that may contain errors. Key insight: multiple correct paths lead to correct answers; multiple incorrect paths rarely agree on the same wrong answer. GSM8K improvement: CoT+self-consistency (k=40): 88% vs CoT alone: 57%. Trade-off: k× more inference compute and API cost.)
  4. What is the 'unfaithful reasoning' problem in chain-of-thought? (Answer: CoT reasoning chains may not reflect the model's actual internal computation. Lanham et al. (2023): models sometimes give incorrect CoT but correct final answers (unused reasoning), and correct CoT but incorrect answers (reasoning not actually guiding the output). Faithfulness of CoT is debated: the reasoning might be post-hoc rationalization of an answer computed through other mechanisms. This matters for safety: if a model's stated reasoning is unfaithful, we cannot use it to understand or verify model behavior.)
  5. When does CoT hurt performance compared to direct answering? (Answer: CoT hurts for: (1) Simple factual questions — 'What is the capital of France?' Adding 'Let me reason...' wastes tokens and can introduce errors. (2) Tasks that are pattern-matched from training data — models can answer faster and more accurately without reasoning steps. (3) Small models (<10B) — they lack the capacity to reason effectively in CoT; forced reasoning introduces errors. Rule: use CoT for tasks requiring multi-step computation or reasoning. Skip CoT for simple retrieval, classification, or pattern matching.)

CoT in 2026: prompting vs training — what changed

Chain-of-thought started as a prompting technique — something you add to a prompt. In 2024-2026 it became a training objective. Understanding the difference is critical for knowing which approach to use.

| Approach | How thinking happens | Who controls it | Cost | Best for |
|---|---|---|---|---|
| Zero-shot CoT (prompt) | Add "think step by step" to the prompt | You (via the prompt) | Normal token cost | Quick improvement on math/logic with existing models |
| Few-shot CoT (prompt) | Provide reasoning examples in the prompt | You (via examples) | More input tokens | Tasks with a specific reasoning format you want to enforce |
| Self-consistency (prompt) | Generate N independent chains, majority vote | You (via sampling) | N× token cost | High-stakes accuracy tasks where cost is acceptable |
| Reasoning models (trained) | Model trained with RL to generate internal scratchpad | The model (emergent from training) | 3–10× standard token cost | Competition math, complex code, PhD-level analysis |
| Extended thinking (API) | Configure a token budget for thinking; optional visibility | You (set budget) + model (what to think) | Budget tokens + standard output | Hard problems where you want thinking visible/auditable |

The key insight from DeepSeek-R1

DeepSeek-R1 showed that if you train a model with reinforcement learning on problems with verifiable answers (math, code), chain-of-thought reasoning emerges spontaneously — the model discovers that thinking longer leads to better rewards. You don't need to collect human reasoning demonstrations. This means reasoning capability is fundamentally about training on the right reward signal, not about telling the model to "think step by step."
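The "group-relative" part of GRPO can be illustrated directly: sample a group of completions for one prompt, score each with a verifiable reward, and use each reward's deviation from the group mean (normalized by the group's standard deviation) as its advantage, with no learned value model. A simplified sketch of that normalization step only:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: z-score each sample's reward within its group."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    if sd == 0:  # all samples equally good or bad: no learning signal
        return [0.0] * len(rewards)
    return [(r - mu) / sd for r in rewards]

# Four sampled chains for one prompt; two got the verifiable answer right.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Chains that beat their group average get positive advantage and are reinforced; chains that underperform are suppressed. The full algorithm adds a clipped policy-gradient objective and a KL penalty on top of this signal.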

Practical decision tree for 2026: (1) Is your task one of: competition math, complex multi-file debugging, scientific analysis, or legal/financial reasoning where errors are costly? → Use a reasoning model (o3, DeepSeek-R1, Claude extended thinking). (2) Is your task simpler but benefits from structured reasoning? → Use zero-shot or few-shot CoT. (3) Is your task straightforward Q&A, creative writing, or classification? → Skip CoT entirely — it adds tokens without benefit and can hurt performance on simple tasks.

Practical CoT templates for common task types — copy and adapt

# Template 1: Mathematical problem solving
MATH_COT = """Solve the following problem. Show all work.

Format your response exactly as:
## Setting up the problem
[identify knowns, unknowns, relevant formulas]

## Step-by-step solution
[numbered steps, show each calculation]

## Verification
[check your answer makes sense]

## Final answer
[state the answer clearly]

Problem: {problem}"""

# Template 2: Code debugging
DEBUG_COT = """Debug the following code. Think through it systematically.

## What the code is supposed to do
[brief description of intent]

## Reading through the code
[trace execution, note what each section does]

## Identifying the bug
[exactly where and what is wrong, and why]

## The fix
[corrected code with explanation]

## Testing the fix
[confirm it works on the given test case]

Code:
```{language}
{code}
```
Error: {error}"""

# Template 3: Multi-source analysis
ANALYSIS_COT = """Analyze the following question. Be systematic.

## What is being asked
[restate the core question in your own words]

## Key considerations
[factors that affect the answer, tradeoffs]

## Evidence and reasoning
[for each consideration, what does the evidence say?]

## Counter-arguments
[what would someone who disagrees say?]

## Conclusion
[balanced assessment with clear reasoning]

Question: {question}"""
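These templates are plain `str.format` strings, so filling them is one call; the only caveat is that any literal `{` in a template must belong to a named field. A minimal usage sketch (the `MINI_COT` template here is a shortened stand-in for the ones above):

```python
MINI_COT = """Solve the following problem. Show all work.

## Step-by-step solution
[numbered steps, show each calculation]

## Final answer
[state the answer clearly]

Problem: {problem}"""

prompt = MINI_COT.format(
    problem="If a train travels 210 km in 3 hours, what is its average speed?"
)
print(prompt.splitlines()[0])  # → Solve the following problem. Show all work.
```

The filled string is then sent as a single user message (or as the system prompt, for templates like `few_shot_system` earlier in this article).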

