
Speculative Decoding

Using a small draft model to make a large model generate 2–4× faster.


Definition

Speculative decoding is an inference acceleration technique that uses a small, fast 'draft' model to speculatively generate several tokens ahead, which a larger 'target' model then verifies in parallel. Because transformer models can evaluate multiple token candidates in a single forward pass (via parallel attention), verifying k draft tokens costs only marginally more than verifying one — while potentially accepting all k tokens if the draft model was correct. The result is 2–4× faster generation from large models with mathematically identical output distributions.

How speculative decoding works

Standard autoregressive generation is sequential: the large model generates one token, appends it, generates the next, and so on. Each forward pass is a bottleneck. Speculative decoding introduces a parallel draft-then-verify loop:

  1. Draft phase: A small, fast model (2–7B parameters, ~10–20× faster than the target model) autoregressively generates k tokens (typically k=4–8) as candidates.
  2. Verify phase: The large target model processes the original context plus all k draft tokens in a single forward pass — computing the probability distribution at each of the k positions in parallel.
  3. Accept/reject: At each position i, the target model compares its probability for the draft token against the draft model's. If the target assigns at least as much probability as the draft did, the token is accepted outright; otherwise it is accepted only with probability p_target/p_draft. At the first rejection, that draft token and all subsequent ones are discarded, and a corrected token is sampled from the residual target distribution.
  4. Guaranteed equivalence: The acceptance sampling procedure is designed so that the accepted tokens have exactly the same distribution as if the large model generated them directly — no approximation or quality tradeoff.
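The draft-then-verify loop above can be sketched in a few lines of Python. The toy "models" below are fixed probability tables invented purely for illustration; a real system would call actual draft and target networks, and the verify step would score all k positions in one batched forward pass rather than querying a table.

```python
import random

random.seed(0)

VOCAB = ["a", "b", "c"]

# Toy stand-ins for the two models: each maps a context (ignored here for
# brevity) to a distribution over VOCAB. These numbers are made up.
def draft_dist(ctx):
    return {"a": 0.7, "b": 0.2, "c": 0.1}

def target_dist(ctx):
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def sample(dist):
    r, acc = random.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok

def speculative_step(ctx, k=4):
    """One draft-then-verify iteration; returns the tokens emitted."""
    # 1. Draft phase: the small model proposes k tokens autoregressively.
    drafted, c = [], list(ctx)
    for _ in range(k):
        tok = sample(draft_dist(c))
        drafted.append(tok)
        c.append(tok)
    # 2. Verify phase: a real target model would score all k positions in
    #    ONE forward pass; here we just look up target_dist per position.
    emitted, c = [], list(ctx)
    for tok in drafted:
        p, q = target_dist(c)[tok], draft_dist(c)[tok]
        # 3. Accept with probability min(1, p/q) — this rule is what makes
        #    the output distribution exactly match the target model's.
        if random.random() < min(1.0, p / q):
            emitted.append(tok)
            c.append(tok)
        else:
            # Reject: resample from the corrected (residual) distribution
            # proportional to max(0, p_target - p_draft), then stop.
            resid = {t: max(0.0, target_dist(c)[t] - draft_dist(c)[t])
                     for t in VOCAB}
            z = sum(resid.values())
            emitted.append(sample({t: v / z for t, v in resid.items()}))
            break
    else:
        # 4. All k accepted: emit one bonus token from the target's
        #    distribution at the final position (computed anyway).
        emitted.append(sample(target_dist(c)))
    return emitted

out = speculative_step(["<s>"], k=4)
print(out)  # between 1 and k+1 tokens per target forward pass
```

Each call to `speculative_step` therefore emits anywhere from 1 token (immediate rejection plus correction) to k+1 tokens (all accepted plus the bonus token), for the cost of a single target-model pass.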
| Scenario | Tokens accepted | Speedup achieved |
|---|---|---|
| Draft model highly accurate (e.g., coding tasks) | 4–6 of 5 draft tokens | 2.5–4× over standard generation |
| Draft model moderately accurate (general text) | 2–3 of 5 draft tokens | 1.5–2.5× over standard generation |
| Draft model inaccurate (highly creative output) | 1–2 of 5 draft tokens | ~1.2× — diminishing benefit |
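A rough feel for these numbers: if each draft token is accepted independently with probability alpha (a simplifying assumption — real acceptances are correlated), the expected number of tokens emitted per target forward pass, counting the correction or bonus token, has a closed form: the geometric series 1 + alpha + ... + alpha^k.

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per target forward pass, assuming each of
    the k drafted tokens is accepted independently with probability alpha.
    Closed form of the geometric series 1 + alpha + ... + alpha**k."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.9, 0.7, 0.4):
    print(f"alpha={alpha}: {expected_tokens(alpha, 5):.2f} tokens/pass")
# alpha=0.9 → 4.69, alpha=0.7 → 2.94, alpha=0.4 → 1.66
```

These three acceptance rates line up with the three rows of the table: a highly accurate draft model yields several tokens per pass, while a poor one barely beats standard one-token-per-pass generation.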

Where speculative decoding is used in production

Google uses speculative decoding in Gemini inference infrastructure. Anthropic uses it for Claude API responses. Meta uses it in Llama deployment. It is supported in vLLM (the dominant open-source LLM inference framework), where it can be enabled by configuring a compatible draft model. For cloud API users the speedup is transparent: responses arrive faster with no API changes. For self-hosting teams, enabling speculative decoding is one of the highest-ROI inference optimisations available.

Practice questions

  1. Why must the draft model and target model share the same tokenizer vocabulary? (Answer: Token-level verification requires the target model to evaluate the exact same token IDs proposed by the draft. If vocabularies differ, token IDs map to different words — making acceptance/rejection mathematically meaningless. All speculative decoding implementations enforce identical tokenizers between draft and target.)
  2. Speculative decoding guarantees identical output distribution to standard autoregressive generation. How? (Answer: The acceptance-rejection sampling procedure ensures correctness. If target probability ≥ draft probability for a token, accept with probability 1. Otherwise accept with probability p_target/p_draft and resample from a corrected distribution. The combined process has exactly the same marginal distribution as the target model alone.)
  3. A draft model accepts 6 of 8 proposed tokens on average. Estimate the speedup. (Answer: Effective tokens per target forward pass ≈ 6 + 1 (the correction token) = 7. Standard generation yields 1 token per pass. Raw speedup ≈ 7×, but verification passes are more expensive than standard passes (k+1 tokens processed). Practical speedup is typically 2.5–3.5× in this acceptance scenario.)
  4. Why does speculative decoding help less for highly creative (high-temperature) generation? (Answer: Creative outputs require sampling from broad, flat distributions — the draft model trained on typical sequences struggles to predict atypical creative tokens. Acceptance rates drop to 1–2 tokens per draft batch, approaching standard generation speed. Speculative decoding is most efficient for constrained, predictable domains like code and structured text.)
  5. What is self-speculative decoding and what is its main advantage? (Answer: Uses the target model's own early transformer layers (e.g., first 20 of 80) as a draft — no separate model needed. Eliminates the requirement for a draft model with the same tokenizer and avoids memory overhead of loading two models. Trade-off: lower acceptance rates than a dedicated draft model since early layers produce less accurate drafts.)

