
Speculative Decoding

Using a small draft model to make a large model generate 2–4× faster.


Definition

Speculative decoding is an inference acceleration technique that uses a small, fast 'draft' model to speculatively generate several tokens ahead, which a larger 'target' model then verifies in parallel. Because transformer models can evaluate multiple token candidates in a single forward pass (via parallel attention), verifying k draft tokens costs only marginally more than verifying one — while potentially accepting all k tokens if the draft model was correct. The result is 2–4× faster generation from large models with mathematically identical output distributions.

How speculative decoding works

Standard autoregressive generation is sequential: the large model generates one token, appends it, generates the next, and so on. Each forward pass is a bottleneck. Speculative decoding introduces a parallel draft-then-verify loop:

  1. Draft phase: A small, fast model (2–7B parameters, ~10–20× faster than the target model) autoregressively generates k tokens (typically k=4–8) as candidates.
  2. Verify phase: The large target model processes the original context plus all k draft tokens in a single forward pass — computing the probability distribution at each of the k positions in parallel.
  3. Accept/reject: At each position i, the target model compares its probability for the draft token against the draft model's. If the target assigns at least as much probability as the draft did, the token is accepted outright; otherwise it is accepted only with probability p_target/p_draft. At the first rejection, that draft token and all subsequent ones are discarded, and a corrected token is sampled from the residual target distribution.
  4. Guaranteed equivalence: The acceptance sampling procedure is designed so that the accepted tokens have exactly the same distribution as if the large model generated them directly — no approximation or quality tradeoff.
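The draft-then-verify loop above can be sketched in a few lines of Python. The toy "models" below are fixed probability tables invented purely for illustration; a real system would call actual draft and target networks, and the verify step would score all k positions in one batched forward pass rather than querying a table.

```python
import random

random.seed(0)

VOCAB = ["a", "b", "c"]

# Toy stand-ins for the two models: each maps a context (ignored here for
# brevity) to a distribution over VOCAB. These numbers are made up.
def draft_dist(ctx):
    return {"a": 0.7, "b": 0.2, "c": 0.1}

def target_dist(ctx):
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def sample(dist):
    r, acc = random.random(), 0.0
    for tok, p in dist.items():
        acc += p
        if r < acc:
            return tok
    return tok

def speculative_step(ctx, k=4):
    """One draft-then-verify iteration; returns the tokens emitted."""
    # 1. Draft phase: the small model proposes k tokens autoregressively.
    drafted, c = [], list(ctx)
    for _ in range(k):
        tok = sample(draft_dist(c))
        drafted.append(tok)
        c.append(tok)
    # 2. Verify phase: a real target model would score all k positions in
    #    ONE forward pass; here we just look up target_dist per position.
    emitted, c = [], list(ctx)
    for tok in drafted:
        p, q = target_dist(c)[tok], draft_dist(c)[tok]
        # 3. Accept with probability min(1, p/q) — this rule is what makes
        #    the output distribution exactly match the target model's.
        if random.random() < min(1.0, p / q):
            emitted.append(tok)
            c.append(tok)
        else:
            # Reject: resample from the corrected (residual) distribution
            # proportional to max(0, p_target - p_draft), then stop.
            resid = {t: max(0.0, target_dist(c)[t] - draft_dist(c)[t])
                     for t in VOCAB}
            z = sum(resid.values())
            emitted.append(sample({t: v / z for t, v in resid.items()}))
            break
    else:
        # 4. All k accepted: emit one bonus token from the target's
        #    distribution at the final position (computed anyway).
        emitted.append(sample(target_dist(c)))
    return emitted

out = speculative_step(["<s>"], k=4)
print(out)  # between 1 and k+1 tokens per target forward pass
```

Each call to `speculative_step` therefore emits anywhere from 1 token (immediate rejection plus correction) to k+1 tokens (all accepted plus the bonus token), for the cost of a single target-model pass.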
| Scenario | Tokens accepted | Speedup achieved |
|---|---|---|
| Draft model highly accurate (e.g., coding tasks) | 4–6 of 5 draft tokens | 2.5–4× over standard generation |
| Draft model moderately accurate (general text) | 2–3 of 5 draft tokens | 1.5–2.5× over standard generation |
| Draft model inaccurate (highly creative output) | 1–2 of 5 draft tokens | ~1.2× — diminishing benefit |
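A rough feel for these numbers: if each draft token is accepted independently with probability alpha (a simplifying assumption — real acceptances are correlated), the expected number of tokens emitted per target forward pass, counting the correction or bonus token, has a closed form: the geometric series 1 + alpha + ... + alpha^k.

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per target forward pass, assuming each of
    the k drafted tokens is accepted independently with probability alpha.
    Closed form of the geometric series 1 + alpha + ... + alpha**k."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.9, 0.7, 0.4):
    print(f"alpha={alpha}: {expected_tokens(alpha, 5):.2f} tokens/pass")
# alpha=0.9 → 4.69, alpha=0.7 → 2.94, alpha=0.4 → 1.66
```

These three acceptance rates line up with the three rows of the table: a highly accurate draft model yields several tokens per pass, while a poor one barely beats standard one-token-per-pass generation.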

Where speculative decoding is used in production

Google uses speculative decoding in Gemini inference infrastructure. Anthropic uses it for Claude API responses. Meta uses it in Llama deployment. It is supported in vLLM (the dominant open-source LLM inference framework), where it can be enabled by configuring a compatible draft model. For cloud API users the speedup is transparent: responses arrive faster with no API changes. For self-hosting teams, enabling speculative decoding is one of the highest-ROI inference optimisations available.

Practice questions

  1. Why must the draft model and target model share the same tokenizer vocabulary? (Answer: Token-level verification requires the target model to evaluate the exact same token IDs proposed by the draft. If vocabularies differ, token IDs map to different words — making acceptance/rejection mathematically meaningless. All speculative decoding implementations enforce identical tokenizers between draft and target.)
  2. Speculative decoding guarantees identical output distribution to standard autoregressive generation. How? (Answer: The acceptance-rejection sampling procedure ensures correctness. If target probability ≥ draft probability for a token, accept with probability 1. Otherwise accept with probability p_target/p_draft and resample from a corrected distribution. The combined process has exactly the same marginal distribution as the target model alone.)
  3. A draft model accepts 6 of 8 proposed tokens on average. Estimate the speedup. (Answer: Effective tokens per target forward pass ≈ 6 + 1 (the correction token) = 7. Standard generation yields 1 token per pass. Raw speedup ≈ 7×, but verification passes are more expensive than standard passes (k+1 tokens processed). Practical speedup is typically 2.5–3.5× in this acceptance scenario.)
  4. Why does speculative decoding help less for highly creative (high-temperature) generation? (Answer: Creative outputs require sampling from broad, flat distributions — the draft model trained on typical sequences struggles to predict atypical creative tokens. Acceptance rates drop to 1–2 tokens per draft batch, approaching standard generation speed. Speculative decoding is most efficient for constrained, predictable domains like code and structured text.)
  5. What is self-speculative decoding and what is its main advantage? (Answer: Uses the target model's own early transformer layers (e.g., first 20 of 80) as a draft — no separate model needed. Eliminates the requirement for a draft model with the same tokenizer and avoids memory overhead of loading two models. Trade-off: lower acceptance rates than a dedicated draft model since early layers produce less accurate drafts.)

