Temperature is a hyperparameter that controls the randomness of token selection during LLM text generation. At temperature 0 the model always picks the highest-probability next token (deterministic, repetitive). At higher temperatures it samples from a wider distribution, producing more varied, creative — but potentially less accurate — responses. Sampling strategies like top-k and top-p (nucleus sampling) work alongside temperature to shape output quality.
How temperature works mathematically
Before sampling, the model's raw output is a vector of logits — one unnormalized score per vocabulary token. The softmax function converts these into probabilities. Temperature T divides every logit before softmax, controlling the sharpness of the distribution:
P(token_i) = exp(z_i / T) / Σ_j exp(z_j / T)

Softmax with temperature T, where z_i are the logits. As T→0, the highest logit dominates and the distribution collapses to a one-hot. As T→∞, all tokens become equally likely.
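A minimal stdlib-only sketch makes the effect concrete (the helper name and the toy logits are ours, not from any library):

```python
import math

def softmax_with_temperature(logits, t):
    """Divide logits by t, then apply a numerically stable softmax."""
    scaled = [z / t for z in logits]
    m = max(scaled)                        # subtract the max before exp for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    # Low t: probability mass piles onto the top logit; high t: it spreads out
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Running this shows the top token's probability near 1.0 at T=0.1 and the distribution flattening noticeably at T=2.0.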
| Temperature | Effect | Distribution shape | Best for |
|---|---|---|---|
| 0 (or ≈0.01) | Always picks the top token — fully deterministic | Sharp spike on one token | Code generation, maths, factual Q&A |
| 0.3–0.5 | Mostly deterministic, small variation | Narrow peak with some spread | Summarization, classification, structured data |
| 0.7–1.0 | Balanced creativity and coherence | Moderate spread | General conversation, essays, explanations |
| 1.2–1.5 | Creative and diverse but occasionally off-track | Flat, wide distribution | Brainstorming, poetry, creative writing |
| >2.0 | Near-random gibberish | Almost uniform — no meaningful signal | Not useful in practice |
The temperature=0 myth
Even at temperature=0, most APIs are not perfectly deterministic due to floating-point non-determinism across GPU hardware and batching. For true reproducibility, also set a fixed seed if the API supports it.
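To see why a seed pins down the sampling step (but not floating-point variation in the logits themselves), here is a stdlib-only sketch; the function name is illustrative, not an API:

```python
import random

def sample_token(probs, seed=None):
    """Sample an index from a probability list; a fixed seed pins the RNG state."""
    rng = random.Random(seed)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.6, 0.3, 0.1]
a = sample_token(probs, seed=42)
b = sample_token(probs, seed=42)
assert a == b  # same seed, same draw: the sampling itself is reproducible
```

The seed makes the draw from a *given* distribution repeatable; if GPU non-determinism perturbs the probabilities upstream, the draw can still differ.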
Top-k and Top-p (nucleus) sampling
Temperature alone isn't enough — even a well-shaped distribution can assign tiny probability to catastrophic tokens. Top-k and top-p sampling truncate the distribution before sampling, preventing rare tokens from ever being picked.
| Strategy | How it works | Hyperparameter | Tradeoff |
|---|---|---|---|
| Greedy | Always pick the highest-probability token | None | Deterministic but repetitive |
| Temperature | Rescale all logits before softmax | T (0–2) | Global — every token is affected |
| Top-k | Restrict sampling to the k most likely tokens; zero out all others and renormalize | k (e.g. 40–100) | Fixed candidate count regardless of how flat/sharp the distribution is |
| Top-p (nucleus) | Keep only the smallest set of tokens whose cumulative probability ≥ p | p (e.g. 0.9–0.95) | Adaptive — keeps more tokens when distribution is flat, fewer when it's sharp |
| Min-p | Keep tokens with probability > p × top-token-probability | p (e.g. 0.05) | Newer; scales threshold relative to the model's confidence |
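The top-k row can be sketched in a few lines of stdlib Python (the helper name is ours):

```python
def top_k_filter(probs, k):
    """Zero out everything but the k highest-probability tokens, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

probs = [0.50, 0.25, 0.15, 0.07, 0.03]
print(top_k_filter(probs, k=2))  # only the two most likely tokens survive
```

Note the fixed-size tradeoff from the table: k=2 keeps exactly two candidates whether the distribution is sharp or nearly flat.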
Manual nucleus (top-p) sampling — the core loop behind most LLM sampling implementations
```python
import torch
import torch.nn.functional as F

def top_p_sample(logits: torch.Tensor, temperature: float = 0.9, top_p: float = 0.9) -> int:
    """
    Nucleus sampling with temperature.
    logits: raw unnormalized scores, shape (vocab_size,)
    """
    # 1. Apply temperature
    scaled = logits / max(temperature, 1e-8)
    # 2. Softmax → probabilities
    probs = F.softmax(scaled, dim=-1)
    # 3. Sort tokens by probability (highest first)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    # 4. Compute cumulative sum; find the nucleus boundary
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # 5. Remove tokens once cumulative prob exceeds top_p
    #    (shift by one position so we always keep at least one token)
    remove_mask = cumulative - sorted_probs > top_p
    sorted_probs[remove_mask] = 0.0
    # 6. Renormalize and sample
    sorted_probs /= sorted_probs.sum()
    sampled_pos = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_indices[sampled_pos].item()
```

Practical settings by task
| Task type | Recommended temperature | Top-p | Notes |
|---|---|---|---|
| Code generation | 0–0.2 | 0.95 | Low temperature essential; syntax errors compound |
| Factual Q&A / RAG | 0–0.3 | 0.9 | Accuracy over creativity; hallucinations increase with T |
| Summarization | 0.3–0.5 | 0.9 | Some variation acceptable, faithfulness important |
| Chat / customer support | 0.6–0.8 | 0.9 | Natural-sounding without losing coherence |
| Creative writing / brainstorming | 0.9–1.2 | 0.95 | Diversity is desirable; humans can filter |
| Roleplay / fiction | 1.0–1.3 | 0.95–1.0 | Unexpected word choices enhance immersion |
Top-p vs temperature: use both
The most robust production setting combines both: temperature controls the softness of the distribution, top-p prevents low-probability tokens from ever being sampled regardless of temperature. The combination outperforms either alone. A good default: temperature=0.7, top_p=0.9.
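The combined pipeline can be sketched with the stdlib alone (function name and toy values are ours; real inference servers do this on GPU tensors):

```python
import math
import random

def sample_with_temperature_top_p(logits, temperature=0.7, top_p=0.9, seed=None):
    """Temperature rescaling followed by nucleus truncation, in one step."""
    # Temperature: divide logits, then a numerically stable softmax
    scaled = [z / max(temperature, 1e-8) for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus: keep the smallest high-probability prefix whose mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    rng = random.Random(seed)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With a sharply peaked distribution the nucleus collapses to a single token and the call becomes effectively greedy, which is exactly the adaptive behavior top-p is valued for.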
Practice questions
- What happens when you set temperature=0 in an LLM API call? (Answer: Temperature=0 selects the highest probability token at every step — fully deterministic greedy decoding. The same prompt will produce the same output, barring the floating-point caveats noted earlier. Use for: factual Q&A, code generation where consistency matters, tests. Avoid for: creative writing, brainstorming, where diversity of outputs is valuable.)
- Top-p=0.9 means the model samples from the smallest set of tokens whose cumulative probability reaches at least 90%. For a distribution where one token has probability 0.95, which tokens are eligible? (Answer: Just that one token — it alone accounts for 95% ≥ 90% of probability mass. With top-p=0.9, the smallest set of tokens totalling ≥90% is just this single dominant token. This is the key advantage over top-k: top-p automatically collapses to near-greedy when the model is highly confident, giving creativity only when the model is genuinely uncertain about what comes next.)
- What is repetition penalty in LLM sampling and when is it necessary? (Answer: Repetition penalty discounts logits for tokens that have already appeared in the generated text: effective_logit = original_logit / penalty if token appeared, else original_logit. penalty > 1.0 reduces probability of repeating tokens. Default is 1.0 (no penalty). Necessary for models that fall into repetition loops (common without penalty for long generation). Over-penalisation can prevent legitimate word repetition (in lists, technical terms). Typical useful range: 1.1–1.3.)
- Why might you use min-p sampling instead of top-k or top-p? (Answer: Min-p: filter tokens whose probability is less than min_p × (probability of the top token). Unlike top-k (fixed count regardless of distribution) or top-p (fixed mass), min-p adapts relative to the strongest option. When top token is at 80%, min-p=0.05 keeps tokens with probability ≥4% — very few. When top token is at 10%, min-p=0.05 keeps tokens with probability ≥0.5% — many options. Maintains consistent relative confidence filtering across all probability distributions.)
- A customer service chatbot uses temperature=0.9 for all responses. What problem might arise? (Answer: High temperature introduces randomness — the bot may give inconsistent answers to the same question, different pricing information in different conversations, varying support procedures. For factual, policy-based responses (return policy, pricing, troubleshooting steps), temperature should be low (0.0–0.3). High temperature is appropriate for creative tasks, not deterministic information retrieval. Many production systems use temperature=0 for factual queries.)
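Two of the mechanisms in the questions above, repetition penalty and min-p, can be sketched in a few lines each. The function names are ours; the sign handling in the penalty (divide positive logits, multiply negative ones) follows the common implementation style so the penalty is always a discount, a detail the simplified formula above glosses over:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Discount logits of already-generated tokens.
    Positive logits are divided, negative ones multiplied, so both move down."""
    out = list(logits)
    for i in set(generated_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

def min_p_filter(probs, min_p=0.05):
    """Keep only tokens whose probability is at least min_p times the top token's,
    then renormalize the survivors."""
    threshold = min_p * max(probs)
    filtered = [p if p >= threshold else 0.0 for p in probs]
    total = sum(filtered)
    return [p / total for p in filtered]
```

With a top token at 0.8 and min_p=0.05, the threshold is 0.04 — matching the worked numbers in the min-p answer above.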
On LumiChats
In LumiChats, you can adjust temperature for each conversation context — lower for precise research tasks, higher for creative brainstorming. The default is tuned for balanced accuracy and natural-sounding conversation.