Inference & Model Serving
How AI models run in production at scale.
Category: Inference & Deployment
Inference is the process of running a trained AI model to generate predictions or outputs — as opposed to training, which adjusts the model's weights. In LLMs, inference is autoregressive token generation: predicting one token at a time, left to right. Efficient inference is critical for cost and latency in production AI systems.
How LLM inference works
LLM inference has two distinct phases with very different computational characteristics — understanding them is key to optimizing serving cost and latency.
| Phase | What happens | Compute type | Bottleneck |
|---|---|---|---|
| Prefill | All input (prompt) tokens processed in a single parallel forward pass | Compute-bound: all tokens computed in parallel | FLOPS — add more GPUs to speed up |
| Decode | Generate output tokens one at a time, each requiring a full forward pass | Memory-bandwidth-bound: reads all KV cache per step | GPU memory bandwidth — hard to parallelize |
Each decode step turns the final hidden state h_t into a probability distribution over the next token (W_U is the unembedding/output projection matrix):
P(w_t \mid w_1, \ldots, w_{t-1}) = \text{softmax}(W_U \cdot h_t)[w_t]
Why decode is memory-bandwidth-bound: During decode, generating each token requires streaming all model weights (~140GB for a 70B model in FP16) from GPU HBM into on-chip SRAM — while performing only a tiny amount of compute (one token's worth). An H100 has 3.35TB/s memory bandwidth but roughly 2,000 TFLOPS of peak compute. For a single-token forward pass, the workload is therefore memory-bound by a factor of several hundred. This is why batching decode requests dramatically improves GPU utilization — the cost of reading the weights is amortized across many concurrent tokens.
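A back-of-the-envelope sketch makes the imbalance concrete. It uses the illustrative H100 figures from the paragraph above and ignores the fact that real kernels overlap weight reads with compute, so treat the result as an order-of-magnitude estimate:

```python
# Back-of-the-envelope: why a single-token decode step is memory-bound.
# All figures are illustrative (H100 SXM peaks, 70B FP16 model).
params = 70e9
bytes_per_param = 2                      # FP16 weights
hbm_bandwidth = 3.35e12                  # bytes/s
peak_flops = 2.0e15                      # ~2,000 TFLOPS peak (FP16 dense is roughly half)

weight_bytes = params * bytes_per_param           # ~140 GB streamed per decode step
flops_per_token = 2 * params                      # ~2 FLOPs per parameter per token

t_memory = weight_bytes / hbm_bandwidth           # ~42 ms just moving the weights
t_compute = flops_per_token / peak_flops          # ~0.07 ms of actual arithmetic

print(f"memory: {t_memory * 1e3:.1f} ms, compute: {t_compute * 1e3:.2f} ms")
print(f"memory-bound by ~{t_memory / t_compute:.0f}x")   # several hundred x
```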
KV cache and memory management
The KV cache is the single most important memory structure in LLM inference — understanding it explains why long contexts are expensive and why batching is complex.
\text{KV cache size} = 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times T \times B \times \text{bytes per element}
where T is the sequence length and B the batch size; n_kv_heads equals the full head count for standard multi-head attention and is much smaller under MQA/GQA (see below).
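A quick worked example (a sketch with assumed Llama-3-70B-style numbers) shows how the formula plays out, and why the KV-head sharing listed in the table below matters:

```python
# KV-cache size for one request, following the formula above.
# Assumed, Llama-3-70B-like configuration: 80 layers, 8 KV heads (GQA),
# head dim 128, FP16 cache (2 bytes per element).
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, bytes_per_elem=2):
    # factor 2: one tensor for keys and one for values at every layer
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

size = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, seq_len=8192, batch=1)
print(f"{size / 1e9:.1f} GB per 8K-token request")   # ~2.7 GB

# With full multi-head attention (64 KV heads instead of 8) the same request
# would need ~21.5 GB of cache: that gap is exactly the saving MQA/GQA buys.
```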
| Technique | What it does | Memory savings |
|---|---|---|
| KV cache | Cache past key-value pairs; reuse in each decode step | O(n²) → O(n) compute; enables long contexts |
| PagedAttention (vLLM) | Pages KV cache like OS virtual memory; shares pages across requests | 20–40% higher GPU utilization; no memory fragmentation |
| Quantized KV cache | Store KV cache in INT8 or FP8 instead of FP16 | 2–4× memory reduction with minimal quality loss |
| Sliding window attention | Only keep KV cache for last W tokens (e.g., W=4096) | O(1) memory instead of O(n) — at cost of long-range attention |
| Multi-Query / Grouped-Query Attention (MQA/GQA) | Share K and V heads across groups of query heads (MQA: a single KV head; GQA: a few KV-head groups) | 4–32× smaller KV cache (GQA is used in Llama 3, Mistral, Gemma 2) |
Inference optimization techniques
| Technique | Mechanism | Speedup | Tradeoff |
|---|---|---|---|
| Speculative decoding | Draft model generates k tokens fast; main model verifies all k in one parallel pass | 2–3× | Requires matching draft model; benefit varies with acceptance rate |
| Continuous batching | Process tokens from multiple requests in same batch; replace finished sequences immediately | 5–10× throughput | Higher latency for individual requests |
| FlashAttention 2/3 | Fused attention kernel keeps Q, K, V tiles in fast SRAM; avoids HBM round-trips | 2–4× attention speed; 5–20× lower attention memory | GPU-specific kernels (CUDA on NVIDIA; ROCm ports for AMD) |
| Tensor parallelism | Split attention heads or FFN dimensions across GPUs; all-reduce each layer | Near-linear with # GPUs | Communication overhead; needs fast interconnect (NVLink) |
| Pipeline parallelism | Different model layers on different GPUs; micro-batching to hide bubbles | Near-linear throughput with # GPUs | Micro-batch latency; pipeline bubble overhead |
| AWQ / GPTQ quantization | Quantize weights to INT4/INT8; reduce memory bandwidth bottleneck | 1.5–4× throughput | Slight quality loss; calibration required |
Speculative decoding in depth: Speculative decoding works because: (1) generating draft tokens with a small model (e.g., 3B) is much faster than with the main model (70B), and (2) verifying k tokens in parallel with the main model is no slower than generating 1 token — the forward pass is the same shape and still dominated by the weight read. If the draft model has ~80% token acceptance, you get roughly 3× speedup with output identical in distribution to the main model's (sketched below). Claude reportedly uses speculative decoding in production; Medusa (self-speculative decoding with multiple prediction heads) avoids the need for a separate draft model.
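The control flow is short enough to sketch. This is illustrative only: `draft_model` and `target_model` are hypothetical callables returning per-position logits, and production systems verify with rejection sampling rather than the greedy match used here:

```python
# Minimal greedy-verification sketch of speculative decoding.
# `draft_model` and `target_model` take a token list and return next-token
# logits for every position (shape [len(seq), vocab]); both are assumptions.
import numpy as np

def speculative_step(target_model, draft_model, tokens, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap, small model).
    seq = list(tokens)
    proposed = []
    for _ in range(k):
        next_tok = int(np.argmax(draft_model(seq)[-1]))
        proposed.append(next_tok)
        seq.append(next_tok)

    # 2. Target model scores the prompt plus all k proposals in ONE forward
    #    pass: no slower than generating a single token, because decode cost
    #    is dominated by reading the weights, not by per-token math.
    target_logits = target_model(list(tokens) + proposed)

    # 3. Accept the longest prefix on which the target model agrees; on the
    #    first disagreement, take the target model's own token instead.
    accepted = []
    for i, tok in enumerate(proposed):
        target_choice = int(np.argmax(target_logits[len(tokens) + i - 1]))
        if target_choice != tok:
            accepted.append(target_choice)
            break
        accepted.append(tok)
    return tokens + accepted   # 1 to k new tokens per target forward pass
```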
LLM inference infrastructure in 2025
| Framework / Service | Type | Best for | Key feature |
|---|---|---|---|
| vLLM | Open-source server | Production throughput-optimized serving | PagedAttention, continuous batching, multi-LoRA |
| Ollama | Open-source local | Local dev, single-machine serving | One-command model download + serve; GGUF support |
| llama.cpp | Open-source library | CPU inference, low-VRAM GPU, edge deployment | Quantized GGUF; CPU+GPU split; runs on MacBooks |
| TensorRT-LLM | NVIDIA framework | Maximum performance on NVIDIA GPUs | FP8, kernel fusion, speculative decoding; H100 optimized |
| SGLang | Open-source server | Structured generation, complex multi-call workflows | RadixAttention (KV cache sharing across similar prefixes) |
| Groq LPU | Cloud inference | Fastest token generation speed | Custom LPU chip: 500+ tokens/sec on 70B; not cheapest |
| Together AI / Fireworks | Managed API | Cheap open-source model inference | Per-token pricing, open-source model access |
| AWS Bedrock / Vertex AI | Enterprise managed | Enterprise compliance + multi-provider access | SLA, VPC, audit logging, fine-tune hosting |
Cost benchmark (early 2025): GPT-4o: ~$10–15/M output tokens. Claude 3.5 Sonnet: ~$15/M. Llama 3.1 70B via Together AI: ~$0.88/M. Self-hosted Llama 3.1 70B on vLLM (4× A100 80GB): ~$0.20/M at full utilization. The 50–75× cost gap between frontier closed models and self-hosted open-source explains why companies with high token volumes increasingly fine-tune open-source models for production.
Latency vs throughput tradeoffs
Latency and throughput are fundamentally in tension for LLM serving — optimizing one hurts the other. Choosing the right operating point depends on your use case.
| Metric | Definition | Typical target | Critical for |
|---|---|---|---|
| TTFT (Time To First Token) | Time from request sent to first token received | <500ms for interactive | Chatbots, coding assistants — perceived responsiveness |
| TPOT (Time Per Output Token) | Average time between consecutive output tokens | <50ms (~20 tok/s) | Streaming readability — faster than human reading speed |
| End-to-end latency | Total time from request to complete response | <5s for short responses | Non-streaming batch use cases |
| Throughput (tokens/sec) | Total tokens generated per second across all requests | Maximize for batch | Document processing, offline summarization pipelines |
| Requests per second (RPS) | Requests completed per second across the serving fleet | Varies by batch size and output length | API scaling, cost efficiency |
Streaming and perceived latency: Streaming (Server-Sent Events, SSE) returns tokens as they are generated — the user sees text appearing word-by-word rather than waiting for the full response. This dramatically improves perceived responsiveness even if total generation time is identical. A response that takes 5s to complete feels fast if you see the first tokens in 200ms. All major LLM APIs (OpenAI, Anthropic, Groq) support streaming; always use it for interactive applications.
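A minimal streaming client that also measures TTFT and TPOT might look like the following. This is a sketch assuming the `openai` Python package and an OpenAI-compatible endpoint; chunk boundaries only approximate token boundaries:

```python
# Stream a chat completion over SSE and measure TTFT / TPOT on the client side.
import time
from openai import OpenAI

client = OpenAI()                               # reads OPENAI_API_KEY from the environment
start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="gpt-4o-mini",                        # any model name available on the endpoint
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()    # TTFT measured here
        n_chunks += 1                               # roughly one token per chunk
        print(delta, end="", flush=True)

end = time.perf_counter()
print(f"\nTTFT: {(first_token_at - start) * 1000:.0f} ms")
if n_chunks > 1:
    print(f"TPOT: {(end - first_token_at) * 1000 / (n_chunks - 1):.0f} ms/token")
```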
Practice questions
- What is the difference between throughput and latency in LLM inference, and why is there a fundamental trade-off? (Answer: Latency: time from request to first token (TTFT) + time to generate full response. User-facing — measures how fast responses feel. Throughput: total tokens generated per second across all concurrent users. Server-side — measures capacity. Trade-off: batching requests improves throughput (processes many tokens simultaneously) but increases latency for individual users (must wait for batch). At batch_size=1: minimum latency. At batch_size=256: maximum throughput. Production serving optimizes the Pareto frontier between these, using continuous batching to approach both simultaneously.)
- What is continuous batching (iteration-level scheduling) and why did it transform LLM serving? (Answer: Traditional static batching: group N requests together, wait until ALL finish generating, then start next batch. If request 1 finishes in 10 tokens and request 2 in 1000 tokens, request 1's GPU slot sits idle for 990 token-steps. Continuous batching (Orca, vLLM): as soon as a request finishes, its slot is immediately replaced with a new request. The batch changes composition at every token generation step. Result: GPU utilization goes from ~20% (static) to ~80%+ (continuous). vLLM pioneered this; it is now the standard in all production LLM serving systems.)
- What is TTFT (Time to First Token) and why is it more important than total generation time for user experience? (Answer: TTFT: elapsed time from request submission until the first output token is generated. Covers: network latency + prompt processing (prefill) + scheduling queue wait. User experience: TTFT determines how quickly the UI can show 'something is happening.' A response that streams from token 1 in 500ms feels faster than a response that starts in 2000ms — even if both complete in 5 seconds. This is why streaming is universal in production LLM APIs: show the first token immediately rather than waiting for completion.)
- What hardware is used for LLM inference and what determines model serving cost? (Answer: Primary hardware: NVIDIA H100 (80GB, $30K), H200 (141GB, $40K), A100 (80GB, $10K). AMD MI300X: competitive, gaining traction. Google TPUv5: used for internal Google serving. Cost drivers: (1) GPU VRAM (must hold model weights + KV cache). (2) GPU compute (tokens/sec per GPU). (3) Memory bandwidth (memory-bandwidth-bound decoding phase). Pricing: H100 SXM: $2–3/GPU-hour on cloud. Serving a 70B model: ~4 H100s needed, ~$8–12/hour; at ~100 tokens/second for a single stream that works out to ~$0.023–0.033/1K output tokens, comparable to frontier API pricing. Continuous batching pushes aggregate throughput far above 100 tokens/second, which is how self-hosted cost per token falls to the levels in the cost benchmark above.)
- What is PagedAttention (used in vLLM) and how does it reduce memory waste in LLM serving? (Answer: Standard KV cache: pre-allocated contiguously for max_sequence_length. For 2048-token max: 2048 positions reserved even if request only generates 100 tokens → 95% waste. PagedAttention (Kwon et al. 2023): divides KV cache into fixed-size pages (blocks), allocating pages on demand like virtual memory. Non-contiguous pages are accessed via a block table. Result: near-zero internal fragmentation, memory utilization from 20–30% to 90%+, supports 2–4× more concurrent requests on same hardware. PagedAttention is the core innovation of vLLM.)
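To make the last answer concrete, here is a toy page allocator in the spirit of PagedAttention. It is illustrative only; vLLM's real allocator manages GPU memory and feeds block tables to the attention kernels:

```python
# Toy PagedAttention-style KV block allocator (illustrative sketch, not vLLM's API).
BLOCK_SIZE = 16          # tokens per KV-cache page

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))       # pool of physical block IDs
        self.tables = {}                          # request_id -> list of block IDs

    def append_token(self, request_id, position):
        """Allocate a new page only when a request crosses a block boundary."""
        table = self.tables.setdefault(request_id, [])
        if position % BLOCK_SIZE == 0:            # need a fresh page
            if not self.free:
                raise MemoryError("no free KV blocks: preempt or queue the request")
            table.append(self.free.pop())
        return table[-1], position % BLOCK_SIZE   # (physical block, offset within block)

    def release(self, request_id):
        """A finished request returns its pages straight to the free pool."""
        self.free.extend(self.tables.pop(request_id, []))

# Memory is allocated page-by-page as tokens are generated, so a request that
# stops after 100 tokens holds ceil(100/16) = 7 blocks, not max_seq_len worth.
alloc = BlockAllocator(num_blocks=1024)
for pos in range(100):
    alloc.append_token("req-1", pos)
print(len(alloc.tables["req-1"]))   # 7
alloc.release("req-1")
```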