
Inference & Model Serving

How AI models run in production at scale.


Definition

Inference is the process of running a trained AI model to generate predictions or outputs — as opposed to training, which is the process of adjusting model weights. In LLMs, inference is autoregressive token generation: predicting one token at a time from left to right. Efficient inference is critical for cost and latency in production AI systems.

How LLM inference works

LLM inference has two distinct phases with very different computational characteristics — understanding them is key to optimizing serving cost and latency.

| Phase | What happens | Compute type | Bottleneck |
|---|---|---|---|
| Prefill | All input (prompt) tokens processed in a single parallel forward pass | Compute-bound: all tokens computed in parallel | FLOPS — add more GPUs to speed up |
| Decode | Generate output tokens one at a time, each requiring a full forward pass | Memory-bandwidth-bound: reads all KV cache per step | GPU memory bandwidth — hard to parallelize |

Autoregressive generation: at each step t, the model produces a hidden state h_t from all preceding tokens, projects it to vocabulary logits via W_U (the unembedding matrix), and samples the next token. With a KV cache, each step only computes the query, key, and value for the new token; the key-value pairs of past tokens are cached and reused.
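
The decode step can be sketched for a single attention head. This is a toy numpy illustration, not a real implementation (which batches this across heads and layers); all names here are hypothetical:

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, cache_k, cache_v):
    """One autoregressive decode step with a KV cache.

    x_new: (d,) hidden state of the newly generated token.
    cache_k / cache_v: (t, d) keys and values of all previous tokens.
    Only the new token's K and V are computed; the rest are reused.
    """
    q = x_new @ W_q                          # (d,) query for the new token
    k = x_new @ W_k                          # (d,) only the NEW key
    v = x_new @ W_v                          # (d,) only the NEW value
    cache_k = np.vstack([cache_k, k])        # (t+1, d)
    cache_v = np.vstack([cache_v, v])        # (t+1, d)
    scores = cache_k @ q / np.sqrt(len(q))   # attend over all cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over t+1 positions
    out = weights @ cache_v                  # (d,) attention output
    return out, cache_k, cache_v

# Demo: 3 cached tokens, hidden size 4
rng = np.random.default_rng(0)
d = 4
W_q, W_k, W_v = rng.standard_normal((3, d, d))
cache_k = rng.standard_normal((3, d))
cache_v = rng.standard_normal((3, d))
out, cache_k, cache_v = decode_step(rng.standard_normal(d), W_q, W_k, W_v,
                                    cache_k, cache_v)
print(out.shape, cache_k.shape)   # (4,) (4, 4)
```

The point of the cache is visible in the shapes: the step computes one new K and V row but attends over all t+1 cached rows, which is exactly the memory traffic the decode phase pays per token.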

Why decode is memory-bandwidth-bound

During decode, generating each token requires reading all model weights (~140GB for a 70B model in FP16) from GPU HBM into on-chip SRAM — but performs only a tiny amount of compute (one token's worth). An H100 SXM has 3.35TB/s of memory bandwidth but roughly 1,000 TFLOPS of dense FP16 compute. For a single-token forward pass, the model is therefore memory-bound by roughly 300×. This is why batching decode requests dramatically improves GPU utilization — the cost of reading the weights is amortized across many concurrent tokens.
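
A back-of-envelope roofline makes this concrete. The figures below are illustrative (70B FP16 model, H100 SXM datasheet-level numbers), not measurements:

```python
# Roofline sketch for single-token decode on one GPU.
PARAMS = 70e9                       # 70B parameters
BYTES_PER_PARAM = 2                 # FP16
HBM_BANDWIDTH = 3.35e12             # bytes/s (H100 SXM)
PEAK_FLOPS = 989e12                 # dense FP16 FLOP/s (H100 SXM)

weight_bytes = PARAMS * BYTES_PER_PARAM      # ~140 GB streamed per token
flops_per_token = 2 * PARAMS                 # ~2 FLOPs per weight

t_memory = weight_bytes / HBM_BANDWIDTH      # time to stream the weights
t_compute = flops_per_token / PEAK_FLOPS     # time to do the math

print(f"memory-bound by {t_memory / t_compute:.0f}x")          # memory-bound by 295x
print(f"max decode rate: {1 / t_memory:.0f} tok/s at batch=1")  # max decode rate: 24 tok/s at batch=1
```

Batching raises `flops_per_token` linearly while `weight_bytes` stays fixed, which is exactly why batched decode recovers GPU utilization.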

KV cache and memory management

The KV cache is the single most important memory structure in LLM inference — understanding it explains why long contexts are expensive and why batching is complex.

KV cache memory formula: 2 (K and V) × layers × KV heads × head_dim × sequence_length (T) × batch_size (B) × bytes per element. For LLaMA 3 70B (80 layers, 8 KV heads, head_dim 128, FP16): 2 × 80 × 8 × 128 × 2 bytes ≈ 0.33MB per token per batch element. At T=8K, B=32: ~86GB just for the KV cache.
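
The formula is simple enough to code directly (model dimensions below are the LLaMA 3 70B figures from the text):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   dtype_bytes=2):
    """KV cache size = 2 (K and V) x layers x KV heads x head_dim
    x sequence length x batch size x bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# LLaMA 3 70B: 80 layers, 8 KV heads (GQA), head_dim 128, FP16
per_token = kv_cache_bytes(80, 8, 128, seq_len=1, batch=1)
total = kv_cache_bytes(80, 8, 128, seq_len=8192, batch=32)
print(per_token / 1e6)   # 0.32768  -> ~0.33 MB per token per batch element
print(total / 1e9)       # ~85.9 GB at T=8K, B=32
```

Note the 8 in `n_kv_heads`: with standard multi-head attention (64 query heads) the cache would be 8× larger, which is the GQA saving quantified in the table below.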

| Technique | What it does | Memory savings |
|---|---|---|
| KV cache | Cache past key-value pairs; reuse them in each decode step | O(n²) → O(n) compute; enables long contexts |
| PagedAttention (vLLM) | Pages the KV cache like OS virtual memory; shares pages across requests | 20–40% higher GPU utilization; no memory fragmentation |
| Quantized KV cache | Store the KV cache in INT8 or FP8 instead of FP16 | 2–4× memory reduction with minimal quality loss |
| Sliding window attention | Only keep KV cache for the last W tokens (e.g., W=4096) | O(1) memory instead of O(n) — at the cost of long-range attention |
| Multi-Query / Grouped-Query Attention (MQA/GQA) | Share K and V heads across groups of query heads | 8–32× smaller KV cache (the GQA variant is used in LLaMA 3 and Mistral) |

Inference optimization techniques

| Technique | Mechanism | Speedup | Tradeoff |
|---|---|---|---|
| Speculative decoding | Draft model generates k tokens fast; main model verifies all k in one parallel pass | 2–3× | Requires a well-matched draft model; benefit varies with acceptance rate |
| Continuous batching | Process tokens from multiple requests in the same batch; replace finished sequences immediately | 5–10× throughput | Higher latency for individual requests |
| FlashAttention 2/3 | Fused attention kernel keeps Q, K, V in fast SRAM; avoids HBM round-trips | 2–4× attention speed; 5–20× less attention memory | GPU-specific kernels (CUDA; ROCm ports exist) |
| Tensor parallelism | Split attention heads or FFN dimensions across GPUs; all-reduce each layer | Near-linear with # GPUs | Communication overhead; needs fast interconnect (NVLink) |
| Pipeline parallelism | Different model layers on different GPUs; micro-batching to hide bubbles | Near-linear with # GPUs | Micro-batch latency; bubble overhead |
| AWQ / GPTQ quantization | Quantize weights to INT4/INT8; reduce the memory-bandwidth bottleneck | 1.5–4× throughput | Slight quality loss; calibration required |
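
The gap between static and continuous batching can be shown with a toy scheduler model. This is an idealized sketch: the continuous figure is a lower bound that ignores scheduling overhead and prefill cost:

```python
def static_batch_steps(lengths, batch_size):
    """Static batching: each batch runs until its LONGEST request finishes,
    so short requests leave their slot idle."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching (idealized): a finished request's slot is refilled
    immediately, so total steps approach ceil(total_tokens / batch_size)."""
    total_tokens = sum(lengths)
    return -(-total_tokens // batch_size)  # ceiling division

lengths = [10, 1000, 20, 1000, 15, 1000]   # mixed short/long requests
print(static_batch_steps(lengths, 2))       # 3000: short requests idle their slot
print(continuous_batch_steps(lengths, 2))   # 1523: slots stay busy
```

With mixed request lengths the static scheduler pays for the longest request in every batch; refilling slots per token step is where the 5–10× throughput figure in the table comes from.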

Speculative decoding in depth

Speculative decoding works because: (1) generating draft tokens with a small model (e.g., 3B) is much faster than with the main model (70B), and (2) verifying k tokens in parallel with the main model is no slower than generating 1 token — the forward pass is the same shape. If the draft model has ~80% token acceptance rate, you get ~3× speedup with an output distribution identical to the main model's. Claude uses speculative decoding in production; Medusa (self-speculative decoding with multiple heads) avoids needing a separate draft model.
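
The expected gain follows from the standard speculative-decoding analysis: if each draft token is accepted with probability a, one verify pass over k draft tokens emits (1 − a^(k+1)) / (1 − a) tokens on average (a geometric series; draft-model cost is ignored in this sketch):

```python
def expected_tokens_per_verify(accept_rate, k):
    """Expected tokens emitted per target-model forward pass, assuming the
    draft proposes k tokens and each is accepted independently with
    probability accept_rate. Draft-model compute is not counted here."""
    a = accept_rate
    return (1 - a ** (k + 1)) / (1 - a)

# ~80% acceptance with 4 draft tokens per round:
print(round(expected_tokens_per_verify(0.8, 4), 2))  # 3.36
```

This matches the "~80% acceptance → ~3× speedup" rule of thumb above; in practice the draft model's own latency and per-position acceptance variation shave some of that off.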

LLM inference infrastructure in 2025

| Framework / Service | Type | Best for | Key feature |
|---|---|---|---|
| vLLM | Open-source server | Production throughput-optimized serving | PagedAttention, continuous batching, multi-LoRA |
| Ollama | Open-source local | Local dev, single-machine serving | One-command model download + serve; GGUF support |
| llama.cpp | Open-source library | CPU inference, low-VRAM GPU, edge deployment | Quantized GGUF; CPU+GPU split; runs on MacBooks |
| TensorRT-LLM | NVIDIA framework | Maximum performance on NVIDIA GPUs | FP8, kernel fusion, speculative decoding; H100-optimized |
| SGLang | Open-source server | Structured generation, complex multi-call workflows | RadixAttention (KV cache sharing across similar prefixes) |
| Groq LPU | Cloud inference | Fastest token generation speed | Custom LPU chip: 500+ tokens/sec on 70B models; not the cheapest |
| Together AI / Fireworks | Managed API | Cheap open-source model inference | Per-token pricing, open-source model access |
| AWS Bedrock / Vertex AI | Enterprise managed | Enterprise compliance + multi-provider access | SLA, VPC, audit logging, fine-tune hosting |

Cost benchmark (early 2025)

GPT-4o: ~$10–15/M output tokens. Claude 3.5 Sonnet: ~$15/M. Llama 3.1 70B via Together AI: ~$0.88/M. Self-hosted Llama 3.1 70B on vLLM (4× A100 80GB): ~$0.20/M at full utilization. The 50–75× cost gap between frontier closed models and self-hosted open-source explains why companies with high token volumes increasingly fine-tune open-source models for production.
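
The self-hosted figure is simple arithmetic over hourly GPU cost and aggregate throughput. The inputs below are illustrative assumptions chosen to reproduce the ~$0.20/M figure, not vendor quotes:

```python
def cost_per_million_tokens(gpu_cluster_usd_per_hour, throughput_tok_s):
    """Serving cost per million output tokens at full utilization."""
    tokens_per_hour = throughput_tok_s * 3600
    return gpu_cluster_usd_per_hour / tokens_per_hour * 1e6

# Assumed: 4x A100 80GB at ~$8/hour total, ~11,000 tok/s aggregate
# throughput at high batch sizes.
print(round(cost_per_million_tokens(8.0, 11_000), 2))  # 0.2
```

The formula also shows why utilization dominates self-hosted economics: at 25% utilization the same cluster costs 4× as much per token, which erodes most of the gap to managed APIs.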

Latency vs throughput tradeoffs

Latency and throughput are fundamentally in tension for LLM serving — optimizing one hurts the other. Choosing the right operating point depends on your use case.

| Metric | Definition | Typical target | Critical for |
|---|---|---|---|
| TTFT (Time To First Token) | Time from request sent to first token received | <500ms for interactive use | Chatbots, coding assistants — perceived responsiveness |
| TPOT (Time Per Output Token) | Average time between consecutive output tokens | <50ms (~20 tok/s) | Streaming readability — faster than human reading speed |
| End-to-end latency | Total time from request to complete response | <5s for short responses | Non-streaming batch use cases |
| Throughput (tokens/sec) | Total tokens generated per second across all requests | Maximize for batch workloads | Document processing, offline summarization pipelines |
| Requests per second (RPS) | Concurrent requests served | Varies by batch size | API scaling, cost efficiency |
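
These metrics compose: end-to-end latency is TTFT plus TPOT for each remaining output token (a sketch with example numbers):

```python
def end_to_end_latency(ttft_s, tpot_s, n_output_tokens):
    """Total response time: the first token costs TTFT (queueing + prefill),
    each of the remaining tokens costs one TPOT interval."""
    return ttft_s + tpot_s * (n_output_tokens - 1)

# 300ms TTFT, 50ms/token (20 tok/s), 100-token answer:
print(round(end_to_end_latency(0.3, 0.05, 100), 2))  # 5.25
```

Note how quickly TPOT dominates: even with a fast 300ms TTFT, a 100-token answer takes over 5 seconds to complete, which is why streaming matters for perceived latency.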

Streaming and perceived latency

Streaming (Server-Sent Events, SSE) returns tokens as they are generated — the user sees text appearing word-by-word rather than waiting for the full response. This dramatically improves perceived responsiveness even if total generation time is identical. A response that takes 5s to complete feels fast if you see the first tokens in 200ms. All major LLM APIs (OpenAI, Anthropic, Groq) support streaming; always use it for interactive applications.
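A minimal parser shows the SSE wire format. The `data:` prefix and `[DONE]` sentinel follow the OpenAI-style convention; the `token` JSON field below is a simplified stand-in for real provider schemas, and production code should use the provider's SDK:

```python
import json

def iter_sse_tokens(lines):
    """Yield token text from an iterable of raw SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                      # skip blank keep-alives / comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return                        # end-of-stream sentinel
        chunk = json.loads(payload)
        yield chunk["token"]              # simplified; field name varies by API

stream = [
    'data: {"token": "Hello"}',
    'data: {"token": ", world"}',
    "data: [DONE]",
]
print("".join(iter_sse_tokens(stream)))   # Hello, world
```

Because the parser yields each token as it arrives, a UI consuming this generator can render text incrementally instead of blocking on the full response.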

Practice questions

  1. What is the difference between throughput and latency in LLM inference, and why is there a fundamental trade-off? (Answer: Latency: time from request to first token (TTFT) + time to generate full response. User-facing — measures how fast responses feel. Throughput: total tokens generated per second across all concurrent users. Server-side — measures capacity. Trade-off: batching requests improves throughput (processes many tokens simultaneously) but increases latency for individual users (must wait for batch). At batch_size=1: minimum latency. At batch_size=256: maximum throughput. Production serving optimises the Pareto frontier between these, using continuous batching to approach both simultaneously.)
  2. What is continuous batching (iteration-level scheduling) and why did it transform LLM serving? (Answer: Traditional static batching: group N requests together, wait until ALL finish generating, then start next batch. If request 1 finishes in 10 tokens and request 2 in 1000 tokens, request 1's GPU slot sits idle for 990 token-steps. Continuous batching (Orca, vLLM): as soon as a request finishes, its slot is immediately replaced with a new request. The batch changes composition at every token generation step. Result: GPU utilisation goes from ~20% (static) to ~80%+ (continuous). vLLM pioneered this; it is now the standard in all production LLM serving systems.)
  3. What is TTFT (Time to First Token) and why is it more important than total generation time for user experience? (Answer: TTFT: elapsed time from request submission until the first output token is generated. Covers: network latency + prompt processing (prefill) + scheduling queue wait. User experience: TTFT determines how quickly the UI can show 'something is happening.' A response that streams from token 1 in 500ms feels faster than a response that starts in 2000ms — even if both complete in 5 seconds. This is why streaming is universal in production LLM APIs: show the first token immediately rather than waiting for completion.)
  4. What hardware is used for LLM inference and what determines model serving cost? (Answer: Primary hardware: NVIDIA H100 (80GB, $30K), H200 (141GB, $40K), A100 (80GB, $10K). AMD MI300X: competitive, gaining traction. Google TPUv5: used for internal Google serving. Cost drivers: (1) GPU VRAM (must hold model weights + KV cache). (2) GPU compute (tokens/sec per GPU). (3) Memory bandwidth (memory-bandwidth-bound decoding phase). Pricing: H100 SXM: $2–3/GPU-hour on cloud. Serving a 70B model: ~4 H100s needed, ~$8-12/hour, ~100 tokens/second → $0.023–0.033/1K output tokens (similar to commercial API pricing).)
  5. What is PagedAttention (used in vLLM) and how does it reduce memory waste in LLM serving? (Answer: Standard KV cache: pre-allocated contiguously for max_sequence_length. For 2048-token max: 2048 positions reserved even if request only generates 100 tokens → 95% waste. PagedAttention (Kwon et al. 2023): divides KV cache into fixed-size pages (blocks), allocating pages on demand like virtual memory. Non-contiguous pages are accessed via a block table. Result: near-zero internal fragmentation, memory utilisation from 20–30% to 90%+, supports 2–4× more concurrent requests on same hardware. PagedAttention is the core innovation of vLLM.)
