A context window is the maximum number of tokens an AI model can process in a single request — including both input (your messages, files, instructions) and output (the AI's response). Everything outside the context window is invisible to the model. Modern context windows range from 32,000 tokens (Mistral 7B) to 1,000,000 tokens (Gemini 1.5 Pro).
What goes into the context window
The context window is a single flat sequence of tokens — not separate memory buckets. Everything competes for the same space:
- System prompt — platform instructions, persona, injected RAG context
- Conversation history — every prior user message and assistant response
- Uploaded content — PDFs, code, data (inserted as text tokens)
- Current response — tokens being generated consume context as they're produced
Estimating context usage before sending a request
```python
import tiktoken

# cl100k_base is the GPT-4 tokenizer; Claude and Gemini tokenize somewhat
# differently, but the counts are close enough for planning purposes.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

system_prompt = "You are a study assistant..."      # ~500 tokens
conversation = "User: explain X\nAssistant: ..."    # ~2,000 tokens
uploaded_pdf = open("chapter3.txt").read()          # might be ~8,000 tokens
current_query = "Now summarize pages 30-45"         # ~10 tokens

total_input = (count_tokens(system_prompt)
               + count_tokens(conversation)
               + count_tokens(uploaded_pdf)
               + count_tokens(current_query))

max_context = 128_000  # GPT-4o context window
remaining = max_context - total_input

print(f"Input tokens used: {total_input:,}")
print(f"Remaining for response: {remaining:,}")
print(f"Context used: {total_input/max_context:.1%}")
```
Rule of thumb
1 token ≈ 0.75 words ≈ 4 characters in English. A standard A4 page of text ≈ 400 tokens. A 200-page textbook ≈ 80,000–120,000 tokens. GPT-4o's 128K context fits roughly a full novel.
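The rule of thumb above can be turned into a quick estimator when no tokenizer is installed. This is a minimal sketch using only the 4-characters and 0.75-words heuristics from this section; real tokenizers vary by language and content, so treat the result as a planning figure only.

```python
# Heuristic token estimate without a tokenizer, based on the rule of thumb:
# 1 token ~ 4 characters ~ 0.75 words in English. Rough planning only.

def rough_token_count(text: str) -> int:
    """Average the character-based and word-based estimates."""
    by_chars = len(text) / 4             # 1 token ~ 4 characters
    by_words = len(text.split()) / 0.75  # 1 token ~ 0.75 words
    return round((by_chars + by_words) / 2)

page = "word " * 300  # roughly one page worth of words
print(rough_token_count(page))  # a few hundred tokens, close to the ~400/page figure
```

Both heuristics are averaged because character counts overestimate for prose with long words, while word counts overestimate for code and punctuation-heavy text.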
Context windows of popular models (2025)
| Model | Context window | ≈ Words | ≈ Pages | Provider |
|---|---|---|---|---|
| Claude 3.7 Sonnet | 200,000 tokens | 150,000 | ~500 | Anthropic |
| Gemini 1.5 Pro | 1,000,000 tokens | 750,000 | ~2,500 | Google |
| GPT-4o | 128,000 tokens | 96,000 | ~320 | OpenAI |
| DeepSeek V3 | 128,000 tokens | 96,000 | ~320 | DeepSeek |
| Llama 3.1 70B | 128,000 tokens | 96,000 | ~320 | Meta (open) |
| Mistral Large | 128,000 tokens | 96,000 | ~320 | Mistral AI |
Larger context windows expand what's possible — loading an entire codebase, a book, a legal document — but they come at a cost: attention computation is O(n²) in sequence length, so longer contexts are proportionally more expensive to process.
Attention is quadratic in sequence length n and linear in model dimension d. Doubling context length quadruples attention compute.
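The quadratic scaling can be made concrete with a toy FLOP count. This sketch counts only the two n²·d matrix products in attention (scores and weighted values), ignoring projections and softmax, which is enough to show the scaling behavior.

```python
# Attention compute is O(n^2 * d): every token attends to every other token.
# This toy count shows compute quadrupling when context length doubles.

def attention_flops(n: int, d: int) -> int:
    # QK^T scores (n^2 * d multiply-adds) plus the weighted sum over V
    # (another n^2 * d). Projections and softmax are omitted: scaling sketch only.
    return 2 * n * n * d

base = attention_flops(8_192, 128)
doubled = attention_flops(16_384, 128)
print(doubled / base)  # → 4.0
```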
What happens when you exceed the limit
When a conversation exceeds the context window, content must be dropped — typically the oldest messages first (FIFO truncation). The model silently loses access to earlier context, which can cause contradictions, repetition, or loss of task continuity. Several strategies keep long-running tasks within the limit:
- RAG (Retrieval-Augmented Generation) — retrieve only the most relevant chunks at query time instead of loading everything. Stays well within context limits regardless of total document size.
- Summarization — compress older conversation turns into a running summary, preserving the gist without every token.
- Context distillation — use a smaller model to compress context before passing to the main model.
- Larger context model — upgrade to Gemini 1.5 Pro (1M tokens) or Claude's 200K for long-document use cases.
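The summarization strategy above can be sketched as a small history-compaction step. Here `summarize` is a hypothetical stand-in for a cheap LLM call; to keep the example self-contained and runnable it just truncates its input.

```python
# Sketch of the running-summary strategy: fold older turns into one summary
# message, keep recent turns verbatim. `summarize` stands in for a real
# LLM call (hypothetical); here it simply truncates.

def summarize(text: str, max_chars: int = 200) -> str:
    return text[:max_chars]  # a real system would call a cheap model here

def compact_history(messages: list[str], keep_recent: int = 4) -> list[str]:
    """Compress all but the last `keep_recent` turns into a single summary."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(" | ".join(old))
    return [f"[summary of earlier turns] {summary}"] + recent

history = [f"turn {i}" for i in range(10)]
print(len(compact_history(history)))  # → 5 (1 summary + 4 recent turns)
```

Keeping the most recent turns verbatim matters: they carry the active task state, while older turns usually only need their gist.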
Silent failure
Context overflow usually fails silently — no error is thrown. The model simply can't access the dropped content. Monitor token usage in production to detect truncation before it affects output quality.
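One way to monitor for this is a pre-flight budget check: estimate the token count before each request and truncate explicitly (with a flag you can log) instead of letting the provider drop content silently. A minimal sketch, using the 4-characters-per-token heuristic; swap in a real tokenizer for production.

```python
# Minimal truncation guard: estimate tokens before a request, drop oldest
# messages (FIFO) until the estimate fits, and report whether truncation
# happened so it can be logged or alerted on.

def est_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic: ~4 chars per token

def fit_to_budget(messages: list[str], budget: int) -> tuple[list[str], bool]:
    """Return (messages that fit, whether anything was dropped)."""
    kept = list(messages)
    truncated = False
    while kept and sum(est_tokens(m) for m in kept) > budget:
        kept.pop(0)  # drop the oldest turn first
        truncated = True
    return kept, truncated

msgs = ["x" * 400] * 10  # ~100 estimated tokens each
kept, truncated = fit_to_budget(msgs, budget=350)
print(len(kept), truncated)  # → 3 True
```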
The 'lost in the middle' problem
Larger context windows don't guarantee uniform retrieval quality across them. Liu et al. (2023) showed a striking primacy-recency bias in LLM attention:
Lost in the middle (Liu et al., 2023)
When relevant information is placed in the middle of a long context, LLMs perform significantly worse at retrieving it compared to information placed at the beginning or end. Performance can drop by 20+ percentage points between beginning/end placement vs middle placement.
This means 1M token context is not equivalent to 1M tokens of perfectly accessible memory. Practical implication: structure prompts so that the most critical information appears at the start or end — not buried in the middle. RAG helps by surfacing the most relevant chunks to prominent positions.
| Information placement | Retrieval accuracy | Recommendation |
|---|---|---|
| Beginning of context | High (primacy effect) | Put key instructions and critical facts here |
| End of context | High (recency effect) | Put the actual question/task here |
| Middle of context | Significantly degraded | Avoid placing critical information here |
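The placement advice in the table can be applied mechanically when assembling a RAG prompt: order retrieved chunks so the highest-scoring ones land at the edges of the context and the weakest ones fall in the degraded middle. A sketch (the function name and interleaving scheme are my own, not from the paper):

```python
# Order chunks so the best-ranked land at the beginning and end of the
# prompt, pushing the weakest into the middle where retrieval degrades.

def edge_order(chunks_by_score: list[str]) -> list[str]:
    """chunks_by_score: best first. Alternate chunks between front and back."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_score):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["best", "2nd", "3rd", "4th", "worst"]
print(edge_order(ranked))  # → ['best', '3rd', 'worst', '4th', '2nd']
```

The top chunk opens the context (primacy), the second-best closes it just before the question (recency), and the lowest-ranked chunk ends up in the middle.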
KV cache: how context windows are stored in GPU memory
During autoregressive generation, each Transformer layer computes key (K) and value (V) matrices for every input token. These must be cached so that generating token n+1 doesn't require recomputing all previous tokens from scratch.
KV cache grows linearly with sequence length T. For Llama 3.1 70B (80 layers, 8 KV heads via grouped-query attention, d_head = 128) in FP16: ~10 GB at 32K tokens, ~40 GB at 128K tokens.
Computing KV cache memory requirements for any model
```python
def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    d_head: int,
    seq_len: int,
    dtype_bytes: int = 2,  # 2 = FP16/BF16, 1 = INT8, 4 = FP32
    batch_size: int = 1,
) -> float:
    """Compute KV cache memory in gigabytes."""
    # 2 = K and V (both need to be cached)
    bytes_total = (2 * n_layers * n_kv_heads * d_head
                   * seq_len * batch_size * dtype_bytes)
    return bytes_total / (1024 ** 3)

# Llama 3.1 70B with 128K context (FP16; 8 KV heads thanks to GQA)
print(f"Llama 3.1 70B @ 128K ctx: {kv_cache_gb(80, 8, 128, 128_000):.1f} GB")
# → ~39 GB just for KV cache (on top of ~140 GB model weights in FP16)

# GPT-2 Small (tiny model, 1K native context) for comparison
print(f"GPT-2 Small @ 1K ctx: {kv_cache_gb(12, 12, 64, 1_024):.3f} GB")
# → ~0.035 GB — negligible
```
Why long-context serving is expensive
Serving 100 simultaneous users at 128K context on a 70B model requires roughly 4 TB of GPU memory for KV caches alone (100 × ~39 GB). This is why long-context inference costs significantly more than short-context inference, and why techniques like paged attention (vLLM) and grouped-query attention (GQA) are critical for production deployments.
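GQA's contribution is visible directly in the cache formula: KV memory scales with the number of KV heads, and Llama-class 70B models share 8 KV heads across 64 query heads. The sketch below redefines the cache formula from the previous block so it stands alone; the 64-KV-head configuration is a hypothetical "no GQA" baseline for comparison.

```python
# KV cache scales with the number of KV heads. Comparing a hypothetical
# full multi-head config (64 KV heads) against the actual GQA config
# (8 KV heads) shows the 8x memory reduction that makes serving viable.

def kv_cache_gb(n_layers, n_kv_heads, d_head, seq_len, dtype_bytes=2):
    # factor 2 = keys and values; result in GiB
    return 2 * n_layers * n_kv_heads * d_head * seq_len * dtype_bytes / 1024**3

mha = kv_cache_gb(80, 64, 128, 128_000)  # hypothetical: one KV head per query head
gqa = kv_cache_gb(80, 8, 128, 128_000)   # actual GQA configuration
print(f"{mha:.0f} GB vs {gqa:.0f} GB")  # → 312 GB vs 39 GB
```

Without GQA, a single 128K-context user would need more KV-cache memory than several H100s hold; with it, the cache fits alongside the weights.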
Practice questions
- What is the 'lost in the middle' problem and how should you structure prompts to avoid it? (Answer: Liu et al. (2023): LLMs retrieve information better from the beginning and end of the context window — information in the middle receives less attention. For a 10-document RAG prompt, the most relevant document should be first or last. Practical strategies: (1) Put critical instructions at the start AND repeat key constraints at the end. (2) For RAG: most relevant retrieved chunk first, then others, with the query last. (3) Use chunking strategies that keep related content contiguous rather than split across the context middle.)
- What is the difference between a model's training context length and its effective context length? (Answer: Training context length: the maximum sequence length used during training (where positional encodings are defined). Effective context length: how well the model actually uses information throughout the full window. Models often degrade in accuracy for facts placed deep in the middle of long contexts even within their stated window. Evaluation: NIAH (Needle in a Haystack) test — hide a fact at varying positions in a long document and measure retrieval accuracy. Many models claiming 128K context score poorly on NIAH beyond 32K.)
- Claude has a 200K context window. What is the approximate page count this can handle? (Answer: 1 page of text ≈ 250 words ≈ 325 tokens. 200,000 tokens / 325 ≈ 615 pages. In practice: a 400-page novel ≈ 130K tokens. A 200K context can hold: a complete novel + research paper + conversation history simultaneously. API pricing: input tokens at ~$3/million means a 200K context costs ~$0.60 per call. For document Q&A workflows: feed the entire document in context rather than chunking, but be aware of the 'lost in the middle' degradation for very long inputs.)
- What is KV cache compression and how does it extend effective context length? (Answer: KV cache stores all past token keys and values — grows linearly with context. At 128K context with Llama 3.1 70B: ~39 GB just for KV cache (see KV Cache glossary entry). Compression techniques: (1) Quantised KV cache (INT8/INT4): 2–4× memory reduction with <1% perplexity loss. (2) KV eviction (H2O, StreamingLLM): drop low-importance KV entries for old tokens. (3) KV merging: merge similar KV vectors. Trade-off: compression reduces memory but may cause forgetting of evicted context.)
- When is a smaller context window actually better for users than a large one? (Answer: Smaller context forces better information architecture: (1) Retrieval discipline — small windows require RAG, which pulls only relevant information rather than dumping everything. (2) Lower latency — smaller inputs = faster first token. (3) Lower cost — input tokens are billed; small context = cheaper per query. (4) Focus — models sometimes perform better on focused contexts than massive diluted ones. Best practice: use a context window appropriately sized for the task, not the maximum available.)
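The NIAH evaluation mentioned in the answers above is simple to set up: hide a known fact at varying fractional depths in filler text and query the model at each depth. A sketch of the harness; `ask_model` in the comment is a hypothetical stand-in for a real API call, so the runnable part only builds the prompts.

```python
# Needle-in-a-Haystack probe sketch: insert a known fact ("needle") at a
# fractional depth in filler text, then (with a real model) ask for it
# at each depth and plot accuracy vs depth.

def build_haystack(filler: str, needle: str, depth: float) -> str:
    """Insert `needle` at fractional `depth` (0.0 = start, 1.0 = end)."""
    cut = int(len(filler) * depth)
    return filler[:cut] + needle + filler[cut:]

filler = "lorem ipsum dolor sit amet " * 500
needle = " The secret code is 7421. "
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack(filler, needle, depth)
    # accuracy[depth] = ask_model(prompt + "\nWhat is the secret code?")
    print(depth, needle in prompt)  # needle is present at every depth
```

A model with uniform effective context scores flat across depths; the "lost in the middle" pattern shows up as a dip at depths around 0.4–0.6.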
Practical guide: which model for which document size
Context window size should be one of the first factors when choosing a model for a document task. Here's a practical matching guide for the most common real-world document types.
| Document type | Typical token count | Recommended model | Strategy | Notes |
|---|---|---|---|---|
| Single email / short message | <1,000 tokens | Any model — Haiku/GPT-mini for cost | Direct in-context | No context planning needed; any model works |
| Research paper / article | 3,000–10,000 tokens | GPT-4o, Claude Sonnet, Gemini Flash | Direct in-context | Fits comfortably in any 32K+ model; no RAG needed |
| Technical specification / long report | 10,000–30,000 tokens | GPT-4o (128K), Claude Sonnet (200K) | Direct in-context | Consider RAG if you ask specific questions rather than full analysis |
| Full textbook chapter | 30,000–80,000 tokens | Claude Sonnet (200K), GPT-4o (128K) | Direct in-context OR RAG with page-range pinning | Watch for "lost in the middle" degradation — most relevant section first |
| Complete textbook (200 pages) | 60,000–130,000 tokens | Claude Sonnet (200K) or Gemini 1.5 Pro (1M) | RAG recommended for specific questions; full context for overview | Full-context single call: ~$0.40–$0.60 with Claude Sonnet |
| Large codebase (50+ files) | 100,000–500,000 tokens | Gemini 1.5/2.0 Pro (1M) | Agentic retrieval (read-relevant-files) OR Gemini 1M context | Claude Code / Cursor handle this via selective file loading rather than full context |
| Legal document corpus | 500,000–5,000,000 tokens | RAG over pgvector/Pinecone + any 128K+ model | RAG only — no single-context model handles this well | GraphRAG for entity-relationship queries; standard RAG for factual lookup |
| Entire novel (400 pages) | ~130,000 tokens | Claude Sonnet (200K) or Gemini Pro (1M) | Direct in-context for analysis; RAG for specific quotes | Gemini 1.5 Pro is the best current choice for full-book reasoning tasks |
Estimating document token count before choosing your model
```python
import tiktoken

# tiktoken works for GPT-4o and approximates Claude/Gemini tokenization
# (Claude tokenizes ~5-10% differently but it's close enough for planning)
enc = tiktoken.encoding_for_model("gpt-4o")

def estimate_tokens(text: str) -> int:
    return len(enc.encode(text))

def recommend_model(text: str) -> dict:
    """Given a document, recommend the right model and strategy."""
    tokens = estimate_tokens(text)
    approx_pages = tokens / 325  # ~325 tokens per page of text
    if tokens < 10_000:
        model = "gpt-4o-mini or claude-haiku-4-5-20251001"
        strategy = "Direct in-context — fits easily"
        cost_estimate = f"~${tokens * 0.15 / 1_000_000:.4f} input"
    elif tokens < 60_000:
        model = "gpt-4o (128K) or claude-sonnet-4-6 (200K)"
        strategy = "Direct in-context — still fits well"
        cost_estimate = f"~${tokens * 2.50 / 1_000_000:.4f} input"
    elif tokens < 128_000:
        model = "claude-sonnet-4-6 (200K) or gemini-1.5-pro (1M)"
        strategy = "Direct in-context, but consider RAG for specific Q&A"
        cost_estimate = f"~${tokens * 3.00 / 1_000_000:.4f} input (Claude)"
    elif tokens < 500_000:
        model = "gemini-1.5-pro or gemini-2.0-pro (1M context)"
        strategy = "Gemini long-context OR RAG pipeline recommended"
        cost_estimate = f"~${tokens * 1.25 / 1_000_000:.4f} input (Gemini 1.5)"
    else:
        model = "RAG pipeline (any 128K+ model)"
        strategy = "RAG required — no single-context model handles this"
        cost_estimate = "RAG indexing cost: ~$0.002 per 1K tokens (one-time)"
    return {
        "token_count": f"{tokens:,}",
        "approximate_pages": f"~{approx_pages:.0f} pages",
        "recommended_model": model,
        "strategy": strategy,
        "estimated_input_cost": cost_estimate,
    }

# Example usage
with open("my_document.txt", "r") as f:
    doc = f.read()

recommendation = recommend_model(doc)
for key, value in recommendation.items():
    print(f"{key}: {value}")
```
Context window cost reality check
Large context windows are a feature but also a billing line item. Sending a 200K-token document to Claude Sonnet costs roughly $0.60 for the input alone — on every single API call. For a document Q&A app that processes the same document 1,000 times a day, that's $600/day just in input tokens. RAG typically cuts effective token usage by well over 90% by retrieving only relevant chunks: the same 1,000 queries with 3K tokens of retrieved context each cost about $9/day at the same rate — roughly 65× cheaper. Use full-context models for analysis tasks; use RAG for repetitive Q&A over the same documents.
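The arithmetic behind that comparison, made explicit. The price is the ~$3 per million input tokens figure used in this section; it is illustrative, not a current rate card.

```python
# Daily input-token cost for a document Q&A workload, comparing
# full-context calls against RAG with a small retrieved context.

PRICE_PER_M_INPUT = 3.00  # USD per million input tokens (illustrative)

def daily_input_cost(tokens_per_call: int, calls_per_day: int) -> float:
    return tokens_per_call * calls_per_day * PRICE_PER_M_INPUT / 1_000_000

full_context = daily_input_cost(200_000, 1_000)  # whole document on every call
rag = daily_input_cost(3_000, 1_000)             # ~3K retrieved tokens per call
print(f"${full_context:.2f}/day vs ${rag:.2f}/day")  # → $600.00/day vs $9.00/day
```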
On LumiChats
LumiChats supports a 5-million token effective context across sessions with its RAG pipeline. Documents too large for a single context window are chunked and retrieved semantically — the AI always receives the most relevant content, even for very long documents.