
Context Window

How much an AI can 'see' at once.


Definition

A context window is the maximum number of tokens an AI model can process in a single request — including both input (your messages, files, instructions) and output (the AI's response). Everything outside the context window is invisible to the model. Modern context windows range from 32,000 tokens (Mistral 7B) to 1,000,000 tokens (Gemini 1.5 Pro).

What goes into the context window

The context window is a single flat sequence of tokens — not separate memory buckets. Everything competes for the same space:

  • System prompt — platform instructions, persona, injected RAG context
  • Conversation history — every prior user message and assistant response
  • Uploaded content — PDFs, code, data (inserted as text tokens)
  • Current response — tokens being generated consume context as they're produced

Estimating context usage before sending a request

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # GPT-4 tokenizer; a rough proxy for other models' tokenizers

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

system_prompt   = "You are a study assistant..."          # 500 tokens
conversation    = "User: explain X\nAssistant: ..."       # 2,000 tokens
uploaded_pdf    = open("chapter3.txt").read()             # might be 8,000 tokens
current_query   = "Now summarize pages 30-45"            # 10 tokens

total_input = (count_tokens(system_prompt)
             + count_tokens(conversation)
             + count_tokens(uploaded_pdf)
             + count_tokens(current_query))

max_context = 128_000   # GPT-4o context
remaining   = max_context - total_input

print(f"Input tokens used: {total_input:,}")
print(f"Remaining for response: {remaining:,}")
print(f"Context used: {total_input/max_context:.1%}")

Rule of thumb

1 token ≈ 0.75 words ≈ 4 characters in English. A standard A4 page of text ≈ 400 tokens. A 200-page textbook ≈ 80,000–120,000 tokens. GPT-4o's 128K context fits roughly a full novel.
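These ratios can be folded into a quick estimator for when a real tokenizer isn't available (a rough sketch; the constants come from the rule of thumb above, so treat results as estimates, not exact counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token in English."""
    return max(1, round(len(text) / 4))

def estimate_tokens_by_words(text: str) -> int:
    """Alternative estimate: ~0.75 words per token."""
    return max(1, round(len(text.split()) / 0.75))

page = "word " * 300                     # roughly one page of text
print(estimate_tokens(page))             # chars / 4  → 375
print(estimate_tokens_by_words(page))    # words / 0.75 → 400
```

The two heuristics disagree by a few percent, which is typical; use a real tokenizer (as in the tiktoken example above) whenever exact budgeting matters.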

Context windows of popular models (2025)

  Model               Context window     ≈ Words   ≈ Pages   Provider
  Claude 3.7 Sonnet   200,000 tokens     150,000   ~500      Anthropic
  Gemini 1.5 Pro      1,000,000 tokens   750,000   ~2,500    Google
  GPT-4o              128,000 tokens     96,000    ~320      OpenAI
  DeepSeek V3         128,000 tokens     96,000    ~320      DeepSeek
  LLaMA 3 70B         128,000 tokens     96,000    ~320      Meta (open)
  Mistral Large       128,000 tokens     96,000    ~320      Mistral AI

Larger context windows expand what's possible — loading an entire codebase, a book, a legal document — but they come at a cost: attention computation is O(n²) in sequence length, so longer contexts are disproportionately more expensive to process.

Attention is quadratic in sequence length n and linear in model dimension d. Doubling context length quadruples attention compute.
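That scaling claim can be checked numerically. A minimal sketch using the common approximation of about 2·n²·d FLOPs for the two big matrix products (QKᵀ and AV) in one attention layer, ignoring the linear projections:

```python
def attention_flops(n: int, d: int) -> int:
    """Approximate FLOPs for one attention layer: QK^T plus AV, ~2 * n^2 * d."""
    return 2 * n * n * d

base    = attention_flops(32_000, 128)   # 32K-token context
doubled = attention_flops(64_000, 128)   # 64K-token context

print(doubled / base)   # → 4.0: doubling n quadruples attention compute
```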

What happens when you exceed the limit

When a conversation exceeds the context window, content must be dropped — typically the oldest messages first (FIFO truncation). The model silently loses access to earlier context, which can cause contradictions, repetition, or loss of task continuity. Common strategies for staying within the limit:

  • RAG (Retrieval-Augmented Generation) — retrieve only the most relevant chunks at query time instead of loading everything. Stays well within context limits regardless of total document size.
  • Summarization — compress older conversation turns into a running summary, preserving the gist without every token.
  • Context distillation — use a smaller model to compress context before passing to the main model.
  • Larger context model — upgrade to Gemini 1.5 Pro (1M tokens) or Claude's 200K for long-document use cases.

Silent failure

Context overflow usually fails silently — no error is thrown. The model simply can't access the dropped content. Monitor token usage in production to detect truncation before it affects output quality.

The 'lost in the middle' problem

Larger context windows don't guarantee uniform retrieval quality across them. Liu et al. (2023) showed a striking primacy-recency bias in LLM attention:

Lost in the middle (Liu et al., 2023)

When relevant information is placed in the middle of a long context, LLMs perform significantly worse at retrieving it compared to information placed at the beginning or end. Performance can drop by 20+ percentage points between beginning/end placement vs middle placement.

This means 1M token context is not equivalent to 1M tokens of perfectly accessible memory. Practical implication: structure prompts so that the most critical information appears at the start or end — not buried in the middle. RAG helps by surfacing the most relevant chunks to prominent positions.

  Information placement   Retrieval accuracy       Recommendation
  Beginning of context    High (primacy effect)    Put key instructions and critical facts here
  End of context          High (recency effect)    Put the actual question/task here
  Middle of context       Significantly degraded   Avoid placing critical information here
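Following that guidance, prompt assembly can be as simple as fixing the order of the pieces: instructions first, bulk documents in the middle, the task last. A sketch with a hypothetical helper (the names and format are illustrative, not any provider's API):

```python
def assemble_prompt(instructions: str, documents: list[str], task: str) -> str:
    """Place instructions at the start and the task at the end,
    where retrieval accuracy is highest; bulk content sits in the middle."""
    middle = "\n\n".join(documents)
    return f"{instructions}\n\n{middle}\n\n{task}"

prompt = assemble_prompt(
    "Answer using only the documents below.",
    ["Doc A: ...", "Doc B: ..."],
    "Question: what does Doc A claim?",
)
print(prompt.splitlines()[0])    # instructions lead the context
print(prompt.splitlines()[-1])   # the task closes it
```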

KV cache: how context windows are stored in GPU memory

During autoregressive generation, each Transformer layer computes key (K) and value (V) matrices for every input token. These must be cached so that generating token n+1 doesn't require recomputing all previous tokens from scratch.

The KV cache grows linearly with sequence length. For LLaMA 3 70B (80 layers, 8 KV heads via grouped-query attention, d_head = 128) in FP16: ~10 GB for 32K tokens, ~39 GB for 128K tokens.

Computing KV cache memory requirements for any model

def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    d_head: int,
    seq_len: int,
    dtype_bytes: int = 2,   # 2 = FP16/BF16, 1 = INT8, 4 = FP32
    batch_size: int = 1
) -> float:
    """Compute KV cache memory in gigabytes."""
    # 2 = K and V (both need to be cached)
    bytes_total = (2 * n_layers * n_kv_heads * d_head
                   * seq_len * batch_size * dtype_bytes)
    return bytes_total / (1024 ** 3)

# LLaMA 3 70B with 128K context (FP16; 8 KV heads via GQA)
print(f"LLaMA 3 70B @ 128K ctx: {kv_cache_gb(80, 8, 128, 128_000):.1f} GB")
# → ~39 GB just for KV cache (on top of ~140 GB model weights in FP16)

# GPT-2 Small (tiny model, native 1,024-token context) for comparison
print(f"GPT-2 Small @ 1K ctx:   {kv_cache_gb(12, 12, 64, 1_024):.3f} GB")
# → ~0.035 GB — negligible

Why long-context serving is expensive

Serving 100 simultaneous users with 128K context on a 70B model requires roughly 4 TB of GPU memory just for KV caches (100 × ~39 GB). This is why long-context inference costs significantly more than short-context inference, and why techniques like paged attention (vLLM) and grouped-query attention (GQA) are critical for production deployments.
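The arithmetic is easy to reproduce with the kv_cache_gb helper from the previous section, restated here so the snippet runs on its own:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, d_head: int, seq_len: int,
                dtype_bytes: int = 2, batch_size: int = 1) -> float:
    # 2 = K and V matrices, both cached for every layer
    return (2 * n_layers * n_kv_heads * d_head
            * seq_len * batch_size * dtype_bytes) / 1024**3

per_user = kv_cache_gb(80, 8, 128, 128_000)   # LLaMA 3 70B, FP16, 8 KV heads (GQA)
fleet_tb = 100 * per_user / 1024              # 100 concurrent users

print(f"{per_user:.1f} GB per user, {fleet_tb:.1f} TB for 100 users")
# → 39.1 GB per user, 3.8 TB for 100 users
```

Without GQA (64 KV heads instead of 8) the same formula gives 8× these numbers, which is the design pressure behind grouped-query attention in the first place.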

Practice questions

  1. What is the 'lost in the middle' problem and how should you structure prompts to avoid it? (Answer: Liu et al. (2023): LLMs retrieve information better from the beginning and end of the context window — information in the middle receives less attention. For a 10-document RAG prompt, the most relevant document should be first or last. Practical strategies: (1) Put critical instructions at the start AND repeat key constraints at the end. (2) For RAG: most relevant retrieved chunk first, then others, with the query last. (3) Use chunking strategies that keep related content contiguous rather than split across the context middle.)
  2. What is the difference between a model's training context length and its effective context length? (Answer: Training context length: the maximum sequence length used during training (where positional encodings are defined). Effective context length: how well the model actually uses information throughout the full window. Models often degrade in accuracy for facts placed deep in the middle of long contexts even within their stated window. Evaluation: NIAH (Needle in a Haystack) test — hide a fact at varying positions in a long document and measure retrieval accuracy. Many models claiming 128K context score poorly on NIAH beyond 32K.)
  3. Claude has a 200K context window. What is the approximate page count this can handle? (Answer: 1 page of text ≈ 250 words ≈ 325 tokens. 200,000 tokens / 325 ≈ 615 pages. In practice: a 400-page novel ≈ 130K tokens. A 200K context can hold: a complete novel + research paper + conversation history simultaneously. API pricing: input tokens at ~$3/million means a 200K context costs ~$0.60 per call. For document Q&A workflows: feed the entire document in context rather than chunking, but be aware of the 'lost in the middle' degradation for very long inputs.)
  4. What is KV cache compression and how does it extend effective context length? (Answer: KV cache stores all past token keys and values — grows linearly with context. At 128K context with LLaMA 3 70B: ~39 GB just for KV cache (see KV Cache glossary entry). Compression techniques: (1) Quantised KV cache (INT8/INT4): 2–4× memory reduction with <1% perplexity loss. (2) KV eviction (H2O, StreamingLLM): drop low-importance KV entries for old tokens. (3) KV merging: merge similar KV vectors. Trade-off: compression reduces memory but may cause forgetting of evicted context.)
  5. When is a smaller context window actually better for users than a large one? (Answer: Smaller context forces better information architecture: (1) Retrieval discipline — small windows require RAG, which pulls only relevant information rather than dumping everything. (2) Lower latency — smaller inputs = faster first token. (3) Lower cost — input tokens are billed; small context = cheaper per query. (4) Focus — models sometimes perform better on focused contexts than massive diluted ones. Best practice: use a context window appropriately sized for the task, not the maximum available.)
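The NIAH test mentioned in question 2 can be sketched as a harness that plants a fact at varying depths in filler text (scoring against a real model is omitted; this only constructs the probe prompts, and all names here are illustrative):

```python
def build_niah_prompt(filler: str, needle: str, depth: float, n_chars: int) -> str:
    """Embed a 'needle' fact at a fractional depth (0.0–1.0) inside filler text."""
    haystack = (filler * (n_chars // len(filler) + 1))[:n_chars]
    pos = int(depth * n_chars)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

needle = "The magic number is 7481."
for depth in (0.0, 0.5, 1.0):
    prompt = build_niah_prompt("lorem ipsum ", needle, depth, n_chars=2_000)
    print(depth, prompt.index(needle))   # needle sits deeper as depth grows
```

A full evaluation would send each prompt plus the question "What is the magic number?" to the model and plot retrieval accuracy against depth; the lost-in-the-middle effect shows up as a dip at intermediate depths.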

On LumiChats

LumiChats supports a 5-million token effective context across sessions with its RAG pipeline. Documents too large for a single context window are chunked and retrieved semantically — the AI always receives the most relevant content, even for very long documents.

Try it free

Try LumiChats for ₹69

39+ AI models. Study Mode with page-locked answers. Agent Mode with code execution. Pay only on days you use it.

Get Started — ₹69/day
