A context window is the maximum number of tokens an AI model can process in a single request — including both input (your messages, files, instructions) and output (the AI's response). Everything outside the context window is invisible to the model. Modern context windows range from 32,000 tokens (Mistral 7B) to 1,000,000 tokens (Gemini 1.5 Pro).
What goes into the context window
The context window is a single flat sequence of tokens — not separate memory buckets. Everything competes for the same space:
- System prompt — platform instructions, persona, injected RAG context
- Conversation history — every prior user message and assistant response
- Uploaded content — PDFs, code, data (inserted as text tokens)
- Current response — tokens being generated consume context as they're produced
Estimating context usage before sending a request
```python
import tiktoken

# cl100k_base is the GPT-4 tokenizer; Claude and Gemini tokenize somewhat
# differently, but the counts are close enough for planning purposes.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

system_prompt = "You are a study assistant..."      # ~500 tokens
conversation = "User: explain X\nAssistant: ..."    # ~2,000 tokens
uploaded_pdf = open("chapter3.txt").read()          # might be ~8,000 tokens
current_query = "Now summarize pages 30-45"         # ~10 tokens

total_input = (count_tokens(system_prompt)
               + count_tokens(conversation)
               + count_tokens(uploaded_pdf)
               + count_tokens(current_query))

max_context = 128_000  # GPT-4o context window
remaining = max_context - total_input

print(f"Input tokens used: {total_input:,}")
print(f"Remaining for response: {remaining:,}")
print(f"Context used: {total_input/max_context:.1%}")
```
Rule of thumb
1 token ≈ 0.75 words ≈ 4 characters in English. A standard A4 page of text ≈ 400 tokens. A 200-page textbook ≈ 80,000–120,000 tokens. GPT-4o's 128K context fits roughly a full novel.
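The rule of thumb above can be turned into a quick estimator when no tokenizer is installed. This is a minimal sketch using only the 4-characters and 0.75-words heuristics from this section; real tokenizers vary by language and content, so treat the result as a planning figure only.

```python
# Heuristic token estimate without a tokenizer, based on the rule of thumb:
# 1 token ~ 4 characters ~ 0.75 words in English. Rough planning only.

def rough_token_count(text: str) -> int:
    """Average the character-based and word-based estimates."""
    by_chars = len(text) / 4             # 1 token ~ 4 characters
    by_words = len(text.split()) / 0.75  # 1 token ~ 0.75 words
    return round((by_chars + by_words) / 2)

page = "word " * 300  # roughly one page worth of words
print(rough_token_count(page))  # a few hundred tokens, close to the ~400/page figure
```

Both heuristics are averaged because character counts overestimate for prose with long words, while word counts overestimate for code and punctuation-heavy text.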
Context windows of popular models (2025)
| Model | Context window | ≈ Words | ≈ Pages | Provider |
|---|---|---|---|---|
| Claude 3.7 Sonnet | 200,000 tokens | 150,000 | ~500 | Anthropic |
| Gemini 1.5 Pro | 1,000,000 tokens | 750,000 | ~2,500 | Google |
| GPT-4o | 128,000 tokens | 96,000 | ~320 | OpenAI |
| DeepSeek V3 | 128,000 tokens | 96,000 | ~320 | DeepSeek |
| Llama 3.1 70B | 128,000 tokens | 96,000 | ~320 | Meta (open) |
| Mistral Large | 128,000 tokens | 96,000 | ~320 | Mistral AI |
Larger context windows expand what's possible — loading an entire codebase, a book, a legal document — but they come at a cost: attention computation is O(n²) in sequence length, so longer contexts are proportionally more expensive to process.
Attention is quadratic in sequence length n and linear in model dimension d. Doubling context length quadruples attention compute.
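The quadratic scaling can be made concrete with a toy FLOP count. This sketch counts only the two n²·d matrix products in attention (scores and weighted values), ignoring projections and softmax, which is enough to show the scaling behavior.

```python
# Attention compute is O(n^2 * d): every token attends to every other token.
# This toy count shows compute quadrupling when context length doubles.

def attention_flops(n: int, d: int) -> int:
    # QK^T scores (n^2 * d multiply-adds) plus the weighted sum over V
    # (another n^2 * d). Projections and softmax are omitted: scaling sketch only.
    return 2 * n * n * d

base = attention_flops(8_192, 128)
doubled = attention_flops(16_384, 128)
print(doubled / base)  # → 4.0
```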
What happens when you exceed the limit
When a conversation exceeds the context window, content must be dropped — typically the oldest messages first (FIFO truncation). The model silently loses access to earlier context, which can cause contradictions, repetition, or loss of task continuity. Several strategies keep long-running tasks within the limit:
- RAG (Retrieval-Augmented Generation) — retrieve only the most relevant chunks at query time instead of loading everything. Stays well within context limits regardless of total document size.
- Summarization — compress older conversation turns into a running summary, preserving the gist without every token.
- Context distillation — use a smaller model to compress context before passing to the main model.
- Larger context model — upgrade to Gemini 1.5 Pro (1M tokens) or Claude's 200K for long-document use cases.
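The summarization strategy above can be sketched as a small history-compaction step. Here `summarize` is a hypothetical stand-in for a cheap LLM call; to keep the example self-contained and runnable it just truncates its input.

```python
# Sketch of the running-summary strategy: fold older turns into one summary
# message, keep recent turns verbatim. `summarize` stands in for a real
# LLM call (hypothetical); here it simply truncates.

def summarize(text: str, max_chars: int = 200) -> str:
    return text[:max_chars]  # a real system would call a cheap model here

def compact_history(messages: list[str], keep_recent: int = 4) -> list[str]:
    """Compress all but the last `keep_recent` turns into a single summary."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(" | ".join(old))
    return [f"[summary of earlier turns] {summary}"] + recent

history = [f"turn {i}" for i in range(10)]
print(len(compact_history(history)))  # → 5 (1 summary + 4 recent turns)
```

Keeping the most recent turns verbatim matters: they carry the active task state, while older turns usually only need their gist.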
Silent failure
Context overflow usually fails silently — no error is thrown. The model simply can't access the dropped content. Monitor token usage in production to detect truncation before it affects output quality.
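One way to monitor for this is a pre-flight budget check: estimate the token count before each request and truncate explicitly (with a flag you can log) instead of letting the provider drop content silently. A minimal sketch, using the 4-characters-per-token heuristic; swap in a real tokenizer for production.

```python
# Minimal truncation guard: estimate tokens before a request, drop oldest
# messages (FIFO) until the estimate fits, and report whether truncation
# happened so it can be logged or alerted on.

def est_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic: ~4 chars per token

def fit_to_budget(messages: list[str], budget: int) -> tuple[list[str], bool]:
    """Return (messages that fit, whether anything was dropped)."""
    kept = list(messages)
    truncated = False
    while kept and sum(est_tokens(m) for m in kept) > budget:
        kept.pop(0)  # drop the oldest turn first
        truncated = True
    return kept, truncated

msgs = ["x" * 400] * 10  # ~100 estimated tokens each
kept, truncated = fit_to_budget(msgs, budget=350)
print(len(kept), truncated)  # → 3 True
```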
The 'lost in the middle' problem
Larger context windows don't guarantee uniform retrieval quality across them. Liu et al. (2023) showed a striking primacy-recency bias in LLM attention:
Lost in the middle (Liu et al., 2023)
When relevant information is placed in the middle of a long context, LLMs perform significantly worse at retrieving it compared to information placed at the beginning or end. Performance can drop by 20+ percentage points between beginning/end placement vs middle placement.
This means 1M token context is not equivalent to 1M tokens of perfectly accessible memory. Practical implication: structure prompts so that the most critical information appears at the start or end — not buried in the middle. RAG helps by surfacing the most relevant chunks to prominent positions.
| Information placement | Retrieval accuracy | Recommendation |
|---|---|---|
| Beginning of context | High (primacy effect) | Put key instructions and critical facts here |
| End of context | High (recency effect) | Put the actual question/task here |
| Middle of context | Significantly degraded | Avoid placing critical information here |
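The placement advice in the table can be applied mechanically when assembling a RAG prompt: order retrieved chunks so the highest-scoring ones land at the edges of the context and the weakest ones fall in the degraded middle. A sketch (the function name and interleaving scheme are my own, not from the paper):

```python
# Order chunks so the best-ranked land at the beginning and end of the
# prompt, pushing the weakest into the middle where retrieval degrades.

def edge_order(chunks_by_score: list[str]) -> list[str]:
    """chunks_by_score: best first. Alternate chunks between front and back."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_score):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["best", "2nd", "3rd", "4th", "worst"]
print(edge_order(ranked))  # → ['best', '3rd', 'worst', '4th', '2nd']
```

The top chunk opens the context (primacy), the second-best closes it just before the question (recency), and the lowest-ranked chunk ends up in the middle.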
KV cache: how context windows are stored in GPU memory
During autoregressive generation, each Transformer layer computes key (K) and value (V) matrices for every input token. These must be cached so that generating token n+1 doesn't require recomputing all previous tokens from scratch.
KV cache grows linearly with sequence length T. For Llama 3.1 70B (80 layers, 8 KV heads via grouped-query attention, d_head = 128) in FP16: ~10 GB at 32K tokens, ~40 GB at 128K tokens.
Computing KV cache memory requirements for any model
```python
def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    d_head: int,
    seq_len: int,
    dtype_bytes: int = 2,  # 2 = FP16/BF16, 1 = INT8, 4 = FP32
    batch_size: int = 1,
) -> float:
    """Compute KV cache memory in gigabytes."""
    # 2 = K and V (both need to be cached)
    bytes_total = (2 * n_layers * n_kv_heads * d_head
                   * seq_len * batch_size * dtype_bytes)
    return bytes_total / (1024 ** 3)

# Llama 3.1 70B with 128K context (FP16; 8 KV heads thanks to GQA)
print(f"Llama 3.1 70B @ 128K ctx: {kv_cache_gb(80, 8, 128, 128_000):.1f} GB")
# → ~39 GB just for KV cache (on top of ~140 GB model weights in FP16)

# GPT-2 Small (tiny model, 1K native context) for comparison
print(f"GPT-2 Small @ 1K ctx: {kv_cache_gb(12, 12, 64, 1_024):.3f} GB")
# → ~0.035 GB — negligible
```
Why long-context serving is expensive
Serving 100 simultaneous users at 128K context on a 70B model requires roughly 4 TB of GPU memory for KV caches alone (100 × ~39 GB). This is why long-context inference costs significantly more than short-context inference, and why techniques like paged attention (vLLM) and grouped-query attention (GQA) are critical for production deployments.
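GQA's contribution is visible directly in the cache formula: KV memory scales with the number of KV heads, and Llama-class 70B models share 8 KV heads across 64 query heads. The sketch below redefines the cache formula from the previous block so it stands alone; the 64-KV-head configuration is a hypothetical "no GQA" baseline for comparison.

```python
# KV cache scales with the number of KV heads. Comparing a hypothetical
# full multi-head config (64 KV heads) against the actual GQA config
# (8 KV heads) shows the 8x memory reduction that makes serving viable.

def kv_cache_gb(n_layers, n_kv_heads, d_head, seq_len, dtype_bytes=2):
    # factor 2 = keys and values; result in GiB
    return 2 * n_layers * n_kv_heads * d_head * seq_len * dtype_bytes / 1024**3

mha = kv_cache_gb(80, 64, 128, 128_000)  # hypothetical: one KV head per query head
gqa = kv_cache_gb(80, 8, 128, 128_000)   # actual GQA configuration
print(f"{mha:.0f} GB vs {gqa:.0f} GB")  # → 312 GB vs 39 GB
```

Without GQA, a single 128K-context user would need more KV-cache memory than several H100s hold; with it, the cache fits alongside the weights.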
Practice questions
- What is the 'lost in the middle' problem and how should you structure prompts to avoid it? (Answer: Liu et al. (2023): LLMs retrieve information better from the beginning and end of the context window — information in the middle receives less attention. For a 10-document RAG prompt, the most relevant document should be first or last. Practical strategies: (1) Put critical instructions at the start AND repeat key constraints at the end. (2) For RAG: most relevant retrieved chunk first, then others, with the query last. (3) Use chunking strategies that keep related content contiguous rather than split across the context middle.)
- What is the difference between a model's training context length and its effective context length? (Answer: Training context length: the maximum sequence length used during training (where positional encodings are defined). Effective context length: how well the model actually uses information throughout the full window. Models often degrade in accuracy for facts placed deep in the middle of long contexts even within their stated window. Evaluation: NIAH (Needle in a Haystack) test — hide a fact at varying positions in a long document and measure retrieval accuracy. Many models claiming 128K context score poorly on NIAH beyond 32K.)
- Claude has a 200K context window. What is the approximate page count this can handle? (Answer: 1 page of text ≈ 250 words ≈ 325 tokens. 200,000 tokens / 325 ≈ 615 pages. In practice: a 400-page novel ≈ 130K tokens. A 200K context can hold: a complete novel + research paper + conversation history simultaneously. API pricing: input tokens at ~$3/million means a 200K context costs ~$0.60 per call. For document Q&A workflows: feed the entire document in context rather than chunking, but be aware of the 'lost in the middle' degradation for very long inputs.)
- What is KV cache compression and how does it extend effective context length? (Answer: KV cache stores all past token keys and values — grows linearly with context. At 128K context with Llama 3.1 70B: ~39 GB just for KV cache (see KV Cache glossary entry). Compression techniques: (1) Quantised KV cache (INT8/INT4): 2–4× memory reduction with <1% perplexity loss. (2) KV eviction (H2O, StreamingLLM): drop low-importance KV entries for old tokens. (3) KV merging: merge similar KV vectors. Trade-off: compression reduces memory but may cause forgetting of evicted context.)
- When is a smaller context window actually better for users than a large one? (Answer: Smaller context forces better information architecture: (1) Retrieval discipline — small windows require RAG, which pulls only relevant information rather than dumping everything. (2) Lower latency — smaller inputs = faster first token. (3) Lower cost — input tokens are billed; small context = cheaper per query. (4) Focus — models sometimes perform better on focused contexts than massive diluted ones. Best practice: use a context window appropriately sized for the task, not the maximum available.)
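The NIAH evaluation mentioned in the answers above is simple to set up: hide a known fact at varying fractional depths in filler text and query the model at each depth. A sketch of the harness; `ask_model` in the comment is a hypothetical stand-in for a real API call, so the runnable part only builds the prompts.

```python
# Needle-in-a-Haystack probe sketch: insert a known fact ("needle") at a
# fractional depth in filler text, then (with a real model) ask for it
# at each depth and plot accuracy vs depth.

def build_haystack(filler: str, needle: str, depth: float) -> str:
    """Insert `needle` at fractional `depth` (0.0 = start, 1.0 = end)."""
    cut = int(len(filler) * depth)
    return filler[:cut] + needle + filler[cut:]

filler = "lorem ipsum dolor sit amet " * 500
needle = " The secret code is 7421. "
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_haystack(filler, needle, depth)
    # accuracy[depth] = ask_model(prompt + "\nWhat is the secret code?")
    print(depth, needle in prompt)  # needle is present at every depth
```

A model with uniform effective context scores flat across depths; the "lost in the middle" pattern shows up as a dip at depths around 0.4–0.6.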
Practical guide: which model for which document size
Context window size should be one of the first factors when choosing a model for a document task. Here's a practical matching guide for the most common real-world document types.
| Document type | Typical token count | Recommended model | Strategy | Notes |
|---|---|---|---|---|
| Single email / short message | <1,000 tokens | Any model — Haiku/GPT-mini for cost | Direct in-context | No context planning needed; any model works |
| Research paper / article | 3,000–10,000 tokens | GPT-4o, Claude Sonnet, Gemini Flash | Direct in-context | Fits comfortably in any 32K+ model; no RAG needed |
| Technical specification / long report | 10,000–30,000 tokens | GPT-4o (128K), Claude Sonnet (200K) | Direct in-context | Consider RAG if you ask specific questions rather than full analysis |
| Full textbook chapter | 30,000–80,000 tokens | Claude Sonnet (200K), GPT-4o (128K) | Direct in-context OR RAG with page-range pinning | Watch for "lost in the middle" degradation — most relevant section first |
| Complete textbook (200 pages) | 60,000–130,000 tokens | Claude Sonnet (200K) or Gemini 1.5 Pro (1M) | RAG recommended for specific questions; full context for overview | Full-context single call: ~$0.40–$0.60 with Claude Sonnet |
| Large codebase (50+ files) | 100,000–500,000 tokens | Gemini 1.5/2.0 Pro (1M) | Agentic retrieval (read-relevant-files) OR Gemini 1M context | Claude Code / Cursor handle this via selective file loading rather than full context |
| Legal document corpus | 500,000–5,000,000 tokens | RAG over pgvector/Pinecone + any 128K+ model | RAG only — no single-context model handles this well | GraphRAG for entity-relationship queries; standard RAG for factual lookup |
| Entire novel (400 pages) | ~130,000 tokens | Claude Sonnet (200K) or Gemini Pro (1M) | Direct in-context for analysis; RAG for specific quotes | Gemini 1.5 Pro is the best current choice for full-book reasoning tasks |
Estimating document token count before choosing your model
```python
import tiktoken

# tiktoken works for GPT-4o and approximates Claude/Gemini tokenization
# (Claude tokenizes ~5-10% differently but it's close enough for planning)
enc = tiktoken.encoding_for_model("gpt-4o")

def estimate_tokens(text: str) -> int:
    return len(enc.encode(text))

def recommend_model(text: str) -> dict:
    """Given a document, recommend the right model and strategy."""
    tokens = estimate_tokens(text)
    approx_pages = tokens / 325  # ~325 tokens per page of text
    if tokens < 10_000:
        model = "gpt-4o-mini or claude-haiku-4-5-20251001"
        strategy = "Direct in-context — fits easily"
        cost_estimate = f"~${tokens * 0.15 / 1_000_000:.4f} input"
    elif tokens < 60_000:
        model = "gpt-4o (128K) or claude-sonnet-4-6 (200K)"
        strategy = "Direct in-context — still fits well"
        cost_estimate = f"~${tokens * 2.50 / 1_000_000:.4f} input"
    elif tokens < 128_000:
        model = "claude-sonnet-4-6 (200K) or gemini-1.5-pro (1M)"
        strategy = "Direct in-context, but consider RAG for specific Q&A"
        cost_estimate = f"~${tokens * 3.00 / 1_000_000:.4f} input (Claude)"
    elif tokens < 500_000:
        model = "gemini-1.5-pro or gemini-2.0-pro (1M context)"
        strategy = "Gemini long-context OR RAG pipeline recommended"
        cost_estimate = f"~${tokens * 1.25 / 1_000_000:.4f} input (Gemini 1.5)"
    else:
        model = "RAG pipeline (any 128K+ model)"
        strategy = "RAG required — no single-context model handles this"
        cost_estimate = "RAG indexing cost: ~$0.002 per 1K tokens (one-time)"
    return {
        "token_count": f"{tokens:,}",
        "approximate_pages": f"~{approx_pages:.0f} pages",
        "recommended_model": model,
        "strategy": strategy,
        "estimated_input_cost": cost_estimate,
    }

# Example usage
with open("my_document.txt", "r") as f:
    doc = f.read()

recommendation = recommend_model(doc)
for key, value in recommendation.items():
    print(f"{key}: {value}")
```
Context window cost reality check
Large context windows are a feature but also a billing line item. Sending a 200K-token document to Claude Sonnet costs roughly $0.60 for the input alone — on every single API call. For a document Q&A app that processes the same document 1,000 times a day, that's $600/day just in input tokens. RAG typically cuts effective token usage by well over 90% by retrieving only relevant chunks: the same 1,000 queries with 3K tokens of retrieved context each cost about $9/day at the same rate — roughly 65× cheaper. Use full-context models for analysis tasks; use RAG for repetitive Q&A over the same documents.
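The arithmetic behind that comparison, made explicit. The price is the ~$3 per million input tokens figure used in this section; it is illustrative, not a current rate card.

```python
# Daily input-token cost for a document Q&A workload, comparing
# full-context calls against RAG with a small retrieved context.

PRICE_PER_M_INPUT = 3.00  # USD per million input tokens (illustrative)

def daily_input_cost(tokens_per_call: int, calls_per_day: int) -> float:
    return tokens_per_call * calls_per_day * PRICE_PER_M_INPUT / 1_000_000

full_context = daily_input_cost(200_000, 1_000)  # whole document on every call
rag = daily_input_cost(3_000, 1_000)             # ~3K retrieved tokens per call
print(f"${full_context:.2f}/day vs ${rag:.2f}/day")  # → $600.00/day vs $9.00/day
```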
On LumiChats
LumiChats supports a 5-million token effective context across sessions with its RAG pipeline. Documents too large for a single context window are chunked and retrieved semantically — the AI always receives the most relevant content, even for very long documents.