A Large Language Model (LLM) is an AI system trained on massive amounts of text data to understand and generate human language. LLMs power tools like ChatGPT, Claude, Gemini, and LumiChats. They are built on the Transformer architecture and contain billions to trillions of learnable parameters.
How LLMs work
LLMs are trained with a deceptively simple objective: predict the next token. Given a sequence of tokens x₁, x₂, …, xₙ, the model computes the probability of each possible next token:
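Concretely, the model maps the context to a vector of logits z over the vocabulary V and normalizes it with a softmax (standard notation, not tied to any particular model):

```latex
P(x_{n+1} = w \mid x_1, \ldots, x_n) \;=\; \operatorname{softmax}(z)_w \;=\; \frac{e^{z_w}}{\sum_{v \in V} e^{z_v}}
```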
Cross-entropy training loss: the model is penalized for assigning low probability to the correct next token. Minimizing this over trillions of tokens teaches the model language structure, facts, and reasoning.
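Written out, the cross-entropy loss averaged over a corpus of N tokens is the negative log-likelihood the model assigns to each correct next token:

```latex
\mathcal{L}(\theta) \;=\; -\frac{1}{N}\sum_{n=1}^{N} \log P_\theta\!\left(x_{n+1} \mid x_1, \ldots, x_n\right)
```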
Over trillions of such predictions (GPT-3 was trained on ~300 billion tokens; LLaMA 3 on 15 trillion), the model's billions of parameters are adjusted via backpropagation with the Adam optimizer to reduce this loss, implicitly learning grammar, facts, reasoning, and code. At inference time, tokens are generated one at a time, each sampled from the predicted probability distribution.
Autoregressive generation: how LLMs produce text token by token
```python
import torch
import torch.nn.functional as F

def generate(model, tokenizer, prompt: str, max_new_tokens=50, temperature=1.0):
    """Minimal autoregressive generation loop — this is what every LLM inference does."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)
        # Take logits for the LAST token position only
        next_token_logits = logits[:, -1, :] / temperature  # (1, vocab_size)
        # Convert to probabilities and sample
        probs = F.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # (1, 1)
        # Append to sequence and continue
        input_ids = torch.cat([input_ids, next_token], dim=1)
        # Stop at end-of-sequence token
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```

Self-supervised learning
No human labels are needed — the training signal comes from the text itself. Every document on the internet is a self-labeled training example: the model predicts each word from the preceding words. This is why web-scale training is possible.
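A minimal sketch of how the self-labeling works (the integer token IDs are hypothetical): the targets are simply the input tokens shifted one position to the left.

```python
# Any token sequence yields (input, target) pairs for free:
# the target at each position is just the next token in the text.
tokens = [464, 3290, 3332, 319, 262, 2603]  # hypothetical token IDs for a sentence

inputs = tokens[:-1]   # the model sees tokens 0..n-1
targets = tokens[1:]   # and must predict tokens 1..n

for x, y in zip(inputs, targets):
    print(f"given ...{x}, predict {y}")
```

Every position in every document supplies one such training pair, which is what makes web-scale corpora usable without any annotation.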
What 'large' means: scaling laws
The 'large' in LLM refers to parameter count — but the relationship between scale and capability is governed by precise empirical laws. Kaplan et al. (OpenAI, 2020) showed that loss follows a power law in model size N, dataset size D, and compute C:
Kaplan scaling laws: loss decreases as a power law with model size and data size. αN ≈ 0.076, αD ≈ 0.095 from the original paper.
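In the form reported by Kaplan et al., with fitted constants N_c and D_c, the laws read:

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
\alpha_N \approx 0.076,\quad \alpha_D \approx 0.095
```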
Hoffmann et al. (DeepMind, 2022) — the 'Chinchilla' paper — showed the original laws were suboptimal: for a given compute budget, the optimal model is smaller and trained on more data than previously thought. The Chinchilla-optimal ratio: ~20 tokens of training data per parameter.
| Model | Parameters | Training tokens | Organization | Year |
|---|---|---|---|---|
| GPT-2 | 1.5B | 40B | OpenAI | 2019 |
| GPT-3 | 175B | 300B | OpenAI | 2020 |
| Chinchilla | 70B | 1.4T | DeepMind | 2022 |
| GPT-4 (est.) | ~1T (MoE) | ~13T | OpenAI | 2023 |
| LLaMA 3 | 8B / 70B / 405B | 15T | Meta | 2024 |
Chinchilla insight
A 70B model trained on 1.4T tokens (Chinchilla) outperformed a 280B model (Gopher) trained with the same compute budget on far fewer tokens. More data and a smaller model: same cost, better results. This insight drove the design of LLaMA 3 (15T tokens) and Mistral.
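The ~20 tokens-per-parameter rule can be turned into a back-of-the-envelope calculator. This is a rough sketch using the widely cited approximation C ≈ 6·N·D for training FLOPs together with the Chinchilla ratio D ≈ 20·N; the function name is illustrative.

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget into model size N and training tokens D
    using the approximations C ≈ 6·N·D and D ≈ tokens_per_param·N."""
    # Substitute D = tokens_per_param * N into C = 6 * N * D and solve for N
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's budget (~5.8e23 FLOPs) recovers its 70B / 1.4T design point
n, d = chinchilla_optimal(5.8e23)
print(f"optimal params: {n/1e9:.0f}B, tokens: {d/1e12:.1f}T")
```

Plugging in a larger budget shows why frontier labs now favor enormous token counts over ever-larger parameter counts.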
Why models differ
Not all LLMs are equal. Four axes explain most performance differences:
| Dimension | What it affects | Examples |
|---|---|---|
| Training data | Knowledge coverage, language support, code/math ability | Common Crawl, GitHub, arXiv, Wikipedia, books |
| Training compute | Model size, data volume — bounded by GPU budget | GPT-4 training: est. $50–100M. LLaMA 3 405B: est. $10M+ |
| Architecture choices | Context length, efficiency, specialization | MoE vs dense; RoPE vs ALiBi; GQA vs MHA |
| Post-training | Alignment, safety, instruction-following | RLHF, DPO, Constitutional AI, RLAIF |
These differences explain observed specializations: Claude excels at nuanced reasoning, instruction-following, and safety (Constitutional AI + RLHF). DeepSeek and Qwen lead on code and math (heavy GitHub + math dataset training). Gemini leads on multimodal and long-context tasks (1M+ token context, native video training). Mistral leads on efficiency (GQA, sliding window attention for smaller models).
Model card
Every serious LLM release includes a model card specifying training data sources, known limitations, intended use, and safety evaluations. Always check the model card before deploying in a production application.
The pretraining → post-training pipeline
Modern LLMs are built in two major phases with distinct objectives and costs:
- Pretraining: Self-supervised next-token prediction on a massive corpus (trillions of tokens). Trains for weeks to months on thousands of GPUs. Cost: $1M–$100M+. Produces a base model that can complete text but doesn't follow instructions or refuse harmful requests.
- Post-training: (a) Supervised Fine-Tuning (SFT) on (instruction, response) demonstration pairs — teaches instruction-following. (b) RLHF or DPO on human preference data — teaches helpfulness, safety, and alignment. Total cost: $1K–$10M. Produces the conversational model you interact with.
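To make the SFT step concrete, here is a sketch of a single (instruction, response) demonstration pair being formatted into one training string. The chat template shown is purely illustrative; every model family defines its own special tokens and template.

```python
# A single SFT demonstration pair (illustrative content)
example = {
    "instruction": "Explain what a context window is in one sentence.",
    "response": "The context window is the maximum number of tokens an LLM "
                "can attend to in a single forward pass.",
}

# Format into one training string with an illustrative chat template.
# The model is then trained with the same next-token objective as in
# pretraining, typically with the loss masked to the response tokens only.
CHAT_TEMPLATE = "<|user|>\n{instruction}\n<|assistant|>\n{response}<|eos|>"

training_text = CHAT_TEMPLATE.format(**example)
print(training_text)
```

Thousands of such pairs are enough to turn a base model into an instruction follower; the preference-tuning phase (RLHF/DPO) then refines which of several candidate responses the model prefers to produce.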
Pretraining data mix — how training corpora are assembled (LLaMA 3 style)
```python
# LLaMA 3 training data composition (approximate, based on Meta's paper)
TRAINING_MIX = {
    "general_web": {"weight": 0.50, "source": "Common Crawl (cleaned)"},
    "code": {"weight": 0.17, "source": "GitHub, Stack Overflow"},
    "books": {"weight": 0.10, "source": "Books, long-form text"},
    "scientific": {"weight": 0.08, "source": "arXiv, PubMed, academic papers"},
    "wikipedia": {"weight": 0.05, "source": "Wikipedia (multilingual)"},
    "math": {"weight": 0.05, "source": "Math Stack Exchange, proofs"},
    "multilingual": {"weight": 0.05, "source": "CC news in 30+ languages"},
}

total_tokens = 15e12  # 15 trillion tokens
for name, info in TRAINING_MIX.items():
    tokens = total_tokens * info["weight"]
    print(f"{name:20s}: {tokens/1e12:.2f}T tokens ({info['source']})")
```

Why data quality dominates
The biggest gains in recent models (Phi-3, Mistral, Qwen) come not from scale but from data curation. Phi-3-mini (3.8B parameters) matches GPT-3.5 quality by training on high-quality "textbook-style" synthetic data rather than raw web dumps.
Limitations
Understanding LLM limitations is essential for safe deployment:
| Limitation | Root cause | Mitigation |
|---|---|---|
| Knowledge cutoff | Training data has a fixed end date | RAG, web search tools, retrieval augmentation |
| Hallucination | Generates plausible text, not verified facts | RAG, citations, tool use, Constitutional AI |
| No persistent memory | Stateless — each conversation starts fresh | External memory systems, database-backed retrieval |
| Context window limits | Attention is O(n²) in compute/memory | Larger context models, RAG, summarization |
| Inference cost | Billions of FLOPs per token generated | Quantization, speculative decoding, caching |
| No real-time action | Cannot browse web, run code natively | Tool use, code interpreters, agentic frameworks |
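One of the table's mitigations for context window limits, sliding-window chunking, fits in a few lines. A minimal sketch, with a plain integer list standing in for a real tokenizer's output:

```python
def chunk_with_overlap(tokens: list, chunk_size: int, overlap: int) -> list:
    """Split a long token sequence into overlapping windows so that
    content spanning a chunk boundary still appears whole in one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

# Windows of 4 tokens, each sharing 2 tokens with its predecessor;
# the final chunk may be shorter than chunk_size.
chunks = chunk_with_overlap(list(range(10)), chunk_size=4, overlap=2)
print(chunks)
```

Each chunk is then processed (or embedded for RAG retrieval) independently, with the overlap preventing sentences at chunk boundaries from being split away from their context.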
The hallucination trap
LLMs cannot distinguish between "I know this" and "this sounds plausible." They generate fluent, confident text even when wrong. Always verify factual claims — especially statistics, citations, and specific dates — from primary sources.
Practice questions
- What is an LLM's context window and what happens when content exceeds it? (Answer: Context window = the maximum number of tokens the model can process in one forward pass (input + output combined). Claude 3.5: 200K tokens. GPT-4o: 128K. LLaMA 3.1: 128K. When content exceeds the limit: the API throws a context_length_exceeded error (you must chunk or summarize). Within the window: all tokens receive full attention — no information is lost. Common workarounds: sliding-window chunking with overlap, RAG (retrieve only relevant portions), or summarization of earlier context.)
- Why do LLMs sometimes confidently state false information (hallucinate)? (Answer: LLMs are trained to predict the next most probable token given context — not to verify factual accuracy. They have no internal fact-checking mechanism. Hallucination occurs when: (1) the fact was rare or absent in training data, (2) the model interpolates between related facts incorrectly, (3) the model confabulates plausible-sounding completions when uncertain. Mitigation: Retrieval-Augmented Generation (RAG) grounds responses in verified sources; Constitutional AI training reduces confident false statements.)
- What is the difference between a base model and an instruction-tuned model? (Answer: Base model (pretrained): trained to predict next tokens on massive text corpora — continues text in the style of its training data. Chatting with a base model: it completes your message as if continuing a document, not responding helpfully. Instruction-tuned (chat model): fine-tuned on (instruction, response) pairs + RLHF — trained to be a helpful assistant that follows directions. All deployed products (ChatGPT, Claude, Gemini) are instruction-tuned. Base models are used for research and further fine-tuning.)
- How do LLMs achieve few-shot learning from examples in the prompt without weight updates? (Answer: In-context learning: the model uses the examples in the prompt as implicit task specification. The transformer's attention mechanism can identify the pattern across examples and apply it to the new input. This works because during pretraining, the model saw countless documents that implicitly demonstrated tasks through examples. Few-shot ICL is thus a form of pattern matching using the model's pretrained world knowledge rather than gradient-based adaptation.)
- What is the difference between LLM fine-tuning and prompting for a specific task, and when should you use each? (Answer: Prompting: craft instructions/examples in the prompt to guide the model. Zero cost, instant to change, works well for tasks the model already has capability for. Fine-tuning: train model weights on task-specific data. Higher upfront cost, produces consistent behaviour, better for tasks requiring style/format consistency, domain-specific knowledge, or reliably following complex instructions. Rule of thumb: try prompting first. If you hit consistent failure modes despite good prompting, and you have 100+ quality examples, fine-tune.)
On LumiChats
LumiChats gives you access to 39+ LLMs — including GPT-4o, Claude Sonnet, Gemini Pro, DeepSeek V3, Qwen, and Mistral — in a single platform. Switching between models lets you use the best LLM for each specific task.
Try it free