
LLM Benchmarks — MMLU, HumanEval, HellaSwag & Real-World Evaluation

How we measure AI performance — and why benchmark scores don't tell the full story.


Definition

LLM benchmarks are standardised test suites measuring specific capabilities: MMLU (multitask knowledge), HumanEval (code generation), GSM8K (math reasoning), HellaSwag (commonsense), MATH (competition mathematics), and MT-Bench (instruction following). Benchmark scores are essential for comparing models but have well-known limitations — benchmark saturation, data contamination (training on test data), and poor correlation with real-world deployment performance. The industry increasingly combines automated benchmarks with human evaluation, A/B testing in production, and task-specific evaluation suites.

Major LLM benchmarks and what they measure

| Benchmark | Measures | Format | Human baseline | GPT-4 score |
|---|---|---|---|---|
| MMLU | Knowledge across 57 subjects (law, medicine, CS, history) | Multiple choice, 4 options | ~88% | ~87% |
| HumanEval | Python code generation correctness | Complete function from docstring | N/A | ~67% pass@1 |
| GSM8K | Grade-school math word problems | Free-form reasoning + answer | ~98% | ~92% |
| MATH | Competition mathematics (AMC, AIME level) | Multi-step problem solving | ~40% | ~42% |
| HellaSwag | Physical commonsense (activity completion) | Multiple-choice sentence completion | ~95% | ~95% |
| MT-Bench | Multi-turn instruction-following quality | GPT-4 judges, 1-10 score | N/A | 8.99/10 |
| BIG-Bench Hard | Hard tasks requiring multi-step reasoning | Multiple choice | N/A | Varies widely |
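Multiple-choice benchmarks like MMLU and HellaSwag are typically scored not by free-form generation but by comparing the model's log-likelihood of each answer option. A toy sketch (with made-up numbers) of the two accuracy variants the evaluation harness reports, `acc` and length-normalised `acc_norm`:

```python
def score_mc_item(option_logprobs, option_lengths, gold_idx):
    """Score one multiple-choice item two ways: `acc` picks the option with
    the highest total log-likelihood; `acc_norm` divides by option length
    so longer answer strings aren't penalised."""
    raw_pick = max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])
    norm_pick = max(range(len(option_logprobs)),
                    key=lambda i: option_logprobs[i] / option_lengths[i])
    return raw_pick == gold_idx, norm_pick == gold_idx

# Toy item: option 1 has the best raw score, option 0 the best per-token score
acc, acc_norm = score_mc_item(
    option_logprobs=[-12.0, -9.0, -20.0, -15.0],
    option_lengths=[6, 3, 5, 5],
    gold_idx=1,
)
```

The two metrics can disagree on the same item, which is why the harness reports both.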

Running LLM benchmarks with lm-evaluation-harness

# The EleutherAI Language Model Evaluation Harness is the standard tool
# pip install lm-eval

# Command-line evaluation (most common pattern)
# Evaluate Llama-3.2-1B on MMLU and GSM8K:
#
#   lm_eval --model hf \
#       --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct \
#       --tasks mmlu,gsm8k \
#       --device cuda:0 \
#       --batch_size 8 \
#       --output_path ./results/llama_1b

# Python API evaluation
from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM
import json

# Load model for evaluation
lm_obj = HFLM(pretrained="unsloth/Llama-3.2-1B-Instruct",
              dtype="bfloat16", device="cuda")

# Run evaluation on multiple benchmarks
results = simple_evaluate(
    model=lm_obj,
    tasks=["mmlu", "gsm8k", "hellaswag"],
    num_fewshot=5,  # a single int applied to all tasks; per-task counts come from each task's config
    batch_size=8,
)

# Print results
for task, metrics in results["results"].items():
    acc = metrics.get("acc,none", metrics.get("acc_norm,none", "N/A"))
    print(f"{task:20}: {acc:.3f}" if isinstance(acc, float) else f"{task:20}: {acc}")

# ── Custom benchmark for your specific use case ──
# Standard benchmarks rarely match production requirements
# Build task-specific evaluation suites

def evaluate_sql_generation(model, tokenizer, test_cases):
    """Evaluate model on SQL generation for your schema."""
    correct = 0
    for prompt, expected_sql, db_schema in test_cases:
        full_prompt = f"Schema: {db_schema}\nQuestion: {prompt}\nSQL:"
        inputs  = tokenizer(full_prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(**inputs, max_new_tokens=200,
                                 do_sample=False)  # greedy; temperature is ignored unless do_sample=True
        generated_sql = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],  # strip the echoed prompt
            skip_special_tokens=True)

        # Execute both SQLs and compare results
        try:
            result_generated = execute_sql(generated_sql)
            result_expected  = execute_sql(expected_sql)
            if set(map(tuple, result_generated)) == set(map(tuple, result_expected)):
                correct += 1
        except Exception:
            pass  # SQL that fails to execute counts as wrong

    return correct / len(test_cases)
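The function above calls an `execute_sql` helper it never defines. One plausible implementation (an assumption, not part of the original), using an in-memory SQLite database as a stand-in for your real schema:

```python
import sqlite3

def make_execute_sql(db_path: str = ":memory:"):
    """Build an execute_sql(sql) helper bound to one SQLite connection.
    Result-set comparison only needs the rows, so fetchall() suffices."""
    conn = sqlite3.connect(db_path)

    def execute_sql(sql: str):
        return conn.execute(sql).fetchall()

    return execute_sql, conn

# Usage: load the test schema, then run generated vs expected queries against it
execute_sql, conn = make_execute_sql()
execute_sql("CREATE TABLE users (id INTEGER, name TEXT)")
execute_sql("INSERT INTO users VALUES (1, 'ada'), (2, 'bob')")
rows = execute_sql("SELECT name FROM users ORDER BY id")
```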

# ── LLM-as-judge evaluation (MT-Bench style) ──
from openai import OpenAI
client = OpenAI()

def llm_judge_response(question: str, response: str, reference: str = None) -> dict:
    """Use GPT-4o-mini as an evaluator (cheaper than GPT-4)."""
    rubric = """Rate this response 1-10 on:
- Accuracy (is it factually correct?)
- Completeness (does it fully answer the question?)
- Clarity (is it easy to understand?)
Provide a JSON: {"scores": {"accuracy": X, "completeness": X, "clarity": X}, "reasoning": "..."}"""

    eval_prompt = f"Question: {question}\nResponse: {response}\n\n{rubric}"
    if reference:
        eval_prompt += f"\nReference answer: {reference}"

    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"}
    )
    return json.loads(result.choices[0].message.content)
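In practice you judge many question/response pairs and then aggregate. A small helper (an illustrative addition, not from the original) for averaging the rubric dimensions across a batch of judgements shaped like `llm_judge_response`'s output:

```python
from statistics import mean

def aggregate_judge_scores(judgements):
    """Average each rubric dimension over a list of judge outputs, each
    shaped like {"scores": {"accuracy": ..., ...}, "reasoning": "..."}."""
    dims = judgements[0]["scores"].keys()
    return {d: mean(j["scores"][d] for j in judgements) for d in dims}

summary = aggregate_judge_scores([
    {"scores": {"accuracy": 8, "completeness": 6, "clarity": 9}, "reasoning": "..."},
    {"scores": {"accuracy": 6, "completeness": 8, "clarity": 7}, "reasoning": "..."},
])
```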

Benchmark limitations and real-world evaluation

  • Benchmark contamination: Models trained on internet text may have seen benchmark test sets. MMLU questions appear in many online study guides. Contaminated models score artificially high — not because they generalise, but because they memorised test questions during pretraining.
  • Benchmark saturation: GPT-4 scores ~88% on MMLU, roughly the estimated human-expert baseline. This does not mean GPT-4 has expert-level knowledge across all 57 subjects — it means MMLU can no longer differentiate frontier models, which is why harder benchmarks (GPQA, ARC-AGI) keep appearing.
  • Distribution mismatch: MMLU measures multiple-choice test performance. Production LLMs primarily answer open-ended questions, write code, and hold conversations. High MMLU score does not guarantee good conversational ability.
  • Goodhart's Law in benchmarks: Once a benchmark is widely used, developers optimise specifically for it. Models can be fine-tuned to ace MMLU without improving general knowledge.
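One crude screen for the contamination described above is word-level n-gram overlap between a benchmark's test items and a training corpus. Production checks (e.g. substring matching over normalised text) are more elaborate, but the idea is the same; this is a minimal sketch:

```python
def ngram_overlap(train_text: str, test_text: str, n: int = 8) -> float:
    """Fraction of the test text's word n-grams that also appear in the
    training text -- a rough contamination signal, not a definitive test."""
    def ngrams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    test_grams = ngrams(test_text)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_text)) / len(test_grams)
```

An overlap near 1.0 for a benchmark item suggests the model may have seen it verbatim during pretraining.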

Gold standard: Chatbot Arena (LMSYS)

Chatbot Arena (lmsys.org/chat) is the most trustworthy LLM ranking: users submit prompts, two anonymous models respond, user picks the winner. Results aggregate to Elo ratings. Unlike static benchmarks, Arena reflects diverse real-world usage, is contamination-resistant (new prompts every day), and is extremely hard to game. Claude, GPT-4o, and Gemini Ultra compete at the top of this leaderboard.
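Arena rankings were originally reported as Elo ratings (newer leaderboards fit a Bradley-Terry model, which yields similar orderings). A minimal online Elo update for a single head-to-head vote looks like:

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32):
    """One Elo update after a pairwise vote: the winner gains and the
    loser drops, scaled by how surprising the outcome was."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # win probability of A
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return r_a + delta, r_b - delta

# Two evenly matched models; A wins the vote
new_a, new_b = elo_update(1000, 1000, a_wins=True)
```

Because the update is zero-sum and surprise-weighted, beating a much higher-rated model moves both ratings far more than beating an equal one.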

Practice questions

  1. Model A scores 87% on MMLU, Model B scores 82%. Does this mean Model A is better for production use? (Answer: Not necessarily. MMLU measures academic knowledge in multiple-choice format. Production performance depends on the specific task: code generation, conversation quality, instruction following, safety. Always evaluate on your specific use case. Model B might score higher on HumanEval (code) or have lower latency for your response time requirements.)
  2. What is benchmark contamination and why does it matter? (Answer: LLMs pretrain on internet text which includes benchmark test sets. A model that has memorised MMLU questions scores high without truly understanding the material — analogous to cheating on an exam. Contamination makes it hard to fairly compare models and overestimates capabilities. Detection: check if accuracy on held-out variants drops significantly.)
  3. Why is Chatbot Arena (Elo-based) considered more reliable than static benchmarks? (Answer: User prompts are diverse, fresh (daily new prompts), and match real-world use patterns. Anonymous comparison eliminates bias toward known models. Elo system averages thousands of real preferences. No fixed answer key = no contamination. Hard to game — you cannot train specifically on tomorrow's user prompts.)
  4. pass@k in HumanEval measures what? (Answer: The probability that at least 1 of k generated code samples passes all unit tests. pass@1 = accuracy with one attempt. pass@10 = probability of at least one correct solution in 10 attempts. Higher k → more chances to get it right. Production code assistants effectively use pass@10+ since users can ask for regeneration.)
  5. Your fine-tuned model scores 95% on your custom evaluation dataset but performs poorly in production. What might explain this? (Answer: Overfitting to the evaluation dataset (if it overlaps with fine-tuning data). Distribution shift between evaluation examples and real user queries. Evaluation prompts may be easier than real prompts (cherry-picked). Automated metrics miss important quality dimensions. Solution: use a held-out test set never seen during training, add human evaluation of production samples.)
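The pass@k metric from question 4 is usually computed with the unbiased estimator from the HumanEval paper: generate n ≥ k samples per problem, count the c that pass the tests, and estimate 1 − C(n−c, k)/C(n, k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    (without replacement) from n generations, c of them correct, passes."""
    if n - c < k:  # fewer incorrect samples than k => a correct one is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 2 of 10 generated solutions pass the unit tests
p1 = pass_at_k(n=10, c=2, k=1)
p5 = pass_at_k(n=10, c=2, k=5)
```

Averaging this estimate over every problem in the suite gives the reported benchmark score; note how much k matters even with a fixed success count.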

On LumiChats

LumiChats is evaluated using a combination of automated benchmarks (MMLU, HumanEval, MT-Bench), human preference ratings (similar to Chatbot Arena), and production A/B testing. Understanding these evaluation frameworks helps you interpret capability claims about AI products critically.

