
LLM Benchmarks — MMLU, HumanEval, HellaSwag & Real-World Evaluation

How we measure AI performance — and why benchmark scores don't tell the full story.


Definition

LLM benchmarks are standardised test suites measuring specific capabilities: MMLU (multitask knowledge), HumanEval (code generation), GSM8K (math reasoning), HellaSwag (commonsense), MATH (competition mathematics), and MT-Bench (instruction following). Benchmark scores are essential for comparing models but have well-known limitations — benchmark saturation, data contamination (training on test data), and poor correlation with real-world deployment performance. The industry increasingly combines automated benchmarks with human evaluation, A/B testing in production, and task-specific evaluation suites.

Major LLM benchmarks and what they measure

| Benchmark | Measures | Format | Human baseline | GPT-4 score |
|---|---|---|---|---|
| MMLU | Knowledge across 57 subjects (law, medicine, CS, history) | Multiple choice, 4 options | ~88% | ~87% |
| HumanEval | Python code generation correctness | Complete function from docstring | N/A | ~67% pass@1 |
| GSM8K | Grade-school math word problems | Free-form reasoning + answer | ~98% | ~92% |
| MATH | Competition mathematics (AMC, AIME level) | Multi-step problem solving | ~40% | ~42% |
| HellaSwag | Physical commonsense (activity completion) | Multiple-choice sentence completion | ~95% | ~95% |
| MT-Bench | Multi-turn instruction-following quality | GPT-4 judges, 1-10 score | N/A | 8.99/10 |
| BIG-Bench Hard | Hard tasks requiring multi-step reasoning | Multiple choice | N/A | Varies widely |
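Multiple-choice benchmarks like MMLU and HellaSwag are typically scored not by free-form generation but by comparing the model's log-likelihood of each answer option. A toy sketch (with made-up numbers) of the two accuracy variants the evaluation harness reports, `acc` and length-normalised `acc_norm`:

```python
def score_mc_item(option_logprobs, option_lengths, gold_idx):
    """Score one multiple-choice item two ways: `acc` picks the option with
    the highest total log-likelihood; `acc_norm` divides by option length
    so longer answer strings aren't penalised."""
    raw_pick = max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])
    norm_pick = max(range(len(option_logprobs)),
                    key=lambda i: option_logprobs[i] / option_lengths[i])
    return raw_pick == gold_idx, norm_pick == gold_idx

# Toy item: option 1 has the best raw score, option 0 the best per-token score
acc, acc_norm = score_mc_item(
    option_logprobs=[-12.0, -9.0, -20.0, -15.0],
    option_lengths=[6, 3, 5, 5],
    gold_idx=1,
)
```

The two metrics can disagree on the same item, which is why the harness reports both.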

Running LLM benchmarks with lm-evaluation-harness

# The EleutherAI Language Model Evaluation Harness is the standard tool
# pip install lm-eval

# Command-line evaluation (most common pattern)
# Evaluate Llama-3.2-1B on MMLU and GSM8K:
#
#   lm_eval --model hf \
#       --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct \
#       --tasks mmlu,gsm8k \
#       --device cuda:0 \
#       --batch_size 8 \
#       --output_path ./results/llama_1b

# Python API evaluation
from lm_eval import simple_evaluate
from lm_eval.models.huggingface import HFLM
import json

# Load model for evaluation
lm_obj = HFLM(pretrained="unsloth/Llama-3.2-1B-Instruct",
              dtype="bfloat16", device="cuda")

# Run evaluation on multiple benchmarks
results = simple_evaluate(
    model=lm_obj,
    tasks=["mmlu", "gsm8k", "hellaswag"],
    num_fewshot=5,  # a single int applied to all tasks; per-task counts come from each task's config
    batch_size=8,
)

# Print results
for task, metrics in results["results"].items():
    acc = metrics.get("acc,none", metrics.get("acc_norm,none", "N/A"))
    print(f"{task:20}: {acc:.3f}" if isinstance(acc, float) else f"{task:20}: {acc}")

# ── Custom benchmark for your specific use case ──
# Standard benchmarks rarely match production requirements
# Build task-specific evaluation suites

def evaluate_sql_generation(model, tokenizer, test_cases):
    """Evaluate model on SQL generation for your schema."""
    correct = 0
    for prompt, expected_sql, db_schema in test_cases:
        full_prompt = f"Schema: {db_schema}\nQuestion: {prompt}\nSQL:"
        inputs  = tokenizer(full_prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(**inputs, max_new_tokens=200,
                                 do_sample=False)  # greedy; temperature is ignored unless do_sample=True
        generated_sql = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],  # strip the echoed prompt
            skip_special_tokens=True)

        # Execute both SQLs and compare results
        try:
            result_generated = execute_sql(generated_sql)
            result_expected  = execute_sql(expected_sql)
            if set(map(tuple, result_generated)) == set(map(tuple, result_expected)):
                correct += 1
        except Exception:
            pass  # SQL that fails to execute counts as wrong

    return correct / len(test_cases)
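The function above calls an `execute_sql` helper it never defines. One plausible implementation (an assumption, not part of the original), using an in-memory SQLite database as a stand-in for your real schema:

```python
import sqlite3

def make_execute_sql(db_path: str = ":memory:"):
    """Build an execute_sql(sql) helper bound to one SQLite connection.
    Result-set comparison only needs the rows, so fetchall() suffices."""
    conn = sqlite3.connect(db_path)

    def execute_sql(sql: str):
        return conn.execute(sql).fetchall()

    return execute_sql, conn

# Usage: load the test schema, then run generated vs expected queries against it
execute_sql, conn = make_execute_sql()
execute_sql("CREATE TABLE users (id INTEGER, name TEXT)")
execute_sql("INSERT INTO users VALUES (1, 'ada'), (2, 'bob')")
rows = execute_sql("SELECT name FROM users ORDER BY id")
```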

# ── LLM-as-judge evaluation (MT-Bench style) ──
from openai import OpenAI
client = OpenAI()

def llm_judge_response(question: str, response: str, reference: str = None) -> dict:
    """Use GPT-4o-mini as an evaluator (cheaper than GPT-4)."""
    rubric = """Rate this response 1-10 on:
- Accuracy (is it factually correct?)
- Completeness (does it fully answer the question?)
- Clarity (is it easy to understand?)
Provide a JSON: {"scores": {"accuracy": X, "completeness": X, "clarity": X}, "reasoning": "..."}"""

    eval_prompt = f"Question: {question}\nResponse: {response}\n\n{rubric}"
    if reference:
        eval_prompt += f"\nReference answer: {reference}"

    result = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": eval_prompt}],
        response_format={"type": "json_object"}
    )
    return json.loads(result.choices[0].message.content)
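In practice you judge many question/response pairs and then aggregate. A small helper (an illustrative addition, not from the original) for averaging the rubric dimensions across a batch of judgements shaped like `llm_judge_response`'s output:

```python
from statistics import mean

def aggregate_judge_scores(judgements):
    """Average each rubric dimension over a list of judge outputs, each
    shaped like {"scores": {"accuracy": ..., ...}, "reasoning": "..."}."""
    dims = judgements[0]["scores"].keys()
    return {d: mean(j["scores"][d] for j in judgements) for d in dims}

summary = aggregate_judge_scores([
    {"scores": {"accuracy": 8, "completeness": 6, "clarity": 9}, "reasoning": "..."},
    {"scores": {"accuracy": 6, "completeness": 8, "clarity": 7}, "reasoning": "..."},
])
```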

Benchmark limitations and real-world evaluation

  • Benchmark contamination: Models trained on internet text may have seen benchmark test sets. MMLU questions appear in many online study guides. Contaminated models score artificially high — not because they generalise, but because they memorised test questions during pretraining.
  • Benchmark saturation: GPT-4 scores ~88% on MMLU, roughly the estimated human-expert baseline. This does not mean GPT-4 has expert-level knowledge across all 57 subjects — it means MMLU can no longer differentiate frontier models, which is why harder benchmarks (GPQA, ARC-AGI) keep appearing.
  • Distribution mismatch: MMLU measures multiple-choice test performance. Production LLMs primarily answer open-ended questions, write code, and hold conversations. High MMLU score does not guarantee good conversational ability.
  • Goodhart's Law in benchmarks: Once a benchmark is widely used, developers optimise specifically for it. Models can be fine-tuned to ace MMLU without improving general knowledge.
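One crude screen for the contamination described above is word-level n-gram overlap between a benchmark's test items and a training corpus. Production checks (e.g. substring matching over normalised text) are more elaborate, but the idea is the same; this is a minimal sketch:

```python
def ngram_overlap(train_text: str, test_text: str, n: int = 8) -> float:
    """Fraction of the test text's word n-grams that also appear in the
    training text -- a rough contamination signal, not a definitive test."""
    def ngrams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    test_grams = ngrams(test_text)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_text)) / len(test_grams)
```

An overlap near 1.0 for a benchmark item suggests the model may have seen it verbatim during pretraining.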

Gold standard: Chatbot Arena (LMSYS)

Chatbot Arena (lmsys.org/chat) is the most trustworthy LLM ranking: users submit prompts, two anonymous models respond, user picks the winner. Results aggregate to Elo ratings. Unlike static benchmarks, Arena reflects diverse real-world usage, is contamination-resistant (new prompts every day), and is extremely hard to game. Claude, GPT-4o, and Gemini Ultra compete at the top of this leaderboard.
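Arena rankings were originally reported as Elo ratings (newer leaderboards fit a Bradley-Terry model, which yields similar orderings). A minimal online Elo update for a single head-to-head vote looks like:

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32):
    """One Elo update after a pairwise vote: the winner gains and the
    loser drops, scaled by how surprising the outcome was."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # win probability of A
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return r_a + delta, r_b - delta

# Two evenly matched models; A wins the vote
new_a, new_b = elo_update(1000, 1000, a_wins=True)
```

Because the update is zero-sum and surprise-weighted, beating a much higher-rated model moves both ratings far more than beating an equal one.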

Practice questions

  1. Model A scores 87% on MMLU, Model B scores 82%. Does this mean Model A is better for production use? (Answer: Not necessarily. MMLU measures academic knowledge in multiple-choice format. Production performance depends on the specific task: code generation, conversation quality, instruction following, safety. Always evaluate on your specific use case. Model B might score higher on HumanEval (code) or have lower latency for your response time requirements.)
  2. What is benchmark contamination and why does it matter? (Answer: LLMs pretrain on internet text which includes benchmark test sets. A model that has memorised MMLU questions scores high without truly understanding the material — analogous to cheating on an exam. Contamination makes it hard to fairly compare models and overestimates capabilities. Detection: check if accuracy on held-out variants drops significantly.)
  3. Why is Chatbot Arena (Elo-based) considered more reliable than static benchmarks? (Answer: User prompts are diverse, fresh (daily new prompts), and match real-world use patterns. Anonymous comparison eliminates bias toward known models. Elo system averages thousands of real preferences. No fixed answer key = no contamination. Hard to game — you cannot train specifically on tomorrow's user prompts.)
  4. pass@k in HumanEval measures what? (Answer: The probability that at least 1 of k generated code samples passes all unit tests. pass@1 = accuracy with one attempt. pass@10 = probability of at least one correct solution in 10 attempts. Higher k → more chances to get it right. Production code assistants effectively use pass@10+ since users can ask for regeneration.)
  5. Your fine-tuned model scores 95% on your custom evaluation dataset but performs poorly in production. What might explain this? (Answer: Overfitting to the evaluation dataset (if it overlaps with fine-tuning data). Distribution shift between evaluation examples and real user queries. Evaluation prompts may be easier than real prompts (cherry-picked). Automated metrics miss important quality dimensions. Solution: use a held-out test set never seen during training, add human evaluation of production samples.)
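The pass@k metric from question 4 is usually computed with the unbiased estimator from the HumanEval paper: generate n ≥ k samples per problem, count the c that pass the tests, and estimate 1 − C(n−c, k)/C(n, k):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    (without replacement) from n generations, c of them correct, passes."""
    if n - c < k:  # fewer incorrect samples than k => a correct one is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 2 of 10 generated solutions pass the unit tests
p1 = pass_at_k(n=10, c=2, k=1)
p5 = pass_at_k(n=10, c=2, k=5)
```

Averaging this estimate over every problem in the suite gives the reported benchmark score; note how much k matters even with a fixed success count.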

On LumiChats

LumiChats is evaluated using a combination of automated benchmarks (MMLU, HumanEval, MT-Bench), human preference ratings (similar to Chatbot Arena), and production A/B testing. Understanding these evaluation frameworks helps you interpret capability claims about AI products critically.

