AI Foundations · Benchmark literacy after MMLU

Why the old benchmarks died

You can describe the three independent failure modes — saturation, contamination, leaderboard gaming — that retired the benchmarks the field relied on between 2020 and 2025.

Refreshed: May 2026.

For most of the deep-learning era, a benchmark and a leaderboard were the same thing. You picked a fixed test set, ran every model on it, ordered the scores, and called the top one state-of-the-art. ImageNet ran that script from 2012 to about 2018. MMLU ran it from 2020 to roughly 2024. SWE-bench Verified ran it through 2025. None of the three is usable as a live arbiter today, and they died in three different ways.

Saturation is the cleanest cause of death. A benchmark stops being interesting when every frontier model bunches at the top (say, at 96, 97, and 98) because differences that small fall inside sampling noise. MMLU sits there. So does GPQA Diamond at 92–94%. So does SWE-bench Verified, where Opus 4.7 and DeepSeek V4-Pro now both clear 80%. Saturation is the subject of Lesson 1 of this chapter.
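To make "inside sampling noise" concrete, here is a minimal sketch that treats each benchmark score as a binomial proportion and asks whether the gap between two models exceeds a rough 95% band. The function names, the 200-question test set size, and the example accuracies are illustrative assumptions, not figures from any real leaderboard. The check also treats the two scores as independent samples, which is a simplification: a paired comparison on the same questions would usually be tighter.

```python
import math


def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def gap_is_within_noise(acc_a: float, acc_b: float, n_questions: int,
                        z: float = 1.96) -> bool:
    """Return True when the score gap is no larger than z standard errors.

    Assumption: the two scores are treated as independent samples, so the
    variances add. This is an unpaired approximation, not a paired test.
    """
    se_diff = math.sqrt(
        standard_error(acc_a, n_questions) ** 2
        + standard_error(acc_b, n_questions) ** 2
    )
    return abs(acc_a - acc_b) <= z * se_diff


# Hypothetical numbers: two models at 92% and 94% on a 200-question test set.
print(gap_is_within_noise(0.92, 0.94, n_questions=200))
# The same test set still separates models that are far apart.
print(gap_is_within_noise(0.60, 0.80, n_questions=200))
```

Under these assumptions the two-point gap near the ceiling cannot be told apart from noise, while the twenty-point gap lower down can. That is the practical meaning of saturation: the test set has not changed, but the scores it is asked to rank have moved into the region where it can no longer rank them.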

Contamination kills benchmarks more quietly. Frontier models are trained on web crawls; benchmark questions live on the web; near-paraphrases of those questions end up in the corpus; the headline score rises by recognition, not reasoning. By 2024, careful re-tests of MMLU and HumanEval were finding that large shares of the recent gains were leakage artefacts. Contamination is the subject of Lesson 2.
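One common first-pass check for leakage is to measure how many of a benchmark question's word n-grams also appear in a training document. The sketch below is a simplified illustration of that idea, not the procedure used in any particular re-test. The function names, the tokenisation rule, the default n-gram length, and the sample strings are all assumptions. Exact n-gram overlap only catches verbatim or lightly reformatted copies. The near-paraphrases described above would slip past it and need fuzzier matching.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lower-case the text, keep alphanumeric tokens, return the set of n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in corpus_doc.

    Returns 0.0 when the question is shorter than n tokens.
    """
    question_grams = word_ngrams(question, n)
    if not question_grams:
        return 0.0
    shared = question_grams & word_ngrams(corpus_doc, n)
    return len(shared) / len(question_grams)


# Hypothetical benchmark item and two hypothetical crawled pages.
question = "What is the capital of France and which river runs through it"
leaked_page = ("Quiz dump: what is the capital of France, and which river "
               "runs through it? Answer: Paris, the Seine.")
clean_page = "A travel blog about hiking in the Alps during early autumn."

print(overlap_fraction(question, leaked_page, n=5))
print(overlap_fraction(question, clean_page, n=5))
```

A high overlap fraction flags a question for manual review rather than proving contamination, and a low one does not clear it, since a reworded copy shares few exact n-grams with the original.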

Chapter contains 3 lessons.