AI Foundations · Benchmark literacy after MMLU

The new spine

You can name the six benchmarks that replaced MMLU as the 2026 reading spine, and match each to the failure mode of its predecessor that it was built to resist.

Refreshed: May 2026.

When a benchmark dies, the replacements always arrive with a specific design intent. They are engineered against the cause of the previous death. Knowing which death each new benchmark is the answer to is most of the literacy.

This chapter walks through all six in four lessons. Three of them (METR Time Horizon, τ²-Bench and AuditBench) are tied conceptually and worth holding together, so they share the final lesson.

- Humanity's Last Exam is the post-MMLU knowledge ceiling. ~3,000 closed-book frontier questions across math, sciences and humanities. Private holdout. Designed to stay hard: top models were under 10% in January 2025, ~46% in March 2026 (GPT-5.4). Lesson 1.
- FrontierMath is research-grade math, validated by Terence Tao and Tim Gowers. PhD-hours-per-problem. Top frontier at 0.476 (GPT-5.4), nowhere near saturation. Lesson 2.
- ARC-AGI-2 is Chollet's 2025 visual-reasoning successor to the original ARC. Grids that are trivial for humans and hard for memorising machines. Most frontier reasoning models score under 20%. Lesson 3.
- METR Time Horizon 1.1 measures the duration of autonomous task completion, doubling every ~7 months since 2019 and every ~4 months since 2024. Plus τ²-Bench for tool-use under policy constraints, and AuditBench for alignment auditing: three benchmarks that together cover the new "agent and alignment" surface. Lesson 4.
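The two doubling times quoted for METR Time Horizon are easier to compare as yearly growth factors. The following is a minimal sketch of that arithmetic, assuming clean exponential growth at the quoted rates. The function name `growth_factor` is illustrative and not part of any METR tooling.

```python
def growth_factor(months_elapsed: float, doubling_months: float) -> float:
    """Multiplier on the time horizon after `months_elapsed` months,
    assuming it doubles every `doubling_months` months (an idealisation)."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2 ** (months_elapsed / doubling_months)


# Approximate doubling times quoted in the list above.
slow_trend = growth_factor(12, 7)   # ~7-month doubling (trend since 2019)
fast_trend = growth_factor(12, 4)   # ~4-month doubling (trend since 2024)

print(f"One year at a 7-month doubling: {slow_trend:.1f}x")  # roughly 3.3x
print(f"One year at a 4-month doubling: {fast_trend:.1f}x")  # 8.0x
```

Under these assumptions, a year on the older trend multiplies the horizon by a little over three, while a year on the newer trend multiplies it by eight. That gap is why the change in doubling time matters more than any single data point.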

Chapter contains 4 lessons.