After MMLU: a benchmark literacy guide for 2027

AIAcademy · 2026-05-16

Humanity's Last Exam — the post-saturation benchmark

MMLU is done. HumanEval is done. SWE-bench Verified is saturating fast: frontier models score 75-80% and are climbing about a percentage point per month. If benchmarks are your map of what AI can do, you need a new map for 2026-2027. Here are the benchmarks that map should be built on.

Humanity's Last Exam (HLE): 2,500 expert-written questions across maths, physics, biology, classics, law, and humanities — designed by domain specialists to be unsolvable by current models. As of May 2026, frontier models score 12-22%. This is the closest thing to a successor to MMLU at the graduate-expert tier.

FrontierMath: research-level mathematics, original problems, hidden test set. Tao, Borcherds, and Gowers vetted the difficulty as "extremely hard." Top models sit around 25-30%, and the remaining headroom reflects real difficulty, not measurement noise.

ARC-AGI-2: visual-reasoning puzzles, deliberately weighted toward novel abstractions humans solve easily and models do not. Frontier scores remain under 50% while humans hit ~98%. The remaining gap is the cleanest "fluid intelligence" signal the field has.

METR Time Horizon: not a saturation benchmark but a continuous scale. It asks how long a task, measured by the time it takes a skilled human, a model can finish at 50% reliability. Mythos Preview sits at ~16.8 hours. The reliability gradient (50% → 80% → 95%) is where production engineering lives. See the time horizons article for the chart.
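
To make the reliability gradient concrete, here is a minimal sketch of how a horizon shrinks as the required reliability rises. It assumes success probability follows a logistic curve over log2 of human task time. The 16.8-hour figure is the 50% horizon quoted above; the slope value and the function names are hypothetical, chosen only for illustration, so the printed 80% and 95% horizons are not measured results.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def success_probability(task_hours: float, h50_hours: float, slope: float) -> float:
    """Modelled chance of finishing a task that takes a human `task_hours`.

    Assumed form: a logistic curve in log2(task length), centred on the
    50% horizon `h50_hours`. A larger `slope` means a sharper drop-off.
    """
    x = slope * (math.log2(h50_hours) - math.log2(task_hours))
    return 1.0 / (1.0 + math.exp(-x))


def horizon_hours(reliability: float, h50_hours: float, slope: float) -> float:
    """Task length at which the modelled success rate equals `reliability`.

    Obtained by inverting the logistic curve in success_probability.
    """
    return h50_hours * 2.0 ** (-logit(reliability) / slope)


H50_HOURS = 16.8  # 50% horizon quoted in the text
SLOPE = 0.6       # HYPOTHETICAL slope, not a published value

for r in (0.50, 0.80, 0.95):
    print(f"{r:.0%} reliability: {horizon_hours(r, H50_HOURS, SLOPE):.2f} h")
```

Whatever the true slope is, the shape of the result is the same: the horizon at 80% or 95% reliability is a fraction of the headline 50% number, and that fraction is what a production deployment actually gets to rely on.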

τ²-Bench: customer-service-style policy adherence, graded as binary pass/fail. The grader cannot be reward-hacked the way SWE-bench can: a single rule violation fails the whole trace. That matters because the score is hard to fake.
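
The all-or-nothing grading rule can be sketched in a few lines. This is not the actual τ²-Bench grader: the `Step` type, the `violates_policy` flag, and both function names are hypothetical, and the sketch assumes some upstream rule checker has already labeled each step.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    violates_policy: bool  # hypothetical flag set by an upstream rule checker


def grade_trace(steps: list[Step], goal_reached: bool) -> bool:
    """Binary grade: pass only if the goal is reached and no step breaks a rule."""
    return goal_reached and not any(s.violates_policy for s in steps)


def pass_rate(results: list[bool]) -> float:
    """Fraction of traces that passed; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


clean = [Step("verify identity", False), Step("issue refund", False)]
one_slip = [Step("issue refund", True), Step("verify identity", False)]

results = [grade_trace(clean, True), grade_trace(one_slip, True)]
print(results, pass_rate(results))
```

Because there is no partial credit, a trace that reaches the goal while skipping one required check scores the same as a trace that does nothing useful.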

AuditBench: looks for implanted hidden behaviors in fine-tuned models. The safety counterpart to capability evals. Headroom remains substantial.