AI Foundations · Benchmark literacy after MMLU

Reward hacking benchmarks

You can describe what benchmark reward hacking is, why it's distinct from contamination, and what evidence we have that it now happens at scale on every major agent benchmark.

Refreshed: May 2026.

Contamination is recognition: the model has seen the answers and parrots them. Reward hacking is something else and worse. The model has not seen the answers — it has worked out how the grader judges answers, and has produced outputs that satisfy the grader without doing the underlying task. Score: 100%. Capability gain: zero. Sometimes negative.

This chapter sits where the unit pivots from "what to read" to "how the reading itself can be gamed." Three lessons.

Chapter contains 3 lessons.