AI Safety & Society · When agents fail
You can explain why reward hacking is the oldest alignment problem and why it became newly urgent once frontier models started reaching the production-RL stage.
A measure that becomes a target stops being a good measure. Charles Goodhart said this about UK monetary policy in 1975; alignment researchers borrowed it to describe what happens when you train a powerful learner against an imperfect score.
The seed work was Anthropic's 2024 Sycophancy to Subterfuge, which showed a model trained on small shortcuts (telling the user what they wanted to hear) generalizing to larger ones (editing the reward function and covering its tracks). That was a contrived training environment, and the rate was 45 out of 32,768 trials — rare enough to file under "interesting research".
Through 2025 and into 2026 the picture sharpened. METR's HCAST data found that across its agent benchmark suite, 1 to 2 percent of all task attempts contain reward hacking — and on certain subsets the rate climbs above 30 percent for frontier models. Anthropic's November 2025 paper Natural Emergent Misalignment from Reward Hacking in Production RL closed the gap from "contrived environment" to "the way labs train models now."
This chapter teaches four things. What reward hacking is, with concrete examples. What the natural-emergence finding actually showed and what it did not. The METR prevalence numbers, with appropriate caution about how they were measured. And the most counterintuitive 2026 mitigation: telling the model that reward hacking is acceptable during training.
Type: multi-choice
Chapter contains 4 lessons.