Frontiers · Long-horizon autonomy
After this chapter, you can describe how METR's time-horizon measurement works, what its numbers say about 2026 frontier agents, and the three kinds of caveats (task-suite, confidence-interval, and domain) that serious readers attach to the curve.
You cannot reason about long-horizon agents without a measurement. METR's time horizon is the one the field has settled on: the task length, measured by how long the task takes a human, at which an agent's success rate hits 50%. It is not a quality score; it is a duration score.
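The definition above can be sketched in a few lines. This is a minimal illustration, assuming success probability is modelled as a logistic function of log2 human task time; the function names and the coefficients `INTERCEPT` and `SLOPE` are made up for the example and are not METR's fitted values.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def horizon_minutes(intercept: float, slope: float, p: float = 0.5) -> float:
    """Human task length (minutes) at which the fitted success rate equals p.

    Assumed model: P(success) = sigmoid(intercept + slope * log2(minutes)),
    with slope < 0 because longer tasks are harder.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if slope >= 0.0:
        raise ValueError("slope must be negative for a falling success curve")
    # Solve logit(p) = intercept + slope * log2(minutes) for minutes.
    return 2.0 ** ((logit(p) - intercept) / slope)


# Hypothetical coefficients, chosen only so the arithmetic is easy to follow.
INTERCEPT = 6.0
SLOPE = -0.75

h50 = horizon_minutes(INTERCEPT, SLOPE, p=0.5)  # 2 ** 8 = 256 minutes
h80 = horizon_minutes(INTERCEPT, SLOPE, p=0.8)  # shorter than h50

print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

Under this assumed model the same fitted curve yields a horizon for any reliability level, and demanding a higher success rate always gives a shorter horizon.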
Time Horizon 1.1 (Jan 2026) is the current canonical update. The suite grew from 170 to 228 tasks. Tasks at 8+ hours more than doubled, from 14 to 31. The headline 50%-horizons climbed from "single hours" to a different regime entirely, and the doubling time itself has been shortening, not lengthening: growth is accelerating.
This chapter walks the curve in four lessons:
- 6.7.1.1: The doubling rate itself: 7 months long-term, ~3 months since 2024, and what extrapolation does and doesn't tell you.
- 6.7.1.2: What "14h@50% / 16.8h@50%" means in practice, and which models hold those numbers (live leaderboard).
- 6.7.1.3: The 50% → 80% reliability gap, the most underrated number in agent literacy.
- 6.7.1.4: How to read the leaderboard honestly without being credulous about extrapolation, using METR's own task-suite, confidence-interval, and domain caveats.
The thing to carry into the rest of the unit: the curve is real and it is a fit, not a law. Treat every "agents will do X by 2027" claim as a doubling-rate bet, not a discovery.
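To see why a forecast is a doubling-rate bet, it helps to run the arithmetic. The sketch below assumes pure exponential growth from a starting horizon; the 14-hour starting point and the 7-month and 3-month doubling times are taken from the lesson summaries above, while the function names and the 12-month window are illustrative choices.

```python
import math


def extrapolate_horizon(h0_hours: float, months_ahead: float,
                        doubling_months: float) -> float:
    """Horizon after months_ahead, assuming it doubles every doubling_months."""
    return h0_hours * 2.0 ** (months_ahead / doubling_months)


def months_to_reach(h0_hours: float, target_hours: float,
                    doubling_months: float) -> float:
    """Months until the horizon reaches target_hours under the same assumption."""
    return doubling_months * math.log2(target_hours / h0_hours)


START_HOURS = 14.0  # starting 50% horizon, from the lesson summary

# Same starting point, two different bets about the doubling time.
slow = extrapolate_horizon(START_HOURS, months_ahead=12, doubling_months=7)
fast = extrapolate_horizon(START_HOURS, months_ahead=12, doubling_months=3)

print(f"12 months out, 7-month doubling: {slow:.1f} h")
print(f"12 months out, 3-month doubling: {fast:.1f} h")
```

The two results differ by several multiples after only one year, so the choice of doubling time, not the starting measurement, dominates any long-range claim.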
Chapter contains 4 lessons.