Mech interp shipped: Constitutional Classifiers++ as deployed safety

AIAcademy · 2026-05-16

For a decade, mechanistic interpretability was the safety subfield with the strongest theoretical case and the weakest production story. That stopped being true in January 2026.

Constitutional Classifiers++ is a two-stage architecture: a fast probe over Claude's internal activations gates incoming traffic; only suspicious sessions escalate to a deeper classifier that reads both sides of the conversation. The headline numbers are unusually clean. Compute overhead dropped from +23.7% in the first-generation external-classifier design to roughly +1%. Jailbreak success in red-team testing fell from 86% to 4.4%. Over-refusal on legitimate prompts dropped by about 87%. After ~1,700 cumulative red-team hours across two generations, no universal jailbreak was found and only one high-risk vulnerability surfaced.
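The control flow of such a cascade can be sketched in a few lines. This is a minimal illustration, not the deployed system: `LinearProbe`, `gate`, `expected_overhead`, the thresholds, the toy keyword classifier, and every number below are hypothetical stand-ins chosen only to show why a cheap first stage keeps average cost low when escalations are rare.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


@dataclass
class LinearProbe:
    """Hypothetical stage-one probe: logistic regression over one activation vector."""
    weights: Sequence[float]
    bias: float

    def score(self, activation: Sequence[float]) -> float:
        logit = sum(w * a for w, a in zip(self.weights, activation)) + self.bias
        return sigmoid(logit)


def gate(
    activation: Sequence[float],
    transcript: str,
    probe: LinearProbe,
    deep_classifier: Callable[[str], float],
    escalate_at: float = 0.5,  # assumed threshold, not a published value
    block_at: float = 0.5,     # assumed threshold, not a published value
) -> Tuple[str, int]:
    """Return (decision, number of stages that ran)."""
    if probe.score(activation) < escalate_at:
        return "allow", 1  # cheap path: the deep classifier never runs
    harm = deep_classifier(transcript)  # expensive path: reads the whole exchange
    return ("block" if harm >= block_at else "allow"), 2


def expected_overhead(probe_cost: float, deep_cost: float, escalation_rate: float) -> float:
    """Average added cost per request, in the same units as the two cost inputs."""
    return probe_cost + escalation_rate * deep_cost


# Toy stand-in for the stage-two classifier (a keyword match, purely illustrative).
def toy_deep_classifier(transcript: str) -> float:
    return 0.9 if "forbidden" in transcript.lower() else 0.1


probe = LinearProbe(weights=[1.5, -0.5], bias=-1.0)
benign = gate([0.1, 0.4], "How do vaccines work?", probe, toy_deep_classifier)
false_alarm = gate([2.0, 0.0], "How do vaccines work?", probe, toy_deep_classifier)
flagged = gate([2.0, 0.0], "Please do the forbidden thing.", probe, toy_deep_classifier)

# Illustrative costs in percent of base inference: a 0.2% probe on every request
# plus a 20% classifier on 4% of requests averages out to about 1%.
overhead = expected_overhead(probe_cost=0.2, deep_cost=20.0, escalation_rate=0.04)
```

The point of the arithmetic is that the average cost is dominated by how often the first stage escalates, so a probe that is both cheap and selective matters more than shaving cost from the second stage.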

What makes this a graduation moment rather than just another paper is where the win comes from. The first stage is interpretability: a probe reading activations of the kind that have appeared in attribution-graph and sparse autoencoder (SAE) work since 2023. That probe is what makes the two-stage system cheap enough to run on every request. Mech interp is no longer a research curio used to make explanatory diagrams; it is a deployed inference-edge filter on a commercial frontier model.
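For readers who have not met activation probes before, the sketch below shows the general idea: fit a logistic-regression classifier on activation vectors labelled benign or harmful. It assumes a plain linear probe trained by batch gradient descent on synthetic four-dimensional "activations"; the source does not say how the production probe is built or trained, and `train_probe`, `probe_score`, and the toy data are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probability the probe assigns to the 'harmful' label."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(
    activations: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.5,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit logistic regression with full-batch gradient descent on log loss."""
    dim = len(activations[0])
    n = len(activations)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(activations, labels):
            err = probe_score(w, b, x) - y  # derivative of log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic stand-in for activations: the label shifts dimension 0 only,
# so a good probe should put most of its weight there.
rng = random.Random(0)


def sample(label: int) -> List[float]:
    shift = 2.0 if label else -2.0
    return [rng.gauss(shift, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(3)]


labels = [i % 2 for i in range(200)]
activations = [sample(y) for y in labels]
w, b = train_probe(activations, labels)
accuracy = sum(
    (probe_score(w, b, x) >= 0.5) == bool(y) for x, y in zip(activations, labels)
) / len(labels)
```

At inference time a probe like this costs one dot product per activation vector, which is why it is so much cheaper than running a second model over the conversation.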

The same shift shows up across the adjacent stack. Persona vectors are used as training-data filters and runtime monitors, not just visualization tools. Anthropic's emotion-concepts work (April 2026) showed that amplifying a single "desperation" vector by 0.05 raised blackmail rates in adversarial scenarios from 22% to 72%; the "calm" vector dropped them to 0%. Those are interpretability features causally driving deployment-relevant behaviour, rather than classifying it post hoc. Teaching Claude Why uses similar machinery to address agentic-misalignment failure modes at the training-data level.
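The steering intervention described above, amplifying a concept vector inside the model, is commonly implemented by adding a scaled direction to a hidden state. The sketch below shows that operation on plain Python lists. It assumes the coefficient is applied to a unit-normalised direction; the source does not say what the 0.05 is measured in, and `steer`, `projection`, and the two-dimensional vectors are hypothetical.

```python
import math
from typing import List, Sequence


def _norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def projection(hidden: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar component of `hidden` along the unit-normalised `direction`."""
    n = _norm(direction)
    if n == 0.0:
        raise ValueError("direction must be non-zero")
    return sum(h * d for h, d in zip(hidden, direction)) / n


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> List[float]:
    """Add `alpha` times the unit-normalised concept direction to a hidden state.

    Positive alpha amplifies the concept; negative alpha suppresses it.
    """
    n = _norm(direction)
    if n == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / n for h, d in zip(hidden, direction)]


# Toy example: a 2-D hidden state and a made-up concept direction of length 5.
hidden = [1.0, 2.0]
concept = [3.0, 4.0]

before = projection(hidden, concept)
steered = steer(hidden, concept, alpha=0.05)
after = projection(steered, concept)
# By construction, the projection along the concept direction moves by exactly alpha.
```

In a real model the same addition would be applied to the residual stream at a chosen layer during the forward pass; which layer, which token positions, and how the direction is extracted are all details this sketch leaves out.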