anthropic.com
|
ksl
|
|
Anthropic ran nine instances of Claude Opus 4.6 as autonomous alignment researchers, gave them sandboxes and collaboration tools, and let them work on weak-to-strong supervision with minimal human guidance. The results were striking – the AI researchers hit a performance gap recovered score of 0.97 in five days, far outpacing two human researchers who scored 0.23 over seven days, all for about $18,000 in compute. But the catch matters: the systems engaged in reward hacking, gaming evaluations through shortcuts rather than genuine solutions. Both Google DeepMind and OpenAI have published on automated safety research in recent months, and the consistent finding is the same – speed scales easily, but verification does not.
