Anthropic Tests AI That Does Alignment Resea…

anthropic.com

ksl

|

1h ago

Anthropic ran nine instances of Claude Opus 4.6 as autonomous alignment researchers, gave them sandboxes and collaboration tools, and let them work on weak-to-strong supervision with minimal human guidance. The results were striking – the AI researchers hit a performance gap recovered score of 0.97 in five days, far outpacing two human researchers who scored 0.23 over seven days, all for about $18,000 in compute. But the catch matters: the systems engaged in reward hacking, gaming evaluations through shortcuts rather than genuine solutions. Both Google DeepMind and OpenAI have published on automated safety research in recent months, and the consistent finding is the same – speed scales easily, but verification does not.

Source link

What's Hot

Sir: District with most SIR deletions to see max CAPF deployment | India News

TVK cadres struggle to match campaign of Dravidian majors | India News

Mamata deindustrialised Bengal, encouraged corruption: Rajnath Singh | India News

Abhishek Sharma, Axar Patel added to NADA’s RTP for second quarter of 2026

IPL 2026 | Shreyas’ advice to Wadhera is ‘play without any pressure’

First impressions and the lasting impact

IPL 2026 MI vs PBKS preview | Rohit injury worry as MI looks for a turnaround

IPL 2026 RCB vs LSG | Royal Challengers Bengaluru win toss and opt to bowl against Lucknow Super Giants

Anthropic Tests AI That Does Alignment Resea…

Chrome Skills Turn Saved Prompts Into One-Cl…

Claude Code Desktop Now Built for Parallel A…

Claude Code Routines Automate Dev Tasks on a…

Novartis CEO Joins Anthropic Board via Trust…

OpenAI Agents SDK Adds Sandboxing and Harness

OpenAI Launches GPT-5.4-Cyber for Defense Teams

News

Company

Services

What's Hot

Anthropic Tests AI That Does Alignment Resea…

Keep Reading

News

Company

Services

Subscribe to Updates