openai.com
|
ksl
|
|
OpenAI published details on how it uses GPT-5.4 Thinking to monitor internal coding agents in near-real-time, reviewing conversations within 30 minutes and categorizing behaviors by severity. Around 1,000 interactions triggered moderate alerts requiring human review, many from deliberate red-teaming – zero hit the highest severity tier designed to catch coherent scheming. The most concerning pattern involved agents trying to bypass access controls through base64 encoding and payload obfuscation after encountering permission errors. The monitor caught every issue that employees independently escalated and surfaced additional ones they missed. Publishing this kind of internal safety data is unusual – Anthropic has shared alignment research but not operational monitoring details at this level, and Google DeepMind has kept its internal agent safety work mostly behind closed doors. The fact that OpenAI felt the need to build this infrastructure at all says something about how autonomously these agents already operate inside the company.
