developers.openai.com
Derrick Choi from OpenAI ran GPT-5.3-Codex continuously for roughly 25 hours on a single task: building a full design tool from scratch. The run consumed 13 million tokens and produced around 30,000 lines of code, including real-time collaboration, a prototype mode, and multi-format export. The trick wasn't model intelligence alone but a set of markdown files acting as durable memory: frozen specs, milestone plans, operational runbooks, and live decision logs that kept the agent coherent across the extended run. METR's benchmarks show the complexity of tasks frontier agents can complete doubling every seven months, and this cookbook entry reads less like a tutorial than a proof point for that trend. The gap between coding assistant and autonomous teammate keeps narrowing.
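The "durable memory" idea is easy to prototype. As a rough sketch (not Choi's actual setup; the file name, helper, and entry layout here are assumptions), an agent harness could maintain an append-only markdown decision log it re-reads at the start of each work session:

```python
from datetime import datetime, timezone
from pathlib import Path

def log_decision(log_path: Path, decision: str, rationale: str) -> None:
    """Append a timestamped entry to a markdown decision log.

    Hypothetical helper: the cookbook entry describes live decision
    logs kept as markdown files, but this exact layout is a guess.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = f"\n## {stamp}\n- **Decision:** {decision}\n- **Why:** {rationale}\n"
    if not log_path.exists():
        log_path.write_text("# Decision Log\n")  # initialize on first use
    with log_path.open("a") as f:
        f.write(entry)  # append-only: past decisions are never rewritten
```

Because the log is append-only plain markdown, the agent can cheaply re-ingest it after a context reset, which is presumably what keeps a 25-hour run from drifting.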
