anthropic.com | ksl
Anthropic engineer Prithvi Rajasekaran published a detailed breakdown of how to keep coding agents running reliably for hours. The design is GAN-inspired: a generator agent builds features, and a separate evaluator agent grades them via Playwright against hard pass/fail thresholds. The results are concrete: a solo agent spent $9 in 20 minutes and produced broken code, while the full three-agent harness (planner, generator, evaluator) ran for 6 hours at a cost of $200 and shipped a working 10-feature application. A key finding is that agents consistently praise their own work, which makes self-evaluation essentially useless; moving critique into a dedicated agent is what makes the loop work. The post also reports that Opus 4.6 eliminated the need for the context resets that were mandatory with Sonnet 4.5, which simplifies the harness significantly. LangChain’s agent harness post from earlier this month described similar patterns at a higher level, but Anthropic is now publishing the engineering specifics, including cost breakdowns and failure modes.
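The generator/evaluator split can be illustrated with a minimal Python sketch. This is not the code from Anthropic's post: `Verdict`, `passes`, `run_feature`, the criterion names, and the threshold values are all hypothetical, and the two agents are replaced by plain stub functions so the control flow can be run without any model or browser. In the real harness the evaluator would drive the application through Playwright; here it only returns canned scores.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Verdict:
    """Evaluator output: per-criterion scores in [0, 1] plus written feedback."""
    scores: Dict[str, float]
    feedback: str


def passes(verdict: Verdict, thresholds: Dict[str, float]) -> bool:
    """Hard gate: every criterion must meet its minimum; a missing score fails."""
    return all(
        verdict.scores.get(name, 0.0) >= minimum
        for name, minimum in thresholds.items()
    )


def run_feature(
    spec: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str, str], Verdict],
    thresholds: Dict[str, float],
    max_rounds: int = 3,
) -> Tuple[bool, int]:
    """Alternate generator and evaluator until the gate passes or rounds run out.

    Returns (passed, rounds_used). The generator never grades its own output;
    it only receives the evaluator's feedback from the previous round.
    """
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        artifact = generate(spec, feedback)
        verdict = evaluate(spec, artifact)
        if passes(verdict, thresholds):
            return True, round_no
        feedback = verdict.feedback
    return False, max_rounds


# --- Stub agents (assumptions for illustration only) ---

def stub_generate(spec: str, feedback: str) -> str:
    # Pretend the generator fixes its work once it has received any critique.
    return "fixed build" if feedback else "first draft"


def stub_evaluate(spec: str, artifact: str) -> Verdict:
    # Pretend the evaluator exercised the app and scored it.
    if artifact == "fixed build":
        return Verdict({"functionality": 0.9, "design": 0.8}, "looks good")
    return Verdict({"functionality": 0.4, "design": 0.8}, "login button does nothing")


THRESHOLDS = {"functionality": 0.7, "design": 0.7}  # hypothetical values

if __name__ == "__main__":
    print(run_feature("user login", stub_generate, stub_evaluate, THRESHOLDS))
```

The point of the sketch is the separation of roles: the pass/fail decision lives entirely in `passes`, outside the generator, so a generator that is overly positive about its own output cannot talk its way past the gate.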
