Anthropic updated its Skill Creator with a built-in evaluation framework that lets authors test agent skills without writing code: define prompts, set success criteria, get pass/fail results. A new benchmarking mode tracks pass rates, execution time, and token usage, and the whole thing plugs into CI pipelines. The more interesting piece is multi-agent testing: independent agents run evals in parallel with clean contexts, while comparator agents handle A/B tests between skill versions. There's also a description optimizer that analyzes how a skill triggers against sample prompts, splitting them into train/test sets and iterating up to five times to reduce misfires. Anthropic is steadily building the tooling layer that turns prompt engineering from a craft into something closer to software QA, a pattern OpenAI and Google have been much slower to formalize.
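To make the "define prompts, set success criteria, get pass/fail results" loop concrete, here is a minimal sketch of that pattern in Python. All names here (`EvalCase`, `run_evals`, `demo_skill`) are hypothetical illustrations, not Anthropic's actual Skill Creator API, and the stand-in skill is a trivial function rather than a real agent call:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    # Hypothetical schema: a prompt plus a success criterion on the output.
    prompt: str
    check: Callable[[str], bool]

def run_evals(skill: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run each case against the skill, print pass/fail, return the pass rate."""
    passed = 0
    for case in cases:
        output = skill(case.prompt)
        ok = case.check(output)
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case.prompt!r}")
    return passed / len(cases)

# Stand-in "skill": uppercases its input. A real harness would invoke an agent here.
def demo_skill(prompt: str) -> str:
    return prompt.upper()

cases = [
    EvalCase("hello", lambda out: out == "HELLO"),
    EvalCase("world", lambda out: out.startswith("W")),
]
rate = run_evals(demo_skill, cases)
print(f"pass rate: {rate:.0%}")
```

A benchmarking mode like the one described would wrap this loop with timing and token accounting per case; the pass-rate number is what a CI pipeline would gate on.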
