blog.can.ac
Can Bölük ran a single-afternoon experiment that moved coding benchmark scores dramatically (Grok Code Fast 1 jumped from 6.7% to 68.3%) without touching any model weights. The trick was replacing the edit tool itself: his "hashline" format tags each line with a short content hash, so models reference anchors instead of having to reproduce exact text. Patch-based diff formats, it turns out, were the worst performers for nearly every model tested. Fifteen different LLMs improved, some cutting output tokens by over 60%. The finding lands squarely in a growing conversation about how much of what we attribute to model capability is actually harness and scaffolding design. Vendors restricting open-source harness experimentation may be leaving the easiest gains on the table.
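To make the idea concrete, here is a minimal sketch of a hashline-style rendering in Python. The tag length, the `hash|line` layout, and the function name are illustrative assumptions, not the exact format from the post; the point is only that each line gets a short content-derived anchor a model can cite instead of quoting the line verbatim.

```python
import hashlib


def hashline(text: str, tag_len: int = 4) -> str:
    """Render text so each line is prefixed with a short content hash.

    An edit tool can then accept instructions like "replace line a3f1"
    rather than requiring the model to reproduce the line exactly.
    (tag_len=4 and the `hash|line` layout are assumptions for this sketch.)
    """
    out = []
    for line in text.splitlines():
        # Identical lines get identical tags; any edit changes the tag.
        tag = hashlib.sha256(line.encode("utf-8")).hexdigest()[:tag_len]
        out.append(f"{tag}|{line}")
    return "\n".join(out)


print(hashline("def add(a, b):\n    return a + b"))
```

A model referencing `a3f1` (or whatever tag a line hashes to) sidesteps the whitespace and copy-fidelity failures that plague exact-match patch formats.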
