Meridian partnered with Cornell, CMU, and Scale AI to launch Spreadsheet Arena, where users submit prompts and blindly compare LLM-generated spreadsheets head to head. The most revealing finding isn't about formulas: formatting and visual structure drive user preference more than lookup functions or conditionals do. Claude-based models scored higher on numerical accuracy but lost on polish, while weaker models failed basic prompt compliance entirely. Finance professionals and crowd voters agreed only about half the time, with color-coding conventions a frequent point of disagreement. The benchmark fills a gap that coding-focused evaluations like SWE-bench never touched: most knowledge workers interact with AI through spreadsheets, not terminals, and until now there was no structured way to measure that.
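
The article doesn't describe Spreadsheet Arena's scoring method, but arena-style benchmarks built on blind pairwise votes (Chatbot Arena being the best-known example) typically aggregate those votes into Elo-style ratings. Here is a minimal sketch of that approach; the model names, K-factor, and vote format are illustrative assumptions, not details from the source:

```python
from collections import defaultdict

K = 32  # update step size; an illustrative choice, not from the source

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, model_a: str, model_b: str, winner: str) -> None:
    """Apply one blind pairwise vote; winner is 'a', 'b', or 'tie'."""
    ra, rb = ratings[model_a], ratings[model_b]
    ea = expected_score(ra, rb)          # A's expected score
    sa = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]  # A's actual score
    ratings[model_a] = ra + K * (sa - ea)
    ratings[model_b] = rb + K * ((1.0 - sa) - (1.0 - ea))

# Every model starts at 1000; each vote shifts the pair's ratings.
ratings = defaultdict(lambda: 1000.0)
votes = [("model_x", "model_y", "a"), ("model_x", "model_z", "tie")]
for a, b, w in votes:
    update_elo(ratings, a, b, w)
print(dict(ratings))
```

Because voters see the two spreadsheets without model labels, a rating computed this way reflects preference alone, which is how formatting can outrank numerical accuracy in the final standings.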
