A benchmark of 13 models on 31 tasks from three domains (Sports, HR, Sales). Every answer was checked by two independent LLM judges: first on the final number, then on whether the expression's logic is equivalent to the reference formula. The more interesting finding: in roughly 18% of correct answers, the model writes a formula that is more correct than the reference, for example by counting on the unique ID key instead of Name.
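To see why counting by the ID key can beat counting by Name, here is a minimal Python sketch. The rows, field names, and the `count_distinct` helper are hypothetical illustrations, not data or code from the benchmark. The helper mimics what a `Count(DISTINCT ...)` expression does: when two different people share a name, the name-based count merges them, while the ID-based count keeps one entry per person.

```python
# Hypothetical HR rows (not benchmark data): two different employees share a name.
employees = [
    {"employee_id": 101, "name": "Alex Kim"},
    {"employee_id": 102, "name": "Alex Kim"},
    {"employee_id": 103, "name": "Dana Ruiz"},
]


def count_distinct(rows, field):
    """Count distinct values of one field, similar in spirit to Count(DISTINCT field)."""
    return len({row[field] for row in rows})


# Counting by name collapses the two people called "Alex Kim" into one.
headcount_by_name = count_distinct(employees, "name")

# Counting by the unique key yields one entry per actual person.
headcount_by_id = count_distinct(employees, "employee_id")

print(f"by name: {headcount_by_name}, by id: {headcount_by_id}")
```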
| # | Model | Provider | Overall | Number OK | Logic OK | Better | Coincidental | Tasks passed |
|---|---|---|---|---|---|---|---|---|
| 01 | Gemini 2.5 Pro | Google | 58% | 68% | 47% | 4 | 2 | 21/31 |
| 02 | Claude Opus 4.7 | Anthropic | 41% | 55% | 27% | 4 | 4 | 17/31 |
| 03 | Claude Sonnet 4.6 | Anthropic | 36% | 52% | 20% | 3 | 6 | 16/31 |
| 04 | Mistral Large | Mistral | 34% | 45% | 23% | 3 | 4 | 14/31 |
| 05 | Grok 3 | xAI | 36% | 45% | 27% | 3 | 2 | 14/31 |
| 06 | GPT-5 | OpenAI | 30% | 39% | 20% | 2 | 4 | 12/31 |
| 07 | DeepSeek V3 (local) | DeepSeek | 24% | 32% | 17% | 2 | 2 | 10/31 |
| 08 | Gemini 2.5 Flash | Google | 18% | 26% | 10% | 2 | 2 | 8/31 |
| 09 | Claude Haiku 4.5 | Anthropic | 23% | 26% | 20% | 1 | 1 | 8/31 |
| 10 | Qwen 2.5 72B (local) | Alibaba | 18% | 19% | 17% | 1 | 0 | 6/31 |
| 11 | GPT-5 mini | OpenAI | 18% | 19% | 17% | 1 | 0 | 6/31 |
| 12 | Llama 3.3 70B (local) | Meta | 4% | 6% | 3% | 0 | 1 | 2/31 |
| 13 | Qwen 2.5 Coder 32B (local) | Alibaba | 4% | 6% | 3% | 1 | 0 | 2/31 |
Phase 1 — 13 models × 31 tasks × 1 standard prompt (the qualifier). Phase 2 — the top 5 finalists × 31 tasks × 3 prompt levels (minimal / standard / enriched). The goal of Phase 2 is to measure the effect of prompt engineering.
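The two phases can be read as a run matrix. The sketch below enumerates it with placeholder indices instead of real model and task names, and counts generation calls only. It deliberately leaves out judge calls and any repeated runs, so it is not expected to add up to the total request figure reported for the benchmark.

```python
from itertools import product

# Sizes taken from the phase description; indices stand in for real names.
N_MODELS = 13
N_FINALISTS = 5
N_TASKS = 31
PROMPT_LEVELS = ("minimal", "standard", "enriched")

# Phase 1: every model on every task with the single standard prompt.
phase1_runs = list(product(range(N_MODELS), range(N_TASKS), ("standard",)))

# Phase 2: only the finalists, on every task, at each prompt level.
phase2_runs = list(product(range(N_FINALISTS), range(N_TASKS), PROMPT_LEVELS))

print(f"Phase 1 generation calls: {len(phase1_runs)}")
print(f"Phase 2 generation calls: {len(phase2_runs)}")
```

Each tuple is one (model, task, prompt level) generation call: 403 in Phase 1 and 465 in Phase 2.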
Claude Opus 4.7 checks whether the final number matches the reference KPI. Claude Sonnet 4.6 checks whether the expression itself is equivalent to the reference formula. The difference between the two scores is the "logic gap."
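As a worked example of the logic gap, the sketch below subtracts the Logic OK score from the Number OK score for the top five rows of the results table. The function and variable names are illustrative, and the inputs are the rounded percentages from the table, so the gaps are approximate to within a point.

```python
# (Number OK %, Logic OK %) copied from the top five rows of the results table.
scores = {
    "Gemini 2.5 Pro": (68, 47),
    "Claude Opus 4.7": (55, 27),
    "Claude Sonnet 4.6": (52, 20),
    "Mistral Large": (45, 23),
    "Grok 3": (45, 27),
}


def logic_gap(number_ok, logic_ok):
    """Percentage points by which number accuracy exceeds logic equivalence."""
    return number_ok - logic_ok


gaps = {model: logic_gap(*pair) for model, pair in scores.items()}

# Largest gap first: these models most often hit the number without matching the logic.
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {gap} pp")
```

A larger gap means a bigger share of numerically correct answers came from expressions the second judge did not accept as equivalent to the reference.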
31 verified Set Analysis tasks from three domains: Sports, HR, Sales. We used the QATA platform to automatically check results against the references.
~4,300 requests, ~2.7M tokens. 70% of the budget went to the LLM-as-judge step (Opus in Phase 1); rerunning that step with Sonnet costs 14× less. Reasoning models (GPT-5, Gemini 2.5 Pro) required `max_tokens=4000` and `reasoning_effort=low`.
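A rough back-of-the-envelope for what the cheaper judge means for the whole budget. It assumes, as a simplification not stated in the source, that the entire 70% judge share shrinks by the 14× factor while the remaining 30% stays the same. The function name is illustrative.

```python
def relative_budget(judge_share, judge_cost_ratio):
    """Total cost as a fraction of the original budget after the judge gets cheaper.

    judge_share: fraction of the original budget spent on judging (0..1).
    judge_cost_ratio: how many times cheaper the replacement judge is.
    Assumption: non-judge spending stays constant.
    """
    non_judge = 1.0 - judge_share
    return non_judge + judge_share / judge_cost_ratio


new_total = relative_budget(judge_share=0.70, judge_cost_ratio=14)
print(f"New total: {new_total:.0%} of the original budget")
print(f"Saving: {1 - new_total:.0%}")
```

Under that assumption the judge step drops from 70% to about 5% of the original budget, so the full run would cost roughly 35% of what it did.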
Download the full report. Inside: a detailed breakdown of coincidental correctness with code examples, a per-domain split, the effect of different prompts, a ±5–15 pp stability test, a cost table, and production recommendations by scenario.