Open Research · Updated June 2026

Which LLM writes Qlik Set Analysis best?

A benchmark of 13 models on 31 tasks from three domains (Sports, HR, Sales). Every answer was checked by two independent LLM judges: first on the final number, then on whether the expression's logic is equivalent to the reference formula. And here's the more interesting part: in roughly 18% of correct answers the model writes a formula that's more correct than the reference — for example, counting by the unique ID key instead of Name.

Models 13 Tasks 31 Domains 3 Budget $17.35 By Datanomix
#? Model? Provider? Overall? Number OK? Logic OK? Better? Coincidental? Tasks passed?
01 Gemini 2.5 Pro Google 58% 68% 47% 4 2 21/31
02 Claude Opus 4.7 Anthropic 41% 55% 27% 4 4 17/31
03 Claude Sonnet 4.6 Anthropic 36% 52% 20% 3 6 16/31
04 Mistral Large Mistral 34% 45% 23% 3 4 14/31
05 Grok 3 xAI 36% 45% 27% 3 2 14/31
06 GPT-5 OpenAI 30% 39% 20% 2 4 12/31
07 DeepSeek V3LOCAL DeepSeek 24% 32% 17% 2 2 10/31
08 Gemini 2.5 Flash Google 18% 26% 10% 2 2 8/31
09 Claude Haiku 4.5 Anthropic 23% 26% 20% 1 1 8/31
10 Qwen 2.5 72BLOCAL Alibaba 18% 19% 17% 1 0 6/31
11 GPT-5 mini OpenAI 18% 19% 17% 1 0 6/31
12 Llama 3.3 70BLOCAL Meta 4% 6% 3% 0 1 2/31
13 Qwen 2.5 Coder 32BLOCAL Alibaba 4% 6% 3% 1 0 2/31
Top tier (overall ≥45%) Mid tier Low / weak

Number OK — the final number matched the reference KPI. Logic OK — the expression is strictly equivalent to the reference formula. Better — the expression differs from the reference but is semantically more correct (counting by the unique ID key instead of Name): of 300 correct answers, 54 fall into this bucket (18%) — the model corrects the human reference. Coincidental — the number matched through a fragile formula. Overall — the average of Number OK and Logic OK.
Methodology

How we measured it — in four paragraphs.

Phase 1 · Phase 2

A two-phase design

Phase 1 — 13 models × 31 tasks × 1 standard prompt (the qualifier). Phase 2 — the top 5 finalists × 31 tasks × 3 prompt levels (minimal / standard / enriched). The goal of Phase 2 is to measure the effect of prompt engineering.

Dual judge

Two independent LLM judges

Claude Opus 4.7 checks whether the final number matches the reference KPI. Claude Sonnet 4.6 checks whether the expression itself is equivalent to the reference formula. The difference between the two scores is the "logic gap."

Tasks · qata.datanomix.pro

Real tasks with automated checking

31 verified Set Analysis tasks from three domains: Sports, HR, Sales. We used the QATA platform to automatically check results against the references.

Budget

$17.35 of $20 on OpenRouter

~4,300 requests, ~2.7M tokens. 70% of the budget went to LLM-as-judge (Opus in Phase 1) — rerunning it with Sonnet costs 14× less. Reasoning models (GPT-5, Gemini 2.5 Pro) required max_tokens=4000 + reasoning_effort=low.

Full Report · PDF · ~1 MB

Want to dig deeper?

Download the full report. Inside: a detailed breakdown of coincidental correctness with code examples, a per-domain split, the effect of different prompts, a ±5–15 pp stability test, a cost table, and production recommendations by scenario.

  • Phase 1 + Phase 2 with all the numbers
  • 114 coincidental correctness cases · 2 patterns with code
  • Cost breakdown by model
  • 3 production scenarios: Sonnet / GPT-5 / DeepSeek
  • On-prem recommendations (DeepSeek V3, Qwen, Llama)
✓ Thanks! The report opened in a new tab. If it didn't — click here.
Reproduce on GitHub