Open Research · Updated June 2026

Which LLM writes Qlik Set Analysis best?

Name: QSABench: LLM × Qlik Set Analysis Benchmark
Creator: Datanomix
Published: 2026-05-15
License: https://creativecommons.org/licenses/by/4.0/
Keywords: Qlik Set Analysis, LLM benchmark, GPT-5, Claude Opus, Gemini 2.5 Pro, DeepSeek V3, LLM evaluation, Set Analysis, Qlik, business intelligence

A benchmark of 13 models on 31 tasks from three domains (Sports, HR, Sales). Every answer was checked by two independent LLM judges: one on the final number, the other on whether the expression's logic is equivalent to the reference formula. The more interesting finding: in roughly 18% of correct answers, the model's formula is semantically more correct than the reference, for example counting by the unique ID key instead of Name.
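To make the "ID key instead of Name" point concrete, here is a minimal Python sketch (not Qlik syntax) of why the two counts can diverge. The rows, field names, and the `distinct_count` helper are hypothetical and are not taken from the benchmark data.

```python
# Hypothetical HR rows: two different employees share the name "Alex Kim".
employees = [
    {"employee_id": 101, "name": "Alex Kim", "dept": "Sales"},
    {"employee_id": 102, "name": "Alex Kim", "dept": "Sales"},
    {"employee_id": 103, "name": "Dana Ruiz", "dept": "Sales"},
]


def distinct_count(rows, field):
    """Count distinct values of one field, similar in spirit to Count(DISTINCT field)."""
    return len({row[field] for row in rows})


# Counting by name merges the two namesakes into one.
headcount_by_name = distinct_count(employees, "name")
# Counting by the unique key gives one per real employee.
headcount_by_id = distinct_count(employees, "employee_id")

print(headcount_by_name, headcount_by_id)  # prints: 2 3
```

On data with no duplicate names the two counts agree, so a Name-based formula can still produce the reference number while being the more fragile of the two.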

Models: 13 · Tasks: 31 · Domains: 3 · Budget: $17.35 · By Datanomix

#	Model	Provider	Overall	Number OK	Logic OK	Better	Coincidental	Tasks passed
01	Gemini 2.5 Pro	Google	58%	68%	47%	4	2	21/31
02	Claude Opus 4.7	Anthropic	41%	55%	27%	4	4	17/31
03	Claude Sonnet 4.6	Anthropic	36%	52%	20%	3	6	16/31
04	Mistral Large	Mistral	34%	45%	23%	3	4	14/31
05	Grok 3	xAI	36%	45%	27%	3	2	14/31
06	GPT-5	OpenAI	30%	39%	20%	2	4	12/31
07	DeepSeek V3 (LOCAL)	DeepSeek	24%	32%	17%	2	2	10/31
08	Gemini 2.5 Flash	Google	18%	26%	10%	2	2	8/31
09	Claude Haiku 4.5	Anthropic	23%	26%	20%	1	1	8/31
10	Qwen 2.5 72B (LOCAL)	Alibaba	18%	19%	17%	1	0	6/31
11	GPT-5 mini	OpenAI	18%	19%	17%	1	0	6/31
12	Llama 3.3 70B (LOCAL)	Meta	4%	6%	3%	0	1	2/31
13	Qwen 2.5 Coder 32B (LOCAL)	Alibaba	4%	6%	3%	1	0	2/31

Legend: Top tier (overall ≥45%) · Mid tier · Low / weak

Number OK: the final number matched the reference KPI.
Logic OK: the expression is strictly equivalent to the reference formula.
Better: the expression differs from the reference but is semantically more correct (for example, counting by the unique ID key instead of Name). Of 300 correct answers, 54 fall into this bucket (18%); in these cases the model corrects the human reference.
Coincidental: the number matched, but through a fragile formula.
Overall: the average of Number OK and Logic OK.
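The Overall and Better figures can be re-derived from the definitions above. The sketch below uses three rows from the table; the helper name `overall` is ours, and the inputs are the published, already rounded percentages.

```python
def overall(number_ok_pct, logic_ok_pct):
    """Overall score as defined above: the mean of Number OK and Logic OK."""
    return (number_ok_pct + logic_ok_pct) / 2


# (Number OK %, Logic OK %) copied from the leaderboard.
rows = {
    "Claude Opus 4.7": (55, 27),
    "Claude Sonnet 4.6": (52, 20),
    "Mistral Large": (45, 23),
}

for model, (number_ok, logic_ok) in rows.items():
    print(f"{model}: {overall(number_ok, logic_ok):.1f}%")

# Share of "Better" answers among all correct answers: 54 of 300.
better_share_pct = round(54 / 300 * 100)
print(f"Better share: {better_share_pct}%")
```

Because the inputs are rounded, a recomputed Overall can differ from the published one by about a point. Gemini 2.5 Pro, for instance, gives (68 + 47) / 2 = 57.5 against a published 58%.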

Methodology

How we measured it — in four paragraphs.

Phase 1 · Phase 2

A two-phase design

Phase 1 — 13 models × 31 tasks × 1 standard prompt (the qualifier). Phase 2 — the top 5 finalists × 31 tasks × 3 prompt levels (minimal / standard / enriched). The goal of Phase 2 is to measure the effect of prompt engineering.
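As a rough sketch, the two-phase grid can be enumerated to see how many model answers a single pass of each phase produces. The model and task identifiers below are placeholders rather than the benchmark's real names, and judge calls or reruns are not counted.

```python
from itertools import product

# Placeholder identifiers; only the counts come from the text.
all_models = [f"model_{i:02d}" for i in range(1, 14)]  # 13 models
finalists = all_models[:5]                             # top 5 (placeholder selection)
tasks = [f"task_{i:02d}" for i in range(1, 32)]        # 31 tasks

phase1_prompts = ["standard"]
phase2_prompts = ["minimal", "standard", "enriched"]

# Every (model, task, prompt level) combination is one model answer.
phase1_runs = list(product(all_models, tasks, phase1_prompts))
phase2_runs = list(product(finalists, tasks, phase2_prompts))

print(len(phase1_runs))  # 13 * 31 * 1 = 403
print(len(phase2_runs))  # 5 * 31 * 3 = 465
```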

Dual judge

Two independent LLM judges

Claude Opus 4.7 checks whether the final number matches the reference KPI. Claude Sonnet 4.6 checks whether the expression itself is equivalent to the reference formula. The difference between the two scores is the "logic gap."
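A short sketch of the logic gap for a few leaderboard rows, using the published percentages. We assume the gap is taken as Number OK minus Logic OK, in percentage points; the text does not state the sign convention.

```python
# (Number OK %, Logic OK %) copied from the leaderboard.
scores = {
    "Gemini 2.5 Pro": (68, 47),
    "Claude Opus 4.7": (55, 27),
    "Claude Sonnet 4.6": (52, 20),
    "GPT-5": (39, 20),
}


def logic_gap(number_ok_pct, logic_ok_pct):
    """Gap in percentage points: right number, but not an equivalent formula."""
    return number_ok_pct - logic_ok_pct


gaps = {model: logic_gap(*pair) for model, pair in scores.items()}

# Largest gap first.
for model, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {gap} pp")
```

Among these four rows, Claude Sonnet 4.6 has the widest gap (32 pp) and GPT-5 the narrowest (19 pp).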

Tasks · qata.datanomix.pro

Real tasks with automated checking

31 verified Set Analysis tasks from three domains: Sports, HR, Sales. We used the QATA platform to automatically check results against the references.

Budget

$17.35 of $20 on OpenRouter

~4,300 requests, ~2.7M tokens. 70% of the budget went to LLM-as-judge calls (Opus in Phase 1); rerunning the judging with Sonnet costs 14× less. Reasoning models (GPT-5, Gemini 2.5 Pro) required max_tokens=4000 and reasoning_effort=low.
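The budget figures can be turned into back-of-the-envelope numbers. This is a sketch only: it assumes the 70% judge share and the 14× factor apply to the full $17.35 spend, and it treats the request count as exactly 4,300, although the text gives it as approximate.

```python
spent_usd = 17.35
cap_usd = 20.00
requests = 4300        # "~4,300" in the text, treated as exact here
judge_share = 0.70     # share of spend that went to LLM-as-judge
sonnet_factor = 14     # "14x less" when judging with Sonnet

budget_left = cap_usd - spent_usd
judge_cost_opus = spent_usd * judge_share
judge_cost_sonnet = judge_cost_opus / sonnet_factor
avg_cost_per_request = spent_usd / requests

print(f"Unspent budget:      ${budget_left:.2f}")
print(f"Judge cost (Opus):   ${judge_cost_opus:.2f}")
print(f"Judge cost (Sonnet): ${judge_cost_sonnet:.2f}")
print(f"Avg cost/request:    ${avg_cost_per_request:.4f}")
```

Under these assumptions, judging accounts for roughly $12 of the spend, and a Sonnet-only judge would bring that part under $1.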

Full Report · PDF · ~1 MB

Want to dig deeper?

Download the full report. Inside: a detailed breakdown of coincidental correctness with code examples, a per-domain split, the effect of different prompts, a ±5–15 pp stability test, a cost table, and production recommendations by scenario.

Phase 1 + Phase 2 with all the numbers
114 coincidental correctness cases · 2 patterns with code
Cost breakdown by model
3 production scenarios: Sonnet / GPT-5 / DeepSeek
On-prem recommendations (DeepSeek V3, Qwen, Llama)