
Results

Full leaderboard

This table lists every model that completed at least 4 trials across the 7 tasks, sorted by mean judge score. Pass % is the share of trials where the judge returned passes_key_criteria: true. Wall is end-to-end latency, including the model's reasoning trace. 1P-EST is the apples-to-apples cost at each provider's public-API list price (see the pricing footnote on the home page). SUBSCRIPTION shows the actual wire cost: "free / sub" if the user is on a free or flat-rate subscription tier (skynet, ChatGPT Plus, Anthropic Max), otherwise the real per-call cost. Use 1P-EST to compare across providers; use SUBSCRIPTION to see what you actually paid.

#	MODEL	PROVIDER	KIND	SCORE	PASS %	MED WALL	1P-EST	SUBSCRIPTION	TOKENS	REASON
1	claude-fable	claude-cli	long-context	4.23 / 5	80%	15.7s	$0.4903	$0.0327 wire	29835 in / 570 out	—
2	grok-build-0.1	xai-direct	general	4.17 / 5	74%	25.9s	$0.2771	$0.3512 wire	14770 in / 8130 out	—
3	ollama-cl/nemotron-3-ultra	skynet	general	4 / 5	69%	124.8s	—	free / sub	11537 in / 46692 out	7 × thinking-only
4	mistral/mistral-medium	skynet	general	3.97 / 5	74%	3.8s	$0.1476	free / sub	11955 in / 14240 out	—
5	grok-4-1-fast-direct	xai-direct	reasoning	3.94 / 5	69%	5.4s	$0.0066	$0.0741 wire	14980 in / 7178 out	—
6	zai-coding/glm-5	skynet	general	3.91 / 5	71%	30.9s	$0.2292	free / sub	10950 in / 68218 out	10 × thinking-only
7	zai-coding/glm-5.2	skynet	coding	3.86 / 5	66%	27.1s	$0.1979	free / sub	10950 in / 58425 out	6 × thinking-only
8	mistral/devstral	skynet	coding	3.83 / 5	71%	3.9s	—	free / sub	11535 in / 11560 out	—
9	ollama-cl/minimax-m2.5	skynet	general	3.77 / 5	66%	11.8s	—	free / sub	10990 in / 40553 out	2 × thinking-only
10	grok-4.3-direct	xai-direct	general	3.77 / 5	63%	5.6s	$0.1525	$0.0821 wire	14980 in / 7170 out	—
11	codex-gpt-5.4	codex-cli	general	3.74 / 5	69%	11.1s	$1.7171	free / sub	605922 in / 20228 out	—
12	minimax-m2.7	skynet	reasoning	3.71 / 5	69%	29.4s	$0.0519	free / sub	10928 in / 62138 out	8 × thinking-only
13	codex-gpt-5.5	codex-cli	reasoning	3.71 / 5	63%	13.3s	$3.0329	free / sub	544983 in / 15401 out	—
14	ollama-cl/deepseek-v4-pro	skynet	reasoning	3.67 / 5	65%	13.9s	$0.0884	free / sub	15001 in / 76723 out	10 × thinking-only
15	xai/grok-4.3-latest	skynet	general	3.66 / 5	69%	6.4s	$0.4872	free / sub	14980 in / 29485 out	—
16	xai/grok-3-mini	skynet	reasoning	3.66 / 5	60%	5.4s	$0.0183	free / sub	14662 in / 27846 out	—
17	claude-haiku	claude-cli	fast	3.66 / 5	66%	26.1s	$0.7821	$1.1096 wire	30281 in / 150358 out	—
18	xai/grok-4-1-fast-reasoning	skynet	reasoning	3.63 / 5	60%	5.9s	$0.0180	free / sub	14980 in / 29918 out	—
19	minimax-m2.7-highspeed	skynet	reasoning	3.57 / 5	63%	27.5s	$0.0272	free / sub	10920 in / 65351 out	9 × thinking-only
20	claude-opus	claude-cli	reasoning	3.57 / 5	66%	13.9s	$0.4895	$0.0326 wire	29835 in / 560 out	—
21	minimax-m3	skynet	reasoning	3.54 / 5	54%	16.6s	$0.0794	free / sub	16380 in / 62033 out	9 × thinking-only
22	codex-gpt-5.4-mini	codex-cli	fast	3.46 / 5	63%	11.2s	$0.2190	free / sub	598486 in / 32888 out	—
23	claude-sonnet	claude-cli	general	3.4 / 5	66%	13.9s	$0.0982	$0.0327 wire	29835 in / 579 out	—
24	zai-coding/glm-4.7	skynet	general	3.31 / 5	51%	45.1s	$0.2658	free / sub	9699 in / 118188 out	—
25	ollama-cl/kimi-k2.5	skynet	general	2.46 / 5	29%	19.8s	$0.2022	free / sub	10885 in / 88935 out	20 × thinking-only
26	mistral/mistral-large	skynet	general	1.83 / 5	37%	5.3s	$0.0079	free / sub	5586 in / 3413 out	—
27	qwen3.6:27b	skynet	general	0.82 / 5	16%	135.4s	$0.0652	free / sub	14244 in / 103865 out	29 × thinking-only
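
As a rough illustration of how the SCORE, PASS % and MED WALL columns could be derived from per-trial judge records, here is a minimal sketch. Only passes_key_criteria comes from this page; the other field names (score, wall_s), the reading of MED WALL as a median, and the sample values are assumptions made for the example.

```python
from statistics import mean, median

# Hypothetical per-trial records for one model. Field names other than
# passes_key_criteria are assumptions, and the values are made up.
trials = [
    {"score": 5, "passes_key_criteria": True, "wall_s": 12.0},
    {"score": 4, "passes_key_criteria": True, "wall_s": 16.0},
    {"score": 0, "passes_key_criteria": False, "wall_s": 60.0},  # e.g. no answer printed
    {"score": 3, "passes_key_criteria": False, "wall_s": 14.0},
]

def summarize(trials):
    """Return (mean judge score, pass percentage, median wall seconds)."""
    score = mean(t["score"] for t in trials)
    # True counts as 1 when summed, so this is the share of passing trials.
    pass_pct = 100 * sum(t["passes_key_criteria"] for t in trials) / len(trials)
    med_wall = median(t["wall_s"] for t in trials)
    return score, pass_pct, med_wall

score, pass_pct, med_wall = summarize(trials)
print(f"{score:.2f} / 5  {pass_pct:.0f}%  {med_wall:.1f}s")
```

A median is used for latency in this sketch because a single timed-out trial would otherwise dominate the figure.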

Quality per dollar

Same models, sorted by score and then by 1P-EST cost (the apples-to-apples public-API rate). The last column is 1P-EST divided by score, the list-price cost of one score point. Read the list as a diagonal: the top is the bargain tier (cheap, high score); at the bottom you pay 10–50× more for 0.3 score points. Three rules of thumb from the data:

For most agent tasks (structured output, summarization, simple reasoning), the cheap tier matches or beats the expensive tier at a fraction of the cost.
For long-context creative prose and dense code generation, the more expensive models do pull ahead — but usually by less than the cost ratio suggests.
Thinking-only answers from reasoning models (the model thought but never printed content) are a hidden cost: the trial timed out, the user got no answer, and the score is 0. The cheap tier has fewer of these because it has less reasoning budget to burn.
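
To make the hidden cost in the last rule concrete, the sketch below compares a model's mean score over all trials with its mean over only the trials that produced an answer (the leaderboard's REASON column labels the unanswered ones thinking-only). The function name and the sample scores are hypothetical; they are not taken from the benchmark data.

```python
def thinking_only_penalty(scores, thinking_only):
    """Return (mean over all trials, mean over answered trials).

    scores: judge score per trial; thinking-only trials are scored 0.
    thinking_only: parallel list of booleans, True where no answer was printed.
    """
    answered = [s for s, flagged in zip(scores, thinking_only) if not flagged]
    mean_all = sum(scores) / len(scores)
    mean_answered = sum(answered) / len(answered) if answered else 0.0
    return mean_all, mean_answered

# Made-up example: 10 trials, 2 of them thinking-only.
scores = [5, 4, 0, 5, 4, 0, 5, 4, 4, 5]
flags = [False, False, True, False, False, True, False, False, False, False]

mean_all, mean_answered = thinking_only_penalty(scores, flags)
print(f"all trials: {mean_all:.2f}, answered only: {mean_answered:.2f}")
```

The gap between the two means is the share of the leaderboard score lost to trials that never returned an answer.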

MODEL	SCORE	1P-EST	$ / 1.0 SCORE POINT (1P)
claude-fable	4.23	$0.4903	$0.1159
grok-build-0.1	4.17	$0.2771	$0.0665
grok-4-1-fast-direct	3.94	$0.0066	$0.0017
grok-4.3-direct	3.77	$0.1525	$0.0404
claude-haiku	3.66	$0.7821	$0.2137
claude-opus	3.57	$0.4895	$0.1371
claude-sonnet	3.4	$0.0982	$0.0289
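
The last column can be reproduced by dividing 1P-EST by score. The sketch below does this with the rounded figures printed in the table; because those inputs are rounded, the fourth decimal can differ from the published value (grok-4.3-direct comes out near $0.0405 here against the listed $0.0404). The helper name cost_per_point is an illustrative choice.

```python
# (model, mean score, 1P-EST in USD), copied from the table above.
rows = [
    ("claude-fable", 4.23, 0.4903),
    ("grok-build-0.1", 4.17, 0.2771),
    ("grok-4-1-fast-direct", 3.94, 0.0066),
    ("grok-4.3-direct", 3.77, 0.1525),
    ("claude-haiku", 3.66, 0.7821),
    ("claude-opus", 3.57, 0.4895),
    ("claude-sonnet", 3.40, 0.0982),
]

def cost_per_point(score, cost):
    """List-price cost of one score point."""
    if score <= 0:
        raise ValueError("score must be positive")
    return cost / score

# Cheapest cost per score point first.
ranked = sorted(rows, key=lambda r: cost_per_point(r[1], r[2]))
for model, score, cost in ranked:
    print(f"{model:24s} ${cost_per_point(score, cost):.4f}")
```

Sorting by this ratio instead of by score puts grok-4-1-fast-direct first and claude-haiku last among these seven models.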

Per-task leaderboards

The leader changes by task. Click any task on the /tasks/ page to see every model's responses.

agentic_prompt

1 ollama-cl/minimax-m2.5 5.0
2 minimax-m2.7 5.0
3 mistral/devstral 5.0

code_debug

1 mistral/mistral-large 5.0
2 mistral/mistral-medium 4.0
3 ollama-cl/nemotron-3-ultra 3.4

code_gen_long

1 minimax-m2.7-highspeed 5.0
2 zai-coding/glm-5.2 5.0
3 zai-coding/glm-5 5.0

creative_write

1 zai-coding/glm-5 2.0
2 claude-fable 2.0
3 claude-haiku 2.0

json_strict

1 minimax-m3 5.0
2 ollama-cl/minimax-m2.5 5.0
3 minimax-m2.7 5.0

reasoning_multistep

1 minimax-m3 5.0
2 ollama-cl/minimax-m2.5 5.0
3 minimax-m2.7 5.0

summarize

1 claude-fable 5.0
2 claude-haiku 5.0
3 zai-coding/glm-5 4.8
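
A minimal sketch of how top-3 lists like these could be built from (task, model, score) rows. The rows are a few entries copied from the lists above; the tie-breaking rule (tied models keep their input order) is an assumption, since the page does not say how ties are ordered.

```python
from collections import defaultdict

# (task, model, mean score), deliberately out of order.
rows = [
    ("summarize", "zai-coding/glm-5", 4.8),
    ("summarize", "claude-fable", 5.0),
    ("summarize", "claude-haiku", 5.0),
    ("code_debug", "ollama-cl/nemotron-3-ultra", 3.4),
    ("code_debug", "mistral/mistral-large", 5.0),
    ("code_debug", "mistral/mistral-medium", 4.0),
]

def top_n(rows, n=3):
    """Group rows by task and keep the n highest-scoring models per task."""
    by_task = defaultdict(list)
    for task, model, score in rows:
        by_task[task].append((model, score))
    # sorted() is stable even with reverse=True, so tied models keep their
    # input order (an assumed tie-break, not one documented on the page).
    return {
        task: sorted(entries, key=lambda e: e[1], reverse=True)[:n]
        for task, entries in by_task.items()
    }

top = top_n(rows)
for task, entries in top.items():
    print(task, [f"{model} {score}" for model, score in entries])
```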

Use-case pairing

Which model to use for which kind of work, distilled from the data. This is the "what pairs with what" cheat sheet.

agentic-planning

1 ollama-cl/minimax-m2.5 — 5.0/5
2 minimax-m2.7 — 5.0/5
3 mistral/devstral — 5.0/5

code-bug-finding

1 mistral/mistral-large — 5.0/5
2 mistral/mistral-medium — 4.0/5
3 ollama-cl/nemotron-3-ultra — 3.4/5

code-generation

1 minimax-m2.7-highspeed — 5.0/5
2 zai-coding/glm-5.2 — 5.0/5
3 zai-coding/glm-5 — 5.0/5

creative-prose

1 zai-coding/glm-5 — 2.0/5
2 claude-fable — 2.0/5
3 claude-haiku — 2.0/5

reasoning-math

1 minimax-m3 — 5.0/5
2 ollama-cl/minimax-m2.5 — 5.0/5
3 minimax-m2.7 — 5.0/5

structured-output

1 minimax-m3 — 5.0/5
2 ollama-cl/minimax-m2.5 — 5.0/5
3 minimax-m2.7 — 5.0/5

summarization

1 claude-fable — 5.0/5
2 claude-haiku — 5.0/5
3 zai-coding/glm-5 — 4.8/5
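
The use-case lists carry the same rankings as the per-task leaderboards under different labels. The sketch below shows one way to look up a recommendation by use case. The task-to-use-case mapping is inferred from the names and matching rankings, not stated on the page, and the function and variable names are hypothetical.

```python
# Inferred mapping from use-case label to benchmark task (an assumption).
TASK_FOR_USE_CASE = {
    "agentic-planning": "agentic_prompt",
    "code-bug-finding": "code_debug",
    "code-generation": "code_gen_long",
    "creative-prose": "creative_write",
    "reasoning-math": "reasoning_multistep",
    "structured-output": "json_strict",
    "summarization": "summarize",
}

# Two per-task leaderboards copied from the lists above, best first.
leaderboards = {
    "summarize": ["claude-fable", "claude-haiku", "zai-coding/glm-5"],
    "code_debug": [
        "mistral/mistral-large",
        "mistral/mistral-medium",
        "ollama-cl/nemotron-3-ultra",
    ],
}

def recommend(use_case, leaderboards):
    """Return the ranked models for a use case, or [] if the task is unknown."""
    task = TASK_FOR_USE_CASE.get(use_case)
    return leaderboards.get(task, [])

print(recommend("summarization", leaderboards))
```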