Results

Full leaderboard

Every model that completed at least 4 trials across the 7 tasks, sorted by mean judge score. PASS % is the share of trials where the judge returned passes_key_criteria: true. MED WALL is the median end-to-end latency, including the model's reasoning trace. 1P-EST is the apples-to-apples cost at each provider's public-API list price (see the pricing footnote on the home page). SUBSCRIPTION shows the actual wire cost: "free / sub" if the user is on a metered tier (skynet, ChatGPT Plus, Anthropic Max), otherwise the real per-call cost. Use 1P-EST to compare across providers; use SUBSCRIPTION to see what you actually paid.

| # | MODEL | PROVIDER | KIND | SCORE | PASS % | MED WALL | 1P-EST | SUBSCRIPTION | TOKENS | REASON |
|---|-------|----------|------|-------|--------|----------|--------|--------------|--------|--------|
| 1 | claude-fable | claude-cli | long-context | 4.23 / 5 | 80% | 15.7s | $0.4903 | $0.0327 wire | 29835 in / 570 out | |
| 2 | grok-build-0.1 | xai-direct | general | 4.17 / 5 | 74% | 25.9s | $0.2771 | $0.3512 wire | 14770 in / 8130 out | |
| 3 | ollama-cl/nemotron-3-ultra | skynet | general | 4 / 5 | 69% | 124.8s | | free / sub | 11537 in / 46692 out | 7 × thinking-only |
| 4 | mistral/mistral-medium | skynet | general | 3.97 / 5 | 74% | 3.8s | $0.1476 | free / sub | 11955 in / 14240 out | |
| 5 | grok-4-1-fast-direct | xai-direct | reasoning | 3.94 / 5 | 69% | 5.4s | $0.0066 | $0.0741 wire | 14980 in / 7178 out | |
| 6 | zai-coding/glm-5 | skynet | general | 3.91 / 5 | 71% | 30.9s | $0.2292 | free / sub | 10950 in / 68218 out | 10 × thinking-only |
| 7 | zai-coding/glm-5.2 | skynet | coding | 3.86 / 5 | 66% | 27.1s | $0.1979 | free / sub | 10950 in / 58425 out | 6 × thinking-only |
| 8 | mistral/devstral | skynet | coding | 3.83 / 5 | 71% | 3.9s | | free / sub | 11535 in / 11560 out | |
| 9 | ollama-cl/minimax-m2.5 | skynet | general | 3.77 / 5 | 66% | 11.8s | | free / sub | 10990 in / 40553 out | 2 × thinking-only |
| 10 | grok-4.3-direct | xai-direct | general | 3.77 / 5 | 63% | 5.6s | $0.1525 | $0.0821 wire | 14980 in / 7170 out | |
| 11 | codex-gpt-5.4 | codex-cli | general | 3.74 / 5 | 69% | 11.1s | $1.7171 | free / sub | 605922 in / 20228 out | |
| 12 | minimax-m2.7 | skynet | reasoning | 3.71 / 5 | 69% | 29.4s | $0.0519 | free / sub | 10928 in / 62138 out | 8 × thinking-only |
| 13 | codex-gpt-5.5 | codex-cli | reasoning | 3.71 / 5 | 63% | 13.3s | $3.0329 | free / sub | 544983 in / 15401 out | |
| 14 | ollama-cl/deepseek-v4-pro | skynet | reasoning | 3.67 / 5 | 65% | 13.9s | $0.0884 | free / sub | 15001 in / 76723 out | 10 × thinking-only |
| 15 | xai/grok-4.3-latest | skynet | general | 3.66 / 5 | 69% | 6.4s | $0.4872 | free / sub | 14980 in / 29485 out | |
| 16 | xai/grok-3-mini | skynet | reasoning | 3.66 / 5 | 60% | 5.4s | $0.0183 | free / sub | 14662 in / 27846 out | |
| 17 | claude-haiku | claude-cli | fast | 3.66 / 5 | 66% | 26.1s | $0.7821 | $1.1096 wire | 30281 in / 150358 out | |
| 18 | xai/grok-4-1-fast-reasoning | skynet | reasoning | 3.63 / 5 | 60% | 5.9s | $0.0180 | free / sub | 14980 in / 29918 out | |
| 19 | minimax-m2.7-highspeed | skynet | reasoning | 3.57 / 5 | 63% | 27.5s | $0.0272 | free / sub | 10920 in / 65351 out | 9 × thinking-only |
| 20 | claude-opus | claude-cli | reasoning | 3.57 / 5 | 66% | 13.9s | $0.4895 | $0.0326 wire | 29835 in / 560 out | |
| 21 | minimax-m3 | skynet | reasoning | 3.54 / 5 | 54% | 16.6s | $0.0794 | free / sub | 16380 in / 62033 out | 9 × thinking-only |
| 22 | codex-gpt-5.4-mini | codex-cli | fast | 3.46 / 5 | 63% | 11.2s | $0.2190 | free / sub | 598486 in / 32888 out | |
| 23 | claude-sonnet | claude-cli | general | 3.4 / 5 | 66% | 13.9s | $0.0982 | $0.0327 wire | 29835 in / 579 out | |
| 24 | zai-coding/glm-4.7 | skynet | general | 3.31 / 5 | 51% | 45.1s | $0.2658 | free / sub | 9699 in / 118188 out | |
| 25 | ollama-cl/kimi-k2.5 | skynet | general | 2.46 / 5 | 29% | 19.8s | $0.2022 | free / sub | 10885 in / 88935 out | 20 × thinking-only |
| 26 | mistral/mistral-large | skynet | general | 1.83 / 5 | 37% | 5.3s | $0.0079 | free / sub | 5586 in / 3413 out | |
| 27 | qwen3.6:27b | skynet | general | 0.82 / 5 | 16% | 135.4s | $0.0652 | free / sub | 14244 in / 103865 out | 29 × thinking-only |
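
As a rough illustration, the SCORE, PASS %, and MED WALL columns above could be derived from raw trial records along these lines. This is only a sketch: the `Trial` fields (other than `passes_key_criteria`, which the text names), the `build_leaderboard` function, and the reading of MED WALL as a median are assumptions, not the benchmark's actual schema or code.

```python
from dataclasses import dataclass
from statistics import mean, median

MIN_TRIALS = 4  # inclusion threshold stated in the text


@dataclass
class Trial:
    # Hypothetical record layout, for illustration only.
    model: str
    score: float                 # judge score on the 0-5 scale
    passes_key_criteria: bool    # judge verdict, as named in the text
    wall_s: float                # end-to-end latency in seconds
    thinking_only: bool = False  # model reasoned but printed no content


def build_leaderboard(trials):
    """Group trials by model and return rows sorted by mean score, best first."""
    by_model = {}
    for t in trials:
        by_model.setdefault(t.model, []).append(t)

    rows = []
    for model, ts in by_model.items():
        if len(ts) < MIN_TRIALS:
            continue  # too few trials to be listed
        rows.append({
            "model": model,
            "score": round(mean(t.score for t in ts), 2),
            "pass_pct": round(100 * sum(t.passes_key_criteria for t in ts) / len(ts)),
            "med_wall_s": median(t.wall_s for t in ts),
            "thinking_only": sum(t.thinking_only for t in ts),
        })
    return sorted(rows, key=lambda r: r["score"], reverse=True)
```

Under these assumptions, a thinking-only trial enters the mean with whatever score the judge assigned it, so many such trials pull the SCORE column down directly.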

Quality per dollar

The same models, sorted by score and then by 1P-EST cost (the apples-to-apples public-API rate). Read it as a diagonal: the top of the list is the bargain tier (cheap, high score); at the bottom you pay 10–50× for 0.3 score points. Three rules of thumb from the data:

  1. For most agent tasks (structured output, summarization, simple reasoning), the cheap tier matches or beats the expensive tier at a fraction of the cost.
  2. For long-context creative prose and dense code generation, the more expensive models do pull ahead — but usually by less than the cost ratio suggests.
  3. "Thinking-only" answers from reasoning models (the model thought but didn't print any content) are a hidden cost: the trial timed out, the user got no answer, and the score is 0. The cheap tier has fewer of these because it has less reasoning budget to burn.
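
| MODEL | SCORE | 1P-EST | $ / 1.0 SCORE POINT (1P) |
|-------|-------|--------|--------------------------|
| claude-fable | 4.23 | $0.4903 | $0.1159 |
| grok-build-0.1 | 4.17 | $0.2771 | $0.0665 |
| grok-4-1-fast-direct | 3.94 | $0.0066 | $0.0017 |
| grok-4.3-direct | 3.77 | $0.1525 | $0.0404 |
| claude-haiku | 3.66 | $0.7821 | $0.2137 |
| claude-opus | 3.57 | $0.4895 | $0.1371 |
| claude-sonnet | 3.4 | $0.0982 | $0.0289 |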
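
The last column appears to be the 1P-EST cost divided by the mean score. The sketch below recomputes it from the rows above; the function name `dollars_per_point` is made up for illustration. Because the published score and cost are already rounded, a recomputed value can differ from the published one in the fourth decimal place (grok-4.3-direct comes out at $0.0405 rather than $0.0404).

```python
# (model, mean score, 1P-EST in USD, published $ per score point),
# copied from the table above.
ROWS = [
    ("claude-fable", 4.23, 0.4903, 0.1159),
    ("grok-build-0.1", 4.17, 0.2771, 0.0665),
    ("grok-4-1-fast-direct", 3.94, 0.0066, 0.0017),
    ("grok-4.3-direct", 3.77, 0.1525, 0.0404),
    ("claude-haiku", 3.66, 0.7821, 0.2137),
    ("claude-opus", 3.57, 0.4895, 0.1371),
    ("claude-sonnet", 3.4, 0.0982, 0.0289),
]


def dollars_per_point(score: float, cost_usd: float) -> float:
    """Cost of one score point: 1P-EST cost divided by mean score."""
    if score <= 0:
        raise ValueError("score must be positive")
    return cost_usd / score


for model, score, cost, published in ROWS:
    recomputed = dollars_per_point(score, cost)
    # Inputs are rounded, so allow a difference in the fourth decimal place.
    assert abs(recomputed - published) < 1e-4, model
    print(f"{model:22s} ${recomputed:.4f}  (published ${published:.4f})")
```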

Per-task leaderboards

The leader changes by task. Click any task on the /tasks/ page to see all 23 models' responses.

agentic_prompt
  1. ollama-cl/minimax-m2.5: 5.0
  2. minimax-m2.7: 5.0
  3. mistral/devstral: 5.0
code_debug
  1. mistral/mistral-large: 5.0
  2. mistral/mistral-medium: 4.0
  3. ollama-cl/nemotron-3-ultra: 3.4
code_gen_long
  1. minimax-m2.7-highspeed: 5.0
  2. zai-coding/glm-5.2: 5.0
  3. zai-coding/glm-5: 5.0
creative_write
  1. zai-coding/glm-5: 2.0
  2. claude-fable: 2.0
  3. claude-haiku: 2.0
json_strict
  1. minimax-m3: 5.0
  2. ollama-cl/minimax-m2.5: 5.0
  3. minimax-m2.7: 5.0
reasoning_multistep
  1. minimax-m3: 5.0
  2. ollama-cl/minimax-m2.5: 5.0
  3. minimax-m2.7: 5.0
summarize
  1. claude-fable: 5.0
  2. claude-haiku: 5.0
  3. zai-coding/glm-5: 4.8
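
The per-task lists above amount to a "group by task, sort by score, keep the top three" operation. A minimal sketch, using two of the tasks above as input; the `(task, model, score)` tuple layout and the `top_n_per_task` name are assumptions for illustration, and how the site actually breaks ties is not stated.

```python
from collections import defaultdict

# (task, model, mean task score), taken from the code_debug and summarize lists.
RESULTS = [
    ("code_debug", "mistral/mistral-medium", 4.0),
    ("code_debug", "ollama-cl/nemotron-3-ultra", 3.4),
    ("code_debug", "mistral/mistral-large", 5.0),
    ("summarize", "claude-fable", 5.0),
    ("summarize", "zai-coding/glm-5", 4.8),
    ("summarize", "claude-haiku", 5.0),
]


def top_n_per_task(results, n=3):
    """Return {task: [(model, score), ...]} with the n best models per task."""
    by_task = defaultdict(list)
    for task, model, score in results:
        by_task[task].append((model, score))
    # sorted() is stable, so tied models keep their input order.
    return {
        task: sorted(entries, key=lambda e: e[1], reverse=True)[:n]
        for task, entries in by_task.items()
    }


for task, leaders in top_n_per_task(RESULTS).items():
    print(task)
    for rank, (model, score) in enumerate(leaders, start=1):
        print(f"  {rank}. {model}: {score}")
```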

Use-case pairing

Which model to use for which kind of work, distilled from the data. This is the "what pairs with what" cheat sheet.

agentic-planning
  1. ollama-cl/minimax-m2.5: 5.0/5
  2. minimax-m2.7: 5.0/5
  3. mistral/devstral: 5.0/5
code-bug-finding
  1. mistral/mistral-large: 5.0/5
  2. mistral/mistral-medium: 4.0/5
  3. ollama-cl/nemotron-3-ultra: 3.4/5
code-generation
  1. minimax-m2.7-highspeed: 5.0/5
  2. zai-coding/glm-5.2: 5.0/5
  3. zai-coding/glm-5: 5.0/5
creative-prose
  1. zai-coding/glm-5: 2.0/5
  2. claude-fable: 2.0/5
  3. claude-haiku: 2.0/5
reasoning-math
  1. minimax-m3: 5.0/5
  2. ollama-cl/minimax-m2.5: 5.0/5
  3. minimax-m2.7: 5.0/5
structured-output
  1. minimax-m3: 5.0/5
  2. ollama-cl/minimax-m2.5: 5.0/5
  3. minimax-m2.7: 5.0/5
summarization
  1. claude-fable: 5.0/5
  2. claude-haiku: 5.0/5
  3. zai-coding/glm-5: 4.8/5