Results
Full leaderboard
Every model that completed at least 4 trials across the 7 tasks, sorted by mean judge score. PASS % is the share of trials where the judge returned passes_key_criteria: true. MED WALL is the median end-to-end latency, including the model's reasoning trace. 1P-EST is the apples-to-apples cost at each provider's public-API list price (see the pricing footnote on the home page). SUBSCRIPTION shows the actual wire cost: "free / sub" if the user is on a flat-rate subscription tier (skynet, ChatGPT Plus, Anthropic Max), otherwise the real per-call cost. REASON counts thinking-only trials, where the model produced a reasoning trace but no final content. Use 1P-EST to compare across providers; use SUBSCRIPTION to see what you actually paid.
| # | MODEL | PROVIDER | KIND | SCORE | PASS % | MED WALL | 1P-EST | SUBSCRIPTION | TOKENS | REASON |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | claude-fable | claude-cli | long-context | 4.23 / 5 | 80% | 15.7s | $0.4903 | $0.0327 wire | 29835 in / 570 out | — |
| 2 | grok-build-0.1 | xai-direct | general | 4.17 / 5 | 74% | 25.9s | $0.2771 | $0.3512 wire | 14770 in / 8130 out | — |
| 3 | ollama-cl/nemotron-3-ultra | skynet | general | 4 / 5 | 69% | 124.8s | — | free / sub | 11537 in / 46692 out | 7 × thinking-only |
| 4 | mistral/mistral-medium | skynet | general | 3.97 / 5 | 74% | 3.8s | $0.1476 | free / sub | 11955 in / 14240 out | — |
| 5 | grok-4-1-fast-direct | xai-direct | reasoning | 3.94 / 5 | 69% | 5.4s | $0.0066 | $0.0741 wire | 14980 in / 7178 out | — |
| 6 | zai-coding/glm-5 | skynet | general | 3.91 / 5 | 71% | 30.9s | $0.2292 | free / sub | 10950 in / 68218 out | 10 × thinking-only |
| 7 | zai-coding/glm-5.2 | skynet | coding | 3.86 / 5 | 66% | 27.1s | $0.1979 | free / sub | 10950 in / 58425 out | 6 × thinking-only |
| 8 | mistral/devstral | skynet | coding | 3.83 / 5 | 71% | 3.9s | — | free / sub | 11535 in / 11560 out | — |
| 9 | ollama-cl/minimax-m2.5 | skynet | general | 3.77 / 5 | 66% | 11.8s | — | free / sub | 10990 in / 40553 out | 2 × thinking-only |
| 10 | grok-4.3-direct | xai-direct | general | 3.77 / 5 | 63% | 5.6s | $0.1525 | $0.0821 wire | 14980 in / 7170 out | — |
| 11 | codex-gpt-5.4 | codex-cli | general | 3.74 / 5 | 69% | 11.1s | $1.7171 | free / sub | 605922 in / 20228 out | — |
| 12 | minimax-m2.7 | skynet | reasoning | 3.71 / 5 | 69% | 29.4s | $0.0519 | free / sub | 10928 in / 62138 out | 8 × thinking-only |
| 13 | codex-gpt-5.5 | codex-cli | reasoning | 3.71 / 5 | 63% | 13.3s | $3.0329 | free / sub | 544983 in / 15401 out | — |
| 14 | ollama-cl/deepseek-v4-pro | skynet | reasoning | 3.67 / 5 | 65% | 13.9s | $0.0884 | free / sub | 15001 in / 76723 out | 10 × thinking-only |
| 15 | xai/grok-4.3-latest | skynet | general | 3.66 / 5 | 69% | 6.4s | $0.4872 | free / sub | 14980 in / 29485 out | — |
| 16 | xai/grok-3-mini | skynet | reasoning | 3.66 / 5 | 60% | 5.4s | $0.0183 | free / sub | 14662 in / 27846 out | — |
| 17 | claude-haiku | claude-cli | fast | 3.66 / 5 | 66% | 26.1s | $0.7821 | $1.1096 wire | 30281 in / 150358 out | — |
| 18 | xai/grok-4-1-fast-reasoning | skynet | reasoning | 3.63 / 5 | 60% | 5.9s | $0.0180 | free / sub | 14980 in / 29918 out | — |
| 19 | minimax-m2.7-highspeed | skynet | reasoning | 3.57 / 5 | 63% | 27.5s | $0.0272 | free / sub | 10920 in / 65351 out | 9 × thinking-only |
| 20 | claude-opus | claude-cli | reasoning | 3.57 / 5 | 66% | 13.9s | $0.4895 | $0.0326 wire | 29835 in / 560 out | — |
| 21 | minimax-m3 | skynet | reasoning | 3.54 / 5 | 54% | 16.6s | $0.0794 | free / sub | 16380 in / 62033 out | 9 × thinking-only |
| 22 | codex-gpt-5.4-mini | codex-cli | fast | 3.46 / 5 | 63% | 11.2s | $0.2190 | free / sub | 598486 in / 32888 out | — |
| 23 | claude-sonnet | claude-cli | general | 3.4 / 5 | 66% | 13.9s | $0.0982 | $0.0327 wire | 29835 in / 579 out | — |
| 24 | zai-coding/glm-4.7 | skynet | general | 3.31 / 5 | 51% | 45.1s | $0.2658 | free / sub | 9699 in / 118188 out | — |
| 25 | ollama-cl/kimi-k2.5 | skynet | general | 2.46 / 5 | 29% | 19.8s | $0.2022 | free / sub | 10885 in / 88935 out | 20 × thinking-only |
| 26 | mistral/mistral-large | skynet | general | 1.83 / 5 | 37% | 5.3s | $0.0079 | free / sub | 5586 in / 3413 out | — |
| 27 | qwen3.6:27b | skynet | general | 0.82 / 5 | 16% | 135.4s | $0.0652 | free / sub | 14244 in / 103865 out | 29 × thinking-only |
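The SCORE, PASS % and MED WALL columns can be reproduced from raw trial records along these lines. This is a minimal sketch, not the benchmark's actual code: the `Trial` record, every field name other than `passes_key_criteria`, and the `leaderboard` helper are assumptions made for illustration, and it assumes MED WALL is a median over a model's trials.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class Trial:
    # Hypothetical record layout; only passes_key_criteria is named in the text.
    model: str
    score: float               # judge score on the 0-5 scale
    passes_key_criteria: bool  # the judge's pass/fail verdict
    wall_seconds: float        # end-to-end latency, reasoning trace included


def leaderboard(trials, min_trials=4):
    """Aggregate trials per model and sort by mean judge score, best first."""
    by_model = {}
    for t in trials:
        by_model.setdefault(t.model, []).append(t)

    rows = []
    for model, ts in by_model.items():
        if len(ts) < min_trials:
            continue  # models with too few completed trials are left out
        rows.append({
            "model": model,
            "score": round(mean(t.score for t in ts), 2),
            "pass_pct": round(100 * sum(t.passes_key_criteria for t in ts) / len(ts)),
            "med_wall": round(median(t.wall_seconds for t in ts), 1),
        })
    rows.sort(key=lambda r: r["score"], reverse=True)
    return rows
```

A thinking-only trial would enter this calculation as a score of 0 with `passes_key_criteria` set to false, which is why a handful of them can pull a model's mean down sharply.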
Quality per dollar
Same models, sorted by score, then by 1P-EST cost (the apples-to-apples public-API rate). Read it as a diagonal: the top of the list is the bargain tier (cheap, high score); at the bottom you pay 10–50× more for 0.3 score points. Three rules of thumb from the data:
- For most agent tasks (structured output, summarization, simple reasoning), the cheap tier matches or beats the expensive tier at a fraction of the cost.
- For long-context creative prose and dense code generation, the more expensive models do pull ahead, but usually by less than the cost ratio suggests.
- "Thinking-only" answers from reasoning models (the model thought but never printed content) are a hidden cost: the trial timed out, the user got no answer, and the score is 0. The cheap tier has fewer of these because it has less reasoning budget to burn.
| MODEL | SCORE | 1P-EST | $ / 1.0 SCORE POINT (1P) |
|---|---|---|---|
| claude-fable | 4.23 | $0.4903 | $0.1159 |
| grok-build-0.1 | 4.17 | $0.2771 | $0.0665 |
| grok-4-1-fast-direct | 3.94 | $0.0066 | $0.0017 |
| grok-4.3-direct | 3.77 | $0.1525 | $0.0404 |
| claude-haiku | 3.66 | $0.7821 | $0.2137 |
| claude-opus | 3.57 | $0.4895 | $0.1371 |
| claude-sonnet | 3.40 | $0.0982 | $0.0289 |
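The last column is the 1P-EST cost divided by the mean score. Below is a small sketch of that arithmetic using three rows from the table. The `cost_per_point` helper is a hypothetical name, and published values may differ in the last digit if the site divides by unrounded scores.

```python
def cost_per_point(score: float, cost_usd: float) -> float:
    """Dollars spent per 1.0 point of mean judge score."""
    if score <= 0:
        raise ValueError("score must be positive to compute a cost ratio")
    return cost_usd / score


# (model, mean score, 1P-EST cost in USD), copied from the table above
rows = [
    ("claude-fable", 4.23, 0.4903),
    ("grok-4-1-fast-direct", 3.94, 0.0066),
    ("claude-sonnet", 3.4, 0.0982),
]

for model, score, cost in rows:
    print(f"{model}: ${cost_per_point(score, cost):.4f}")
# claude-fable: $0.1159
# grok-4-1-fast-direct: $0.0017
# claude-sonnet: $0.0289
```

A lower value means more score per dollar, so on this metric grok-4-1-fast-direct is the cheapest of the three by a wide margin.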
Per-task leaderboards
The leader changes by task. Click any task on the /tasks/ page to see every model's responses.
- 1 ollama-cl/minimax-m2.5 5.0
- 2 minimax-m2.7 5.0
- 3 mistral/devstral 5.0

- 1 mistral/mistral-large 5.0
- 2 mistral/mistral-medium 4.0
- 3 ollama-cl/nemotron-3-ultra 3.4

- 1 minimax-m2.7-highspeed 5.0
- 2 zai-coding/glm-5.2 5.0
- 3 zai-coding/glm-5 5.0

- 1 zai-coding/glm-5 2.0
- 2 claude-fable 2.0
- 3 claude-haiku 2.0

- 1 minimax-m3 5.0
- 2 ollama-cl/minimax-m2.5 5.0
- 3 minimax-m2.7 5.0

- 1 minimax-m3 5.0
- 2 ollama-cl/minimax-m2.5 5.0
- 3 minimax-m2.7 5.0

- 1 claude-fable 5.0
- 2 claude-haiku 5.0
- 3 zai-coding/glm-5 4.8
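A per-task top three like the lists above can be derived by averaging each model's scores within a task and keeping the best few. This is a minimal sketch, assuming trial records are available as `(task, model, score)` tuples. The function name, the tuple layout, and the task and model labels in the usage lines are hypothetical, and ties are left in first-seen order, which may not match how the site breaks them.

```python
from collections import defaultdict
from statistics import mean


def top_per_task(trials, k=3):
    """Return {task: [(model, mean_score), ...]} with the k best models per task."""
    scores = defaultdict(list)
    for task, model, score in trials:
        scores[(task, model)].append(score)

    per_task = defaultdict(list)
    for (task, model), values in scores.items():
        per_task[task].append((model, round(mean(values), 1)))

    # sorted() is stable, so models with equal means keep their first-seen order.
    return {
        task: sorted(rows, key=lambda row: row[1], reverse=True)[:k]
        for task, rows in per_task.items()
    }


# Hypothetical records: two trials for model-a, one each for the others.
example = [
    ("task-x", "model-a", 5), ("task-x", "model-a", 4),
    ("task-x", "model-b", 5), ("task-x", "model-c", 1),
    ("task-x", "model-d", 3),
]
print(top_per_task(example)["task-x"])
```

Because the ranking is recomputed per task, a model that sits mid-table overall can still lead an individual task, which is the pattern the lists above show.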
Use-case pairing
Which model to use for which kind of work, distilled from the data. This is the "what pairs with what" cheat sheet.
- 1 ollama-cl/minimax-m2.5 — 5.0/5
- 2 minimax-m2.7 — 5.0/5
- 3 mistral/devstral — 5.0/5

- 1 mistral/mistral-large — 5.0/5
- 2 mistral/mistral-medium — 4.0/5
- 3 ollama-cl/nemotron-3-ultra — 3.4/5

- 1 minimax-m2.7-highspeed — 5.0/5
- 2 zai-coding/glm-5.2 — 5.0/5
- 3 zai-coding/glm-5 — 5.0/5

- 1 zai-coding/glm-5 — 2.0/5
- 2 claude-fable — 2.0/5
- 3 claude-haiku — 2.0/5

- 1 minimax-m3 — 5.0/5
- 2 ollama-cl/minimax-m2.5 — 5.0/5
- 3 minimax-m2.7 — 5.0/5

- 1 minimax-m3 — 5.0/5
- 2 ollama-cl/minimax-m2.5 — 5.0/5
- 3 minimax-m2.7 — 5.0/5

- 1 claude-fable — 5.0/5
- 2 claude-haiku — 5.0/5
- 3 zai-coding/glm-5 — 4.8/5