Which model is actually best at which thing?
7 tasks. 23+ models. 2 reps per cell. Real measurements of cost, latency, and quality — cross-judged by a model that is never in the trial set. No vibes, no leaderboard clout, no "feels like" — just numbers and what they mean.
- 01 No single model wins everywhere. The leaderboard changes by task: coding is one leaderboard, structured JSON is another, creative prose is a third. Match your model to the workload.
- 02 The cost-quality frontier is brutally non-linear. The top two on most tasks cost $0.0000 per trial (cheap tier); the more expensive models win by 0.3 points on quality, which is rarely worth 10–50x the cost.
- 03 Reasoning models burn budget on thinking, not answering. Several "score 0" results came from models whose 1800-token reasoning trace ran out of room before they produced an actual answer. The token budget, not the model, is the constraint.
| # | MODEL | PROVIDER | SCORE | PASS % | WALL (s) | 1P-EST | SUBSCRIPTION |
|---|---|---|---|---|---|---|---|
| 1 | claude-fable | claude-cli | 4.23 / 5 | 80% | 19.3 | $0.4903 | $0.0327 wire |
| 2 | grok-build-0.1 | xai-direct | 4.17 / 5 | 74% | 37.5 | $0.2771 | $0.3512 wire |
| 3 | ollama-cl/nemotron-3-ultra | skynet | 4.00 / 5 | 69% | 133.8 | — | free / sub |
| 4 | mistral/mistral-medium | skynet | 3.97 / 5 | 74% | 5.9 | $0.1476 | free / sub |
| 5 | grok-4-1-fast-direct | xai-direct | 3.94 / 5 | 69% | 5.6 | $0.0066 | $0.0741 wire |
| 6 | zai-coding/glm-5 | skynet | 3.91 / 5 | 71% | 37.5 | $0.2292 | free / sub |
| 7 | zai-coding/glm-5.2 | skynet | 3.86 / 5 | 66% | 31.4 | $0.1979 | free / sub |
| 8 | mistral/devstral | skynet | 3.83 / 5 | 71% | 6.3 | — | free / sub |
| 9 | ollama-cl/minimax-m2.5 | skynet | 3.77 / 5 | 66% | 15.7 | — | free / sub |
| 10 | grok-4.3-direct | xai-direct | 3.77 / 5 | 63% | 6.0 | $0.1525 | $0.0821 wire |
| 11 | codex-gpt-5.4 | codex-cli | 3.74 / 5 | 69% | 15.0 | $1.7171 | free / sub |
| 12 | minimax-m2.7 | skynet | 3.71 / 5 | 69% | 37.6 | $0.0519 | free / sub |
Full table: /results/ · 1P-EST is the equivalent pay-as-you-go cost at each provider's public-API rate; SUBSCRIPTION is what the user actually paid (skynet/chatgpt-plus/anthropic-max = $0.00 wire).
The total_cost_usd column shows what the harness captured at the wire level from each provider (xAI's cost_in_usd_ticks ÷ 10^10, OpenAI/Codex's total_cost_usd, Anthropic's costUSD, skynet's per-model pricing × usage). The 1P-EST column is an apples-to-apples estimate at each provider's public-API list price; see first_party_pricing.json for the rates. Use 1P-EST when comparing across providers that route through a metered proxy, because it removes the subscription/negotiated-rate discount layer. Rates are approximate as of early July 2026.
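As a rough illustration of the wire-level capture described above, here is a minimal Python sketch. The field names `cost_in_usd_ticks`, `total_cost_usd`, and `costUSD`, and the 10^10 divisor, come from the text; the function name `wire_cost_usd`, the flat `payload` dict, and the use of the table's provider labels as keys are assumptions, not the harness's actual code. The skynet case (pricing × usage) is left out because its pricing schema is not shown here.

```python
from typing import Optional

# From the text: xAI reports cost in "ticks", where 10^10 ticks = 1 USD.
TICKS_PER_USD = 10**10


def wire_cost_usd(provider: str, payload: dict) -> Optional[float]:
    """Return the wire-level cost in USD, or None when no cost signal is present.

    `payload` is a hypothetical flat dict holding the provider's cost field.
    """
    if provider == "xai-direct":
        ticks = payload.get("cost_in_usd_ticks")
        return None if ticks is None else ticks / TICKS_PER_USD
    if provider == "codex-cli":
        return payload.get("total_cost_usd")
    if provider == "claude-cli":
        return payload.get("costUSD")
    # Unknown provider or no cost field: report "no signal" rather than 0.
    return None


if __name__ == "__main__":
    # Illustrative tick value only, not a measured result.
    print(wire_cost_usd("xai-direct", {"cost_in_usd_ticks": 741_000_000}))
```

Returning `None` instead of `0.0` for a missing signal keeps "the provider reported nothing" distinct from "the call was free".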
Models on subscriptions (Anthropic Max, ChatGPT Plus) report $0.0000 at the wire because the user pays a flat rate rather than per call. The 1P-EST column shows the equivalent pay-as-you-go cost if you were billing the same traffic to a credit card instead.
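The pay-as-you-go equivalent can be sketched as token usage multiplied by list price. The example below is an assumption-heavy illustration: the schema of first_party_pricing.json is not shown here, so the `input_per_mtok` and `output_per_mtok` keys, the per-million-token unit, the model name, and the rates in the usage example are all hypothetical placeholders rather than real prices.

```python
from typing import Optional


def first_party_estimate(
    model: str,
    input_tokens: int,
    output_tokens: int,
    rates: dict,
) -> Optional[float]:
    """Estimate the list-price cost in USD for one trial.

    `rates` maps model name -> {"input_per_mtok": USD, "output_per_mtok": USD}.
    This layout is assumed for illustration only.
    Returns None when the model has no published rate (shown as a dash in the table).
    """
    rate = rates.get(model)
    if rate is None:
        return None
    input_cost = input_tokens * rate["input_per_mtok"] / 1_000_000
    output_cost = output_tokens * rate["output_per_mtok"] / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # Placeholder rates, not taken from first_party_pricing.json.
    demo_rates = {"example-model": {"input_per_mtok": 2.0, "output_per_mtok": 10.0}}
    print(first_party_estimate("example-model", 1000, 500, demo_rates))
    print(first_party_estimate("unlisted-model", 1000, 500, demo_rates))
```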
One Python harness, run overnight from 12:00 AM to 6:00 AM EDT, covering seven task types and twenty-three models. Each (model × task) cell ran twice and was scored 0–5 by a judge model that is never the trial model. Cost is captured from the provider's wire-format cost signal (xAI's cost_in_usd_ticks, Codex's total_cost_usd, Anthropic's costUSD, skynet's per-model pricing). Latency is wall-clock, end to end. The judge sees the original task prompt, the criteria, and a 4000-char slice of the response. No model can see another model's response.
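The judging setup described above could look roughly like the following Python sketch. The 4000-character limit, the 0–5 scale, the two reps per cell, and the rule that the judge differs from the trial model come from the text. The function names, the judge pool, taking the first 4000 characters as the slice, and averaging the two reps into a cell score are assumptions made for illustration.

```python
from statistics import mean

RESPONSE_SLICE_CHARS = 4000  # from the text: the judge sees a 4000-char slice


def pick_judge(trial_model: str, judge_pool: list[str]) -> str:
    """Pick the first judge that is not the model under test."""
    for judge in judge_pool:
        if judge != trial_model:
            return judge
    raise ValueError("no eligible judge: every judge is the trial model")


def build_judge_input(task_prompt: str, criteria: str, response: str) -> dict:
    """Assemble what the judge sees: prompt, criteria, and a truncated response.

    Assumption: the slice is the first 4000 characters of the response.
    """
    return {
        "task_prompt": task_prompt,
        "criteria": criteria,
        "response": response[:RESPONSE_SLICE_CHARS],
    }


def cell_score(rep_scores: list[float]) -> float:
    """Combine the reps for one (model, task) cell. Averaging is an assumption."""
    if not all(0 <= s <= 5 for s in rep_scores):
        raise ValueError("scores must be on the 0-5 scale")
    return mean(rep_scores)


if __name__ == "__main__":
    judge = pick_judge("model-a", ["model-a", "model-b"])
    judge_input = build_judge_input("Write a parser.", "Correctness", "x" * 5000)
    print(judge, len(judge_input["response"]), cell_score([4, 5]))
```

Because each judge input is built from a single response, nothing from another model's output can leak into the prompt.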
Code, raw data, scorecard JSON, and the judge's criteria for every task are in /raw/. The repo at herman/model-benchmark-site includes the full pipeline.