empirical llm benchmark · run 2026-07-05

Which model
is actually best
at which thing?

7 tasks. 23+ models. 2 reps per cell. Real measurements of cost, latency, and quality — cross-judged by a model that is never in the trial set. No vibes, no leaderboard clout, no "feels like" — just numbers and what they mean.

TL;DR — the three findings
  1. No single model wins everywhere. The leaderboard changes by task. Coding is one leaderboard; structured JSON is another; creative prose is a third. Match your model to the workload.
  2. The cost-quality frontier is brutally non-linear. On most tasks the top two models cost $0.0000 per trial (cheap tier); the more expensive models win by 0.3 points on quality, which is rarely worth 10–50x the cost.
  3. Reasoning models burn budget on thinking, not answering. Several "score 0" results came from models whose 1800-token reasoning trace ran out of room before they produced an actual answer. The token budget, not the model, is the constraint.
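A minimal sketch of how a "score 0" trial of the kind described in finding 3 could be flagged automatically. The `Trial` fields, the `starved_by_reasoning` name, and the 95% threshold are all hypothetical; the harness's real record format is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # All field names are illustrative assumptions, not the harness's schema.
    model: str
    reasoning_tokens: int    # tokens spent on the reasoning trace
    answer_text: str         # final answer text, empty if none was produced
    max_output_tokens: int   # output budget the trial was run with


def starved_by_reasoning(trial: Trial, slack: float = 0.95) -> bool:
    """True when there is no answer and reasoning used nearly all of the budget."""
    no_answer = not trial.answer_text.strip()
    budget_spent = trial.reasoning_tokens >= slack * trial.max_output_tokens
    return no_answer and budget_spent


if __name__ == "__main__":
    cut_off = Trial("example-model", 1800, "", 1800)
    answered = Trial("example-model", 400, "42", 1800)
    print(starved_by_reasoning(cut_off), starved_by_reasoning(answered))
```

A flag like this would let truncated trials be reported separately from trials where the model answered and the answer was simply wrong.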
leaderboard (mean judge score, 0–5)
#   MODEL                       PROVIDER    SCORE     PASS %  WALL (s)  1P-EST   SUBSCRIPTION
1   claude-fable                claude-cli  4.23 / 5  80%     19.3      $0.4903  $0.0327 wire
2   grok-build-0.1              xai-direct  4.17 / 5  74%     37.5      $0.2771  $0.3512 wire
3   ollama-cl/nemotron-3-ultra  skynet      4 / 5     69%     133.8     -        free / sub
4   mistral/mistral-medium      skynet      3.97 / 5  74%     5.9       $0.1476  free / sub
5   grok-4-1-fast-direct        xai-direct  3.94 / 5  69%     5.6       $0.0066  $0.0741 wire
6   zai-coding/glm-5            skynet      3.91 / 5  71%     37.5      $0.2292  free / sub
7   zai-coding/glm-5.2          skynet      3.86 / 5  66%     31.4      $0.1979  free / sub
8   mistral/devstral            skynet      3.83 / 5  71%     6.3       -        free / sub
9   ollama-cl/minimax-m2.5      skynet      3.77 / 5  66%     15.7      -        free / sub
10  grok-4.3-direct             xai-direct  3.77 / 5  63%     6.0       $0.1525  $0.0821 wire
11  codex-gpt-5.4               codex-cli   3.74 / 5  69%     15.0      $1.7171  free / sub
12  minimax-m2.7                skynet      3.71 / 5  69%     37.6      $0.0519  free / sub

Full table: /results/ · 1P-EST is the equivalent pay-as-you-go cost at each provider's public-API rate; SUBSCRIPTION is what the user actually paid (skynet/chatgpt-plus/anthropic-max = $0.00 wire).
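The cost-quality trade-off in finding 2 can be made concrete with a small Pareto-frontier calculation. The sketch below is illustrative only: it uses the overall leaderboard rows that report a 1P-EST (not per-task data), and the `pareto_frontier` helper is a hypothetical name, not part of the harness.

```python
# (model, mean judge score, 1P-EST in USD), copied from the leaderboard rows
# that report a 1P-EST. Rows without a 1P-EST are left out.
ROWS = [
    ("claude-fable", 4.23, 0.4903),
    ("grok-build-0.1", 4.17, 0.2771),
    ("mistral/mistral-medium", 3.97, 0.1476),
    ("grok-4-1-fast-direct", 3.94, 0.0066),
    ("zai-coding/glm-5", 3.91, 0.2292),
    ("zai-coding/glm-5.2", 3.86, 0.1979),
    ("grok-4.3-direct", 3.77, 0.1525),
    ("codex-gpt-5.4", 3.74, 1.7171),
    ("minimax-m2.7", 3.71, 0.0519),
]


def pareto_frontier(rows):
    """Keep each row that scores higher than every cheaper row."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a cost tie, higher score first.
    for model, score, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best_score:
            frontier.append((model, score, cost))
            best_score = score
    return frontier


if __name__ == "__main__":
    for model, score, cost in pareto_frontier(ROWS):
        print(f"{model:24s} score={score:.2f}  1P-EST=${cost:.4f}")
```

On these rows the frontier is grok-4-1-fast-direct, mistral/mistral-medium, grok-build-0.1, and claude-fable. Moving from the cheapest frontier model to the top-ranked one buys 0.29 points of score for roughly 74 times the 1P-EST.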

footnote · pricing

The total_cost_usd column shows what the harness captured at the wire level from each provider (xAI's cost_in_usd_ticks ÷ 10^10, OpenAI/Codex's total_cost_usd, Anthropic's costUSD, skynet's per-model pricing × usage). The 1P-EST column is an apples-to-apples estimate at each provider's public-API list price; see first_party_pricing.json for the rates. Use 1P-EST when comparing across providers that route through a metered proxy: it removes the subscription/negotiated-rate discount layer. Rates are approximate as of early July 2026.

Models on subscriptions (Anthropic Max, ChatGPT Plus) report $0.0000 at the wire because the user pays a flat rate, not per-call. The 1P column shows the equivalent pay-as-you-go cost if you were billing the same traffic to a credit card instead.
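The two cost figures can be sketched in a few lines. This is an illustration under stated assumptions: the tick divisor (10^10) comes from the footnote above, but the layout of first_party_pricing.json is not shown on this page, so the `input_per_mtok` / `output_per_mtok` keys, the function names, and the example rates are all hypothetical placeholders.

```python
TICKS_PER_USD = 10**10  # from the footnote: cost_in_usd_ticks ÷ 10^10


def ticks_to_usd(ticks: int) -> float:
    """Convert an xAI-style tick count into US dollars."""
    return ticks / TICKS_PER_USD


def first_party_estimate(input_tokens: int, output_tokens: int, rate: dict) -> float:
    """Estimate pay-as-you-go cost from token counts.

    `rate` is assumed to hold USD prices per million tokens under the
    hypothetical keys "input_per_mtok" and "output_per_mtok".
    """
    total = (input_tokens * rate["input_per_mtok"]
             + output_tokens * rate["output_per_mtok"])
    return total / 1_000_000


if __name__ == "__main__":
    # Placeholder rates for illustration only, not any provider's real prices.
    example_rate = {"input_per_mtok": 2.0, "output_per_mtok": 10.0}
    print(f"wire:   ${ticks_to_usd(741_000_000):.4f}")
    print(f"1P-EST: ${first_party_estimate(1_000, 500, example_rate):.4f}")
```

A subscription-backed call would report $0.0000 at the wire while the same token counts still yield a non-zero estimate from `first_party_estimate`, which is why the two columns can disagree.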

how this was built

One Python harness, run overnight from 12:00 AM to 6:00 AM EDT, covering seven task types and twenty-three models. Each (model × task) cell ran twice and was scored 0–5 by a judge model that is never the trial model. Cost is captured from each provider's wire-format cost signal (xAI's cost_in_usd_ticks, Codex's total_cost_usd, Anthropic's costUSD, skynet's per-model pricing). Latency is wall-clock end-to-end. The judge sees the original task prompt, the criteria, and a 4000-char slice of the response; no model can see another model's response.
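The judging rules above can be sketched as follows. This is not the harness's actual code: the function names and the dict layout are hypothetical, and taking the first 4000 characters is an assumption, since the text only says "a 4000-char slice".

```python
from statistics import mean

JUDGE_SLICE_CHARS = 4000  # from the text: the judge sees a 4000-char slice


def check_judge(judge_model: str, trial_model: str) -> None:
    """Enforce the rule that a model never judges its own output."""
    if judge_model == trial_model:
        raise ValueError(f"{judge_model} cannot judge its own response")


def build_judge_input(task_prompt: str, criteria: str, response: str) -> dict:
    """Bundle what the judge sees: prompt, criteria, and a truncated response.

    Assumption: the slice is the first 4000 characters of the response.
    """
    return {
        "task_prompt": task_prompt,
        "criteria": criteria,
        "response": response[:JUDGE_SLICE_CHARS],
    }


def cell_score(rep_scores: list) -> float:
    """Average the judge scores for one (model x task) cell."""
    if not rep_scores:
        raise ValueError("a cell needs at least one scored rep")
    for score in rep_scores:
        if not 0 <= score <= 5:
            raise ValueError(f"judge score out of range: {score}")
    return mean(rep_scores)


if __name__ == "__main__":
    check_judge("judge-model", "trial-model")
    payload = build_judge_input("Write a parser.", "Correctness, clarity.", "x" * 5000)
    print(len(payload["response"]), cell_score([4, 5]))
```

With two reps per cell, one low score moves the cell mean by half its distance from the other rep, so small gaps between neighbouring leaderboard rows are best read with that in mind.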

Code, raw data, scorecard JSON, and the judge's criteria for every task are in /raw/. The repo at herman/model-benchmark-site includes the full pipeline.