empirical llm benchmark · run 2026-07-05

Which model
is actually best
at which thing?

7 tasks. 23+ models. 2 reps per cell. Real measurements of cost, latency, and quality — cross-judged by a model that is never in the trial set. No vibes, no leaderboard clout, no "feels like" — just numbers and what they mean.


TL;DR — the three findings

01
No single model wins everywhere. The leaderboard changes by task. Coding is one leaderboard; structured JSON is another; creative prose is a third. Pair your model to the workload.
02
The cost-quality frontier is brutally non-linear. The top two models on most tasks cost $0.0000 per trial (cheap tier); the more expensive models win by 0.3 points on quality, which is rarely worth 10–50x the cost.
03
Reasoning models burn budget on thinking, not answering. Several "score 0" results were models whose 1800-token reasoning trace ran out of room before they produced an actual answer. The token budget, not the model, is the constraint.
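Finding 02 can be checked mechanically by asking which models sit on the cost-quality frontier, meaning no other model is both cheaper and better. The sketch below is not part of the benchmark harness. It is a minimal illustration that uses overall SCORE and 1P-EST values copied from six of the leaderboard rows on this page (overall numbers, not per-task ones), and the names Entry and pareto_frontier are hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float     # mean judge score, 0-5 (SCORE column)
    cost_usd: float  # pay-as-you-go estimate (1P-EST column)


def pareto_frontier(entries):
    """Keep only entries that no cheaper-or-equal entry matches or beats on score."""
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, higher score first.
    for entry in sorted(entries, key=lambda e: (e.cost_usd, -e.score)):
        if entry.score > best_score:
            frontier.append(entry)
            best_score = entry.score
    return frontier


# Subset of leaderboard rows that report a 1P-EST value.
ENTRIES = [
    Entry("claude-fable", 4.23, 0.4903),
    Entry("grok-build-0.1", 4.17, 0.2771),
    Entry("mistral/mistral-medium", 3.97, 0.1476),
    Entry("grok-4-1-fast-direct", 3.94, 0.0066),
    Entry("zai-coding/glm-5", 3.91, 0.2292),
    Entry("codex-gpt-5.4", 3.74, 1.7171),
]

for entry in pareto_frontier(ENTRIES):
    print(f"{entry.model:28s} {entry.score:.2f}  ${entry.cost_usd:.4f}")
```

Models that drop out of the frontier (here zai-coding/glm-5 and codex-gpt-5.4) are ones where a cheaper model in this subset scores at least as well. Each step along the frontier shows how much extra cost buys the next increment in score.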

leaderboard (mean judge score, 0–5)

#	MODEL	PROVIDER	SCORE	PASS %	WALL (s)	1P-EST	SUBSCRIPTION
1	claude-fable	claude-cli	4.23 / 5	80%	19.3	$0.4903	$0.0327 wire
2	grok-build-0.1	xai-direct	4.17 / 5	74%	37.5	$0.2771	$0.3512 wire
3	ollama-cl/nemotron-3-ultra	skynet	4.00 / 5	69%	133.8	—	free / sub
4	mistral/mistral-medium	skynet	3.97 / 5	74%	5.9	$0.1476	free / sub
5	grok-4-1-fast-direct	xai-direct	3.94 / 5	69%	5.6	$0.0066	$0.0741 wire
6	zai-coding/glm-5	skynet	3.91 / 5	71%	37.5	$0.2292	free / sub
7	zai-coding/glm-5.2	skynet	3.86 / 5	66%	31.4	$0.1979	free / sub
8	mistral/devstral	skynet	3.83 / 5	71%	6.3	—	free / sub
9	ollama-cl/minimax-m2.5	skynet	3.77 / 5	66%	15.7	—	free / sub
10	grok-4.3-direct	xai-direct	3.77 / 5	63%	6.0	$0.1525	$0.0821 wire
11	codex-gpt-5.4	codex-cli	3.74 / 5	69%	15.0	$1.7171	free / sub
12	minimax-m2.7	skynet	3.71 / 5	69%	37.6	$0.0519	free / sub

Full table: /results/ · 1P-EST is the equivalent pay-as-you-go cost at each provider's public-API rate; SUBSCRIPTION is what the user actually paid (skynet/chatgpt-plus/anthropic-max = $0.00 wire).
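The SCORE and PASS % columns are per-model aggregates over individual trials. The sketch below shows one way such a table could be derived from (model, judge score) records. It is an assumption-laden illustration, not the harness code: the page does not state the pass threshold, so PASS_THRESHOLD = 4.0 is a placeholder, and the function name leaderboard and the record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# ASSUMPTION: the source does not say what judge score counts as a "pass".
PASS_THRESHOLD = 4.0


def leaderboard(trials, pass_threshold=PASS_THRESHOLD):
    """Aggregate (model, score) trial records into rows sorted by mean score."""
    by_model = defaultdict(list)
    for model, score in trials:
        by_model[model].append(score)

    rows = []
    for model, scores in by_model.items():
        passed = sum(1 for s in scores if s >= pass_threshold)
        rows.append({
            "model": model,
            "score": round(mean(scores), 2),               # mean judge score, 0-5
            "pass_pct": round(100 * passed / len(scores)),  # share of passing trials
        })
    rows.sort(key=lambda row: row["score"], reverse=True)
    return rows


# Made-up trial records, for illustration only.
SAMPLE_TRIALS = [
    ("model-a", 5.0), ("model-a", 4.0), ("model-a", 3.0), ("model-a", 0.0),
    ("model-b", 5.0), ("model-b", 5.0),
]

for row in leaderboard(SAMPLE_TRIALS):
    print(row)
```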

footnote · pricing

The total_cost_usd column shows what the harness captured at the wire level from each provider (xAI's cost_in_usd_ticks ÷ 10^10, OpenAI/Codex's total_cost_usd, Anthropic's costUSD, skynet's per-model pricing × usage). The 1P-EST column is an apples-to-apples estimate at each provider's public-API list price; see first_party_pricing.json for the rates. Use 1P-EST when comparing across providers that route through a metered proxy: it removes the subscription/negotiated-rate discount layer. Rates are approximate as of early July 2026.
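The two cost figures described above can be sketched as two small functions: one that normalizes each provider's wire-level cost signal to USD, and one that prices token usage at list rates. This is a hedged illustration only. The field names cost_in_usd_ticks, total_cost_usd and costUSD and the 10^10 divisor come from the footnote, but the flat payload layout, the provider keys, the rate keys (input_per_mtok, output_per_mtok) and the example rates are assumptions; the actual schema of first_party_pricing.json is not shown on this page.

```python
XAI_TICKS_PER_USD = 10**10  # from the footnote: cost_in_usd_ticks ÷ 10^10


def wire_cost_usd(provider: str, payload: dict) -> float:
    """Convert a provider's wire-level cost signal to USD.

    ASSUMPTION: `payload` is a flat dict holding the provider's cost field.
    """
    if provider == "xai":
        return payload["cost_in_usd_ticks"] / XAI_TICKS_PER_USD
    if provider == "codex":
        return float(payload["total_cost_usd"])
    if provider == "anthropic":
        return float(payload["costUSD"])
    raise ValueError(f"unknown provider: {provider!r}")


def first_party_estimate(rates: dict, input_tokens: int, output_tokens: int) -> float:
    """Price token usage at list rates quoted in USD per million tokens.

    ASSUMPTION: hypothetical rate keys, not the real first_party_pricing.json schema.
    """
    return (
        input_tokens * rates["input_per_mtok"]
        + output_tokens * rates["output_per_mtok"]
    ) / 1_000_000


# Placeholder rates, for illustration only.
example_rates = {"input_per_mtok": 3.0, "output_per_mtok": 15.0}
print(wire_cost_usd("xai", {"cost_in_usd_ticks": 741_000_000}))
print(first_party_estimate(example_rates, input_tokens=1000, output_tokens=500))
```

Keeping the two functions separate mirrors the distinction in the footnote: the wire cost is whatever the provider reported, while the estimate is recomputed from usage and does not depend on subscription or negotiated discounts.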

Models on subscriptions (Anthropic Max, ChatGPT Plus) report $0.0000 at the wire because the user pays a flat rate, not per-call. The 1P column shows the equivalent pay-as-you-go cost if you were billing the same traffic to a credit card instead.

how this was built

One Python harness ran overnight, 12:00 AM to 6:00 AM EDT, across seven task types and twenty-three models. Each (model × task) cell ran twice and was scored 0–5 by a judge model that is never the trial model. Cost is captured from each provider's wire-format cost signal (xAI's cost_in_usd_ticks, Codex's total_cost_usd, Anthropic's costUSD, skynet's per-model pricing). Latency is wall-clock, end to end. The judge sees the original task prompt, the criteria, and a 4000-char slice of the response; no model can see another model's response.
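The judging rules in this paragraph (an independent judge, and a bounded view of the response) can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness itself: the function names pick_judge and build_judge_input are hypothetical, the idea of a judge pool is an assumption, and the page does not say which 4000 characters are sliced, so taking the leading 4000 is a guess.

```python
JUDGE_SLICE_CHARS = 4000  # from the text: the judge sees a 4000-char slice


def pick_judge(trial_model: str, judge_pool: list[str]) -> str:
    """Return the first candidate judge that is not the model under trial."""
    for judge in judge_pool:
        if judge != trial_model:
            return judge
    raise ValueError(f"no independent judge available for {trial_model!r}")


def build_judge_input(task_prompt: str, criteria: str, response: str) -> dict:
    """Assemble what the judge sees: prompt, criteria, and a bounded response."""
    return {
        "task_prompt": task_prompt,
        "criteria": criteria,
        # ASSUMPTION: the slice is the leading 4000 characters.
        "response": response[:JUDGE_SLICE_CHARS],
    }


judge = pick_judge("judge-a", ["judge-a", "judge-b"])
payload = build_judge_input("Write a haiku.", "Must be 5-7-5.", "x" * 5000)
print(judge, len(payload["response"]))
```

Only one response is ever placed in the judge input, which is one simple way to guarantee that no model's output is exposed to another model.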

Code, raw data, scorecard JSON, and the judge's criteria for every task are in /raw/. The repo at herman/model-benchmark-site includes the full pipeline.
