Methodology
The design choices
Why 7 tasks, not 50 or 200
Each (model × task × rep) cell costs money and time. The sweep is 23 models × 7 tasks × 2 reps = 322 trials, plus 322 judge calls. A larger grid would either overrun the overnight window or run up real money. Seven tasks were enough to cover the major capability axes (code, structured output, comprehension, prose, reasoning, planning) without spreading the budget too thin.
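As a quick check on that arithmetic, here is a minimal sketch that enumerates the grid. The model and task names are placeholders (the real registry lives in models.json and tasks.json), and `build_cells` is a hypothetical helper, not a function from the harness.

```python
from itertools import product

def build_cells(models, tasks, reps):
    """Enumerate every (model, task, rep) cell in the sweep."""
    return list(product(models, tasks, range(reps)))

# Placeholder names, used only to reproduce the grid size.
models = [f"model_{i}" for i in range(23)]
tasks = [f"task_{i}" for i in range(7)]
cells = build_cells(models, tasks, reps=2)

trial_calls = len(cells)   # one generation call per cell
judge_calls = len(cells)   # one judge call per cell
print(trial_calls, trial_calls + judge_calls)  # 322 644
```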
Why 2 reps, not 1 or 5
Variance estimation matters. Reasoning models in particular can produce different outputs at temperature 0 because of the way they format their reasoning trace. Two reps are enough to catch obvious flakes: a model that scores 5/5 once and 0/5 once has high variance, which is interesting in itself. Five reps would tighten the signal but cost 2.5× as much.
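A minimal sketch of how a two-rep flake check could look. The `(model, task, score)` tuple layout, the `flag_flaky` name, and the spread threshold of 3 are all assumptions for illustration; the harness may store and flag results differently.

```python
from collections import defaultdict

def flag_flaky(results, min_spread=3):
    """Return {(model, task): spread} for cells whose rep scores disagree.

    `results` is an iterable of (model, task, score) tuples, one per rep.
    A cell is flagged when max(score) - min(score) >= min_spread.
    """
    by_cell = defaultdict(list)
    for model, task, score in results:
        by_cell[(model, task)].append(score)
    flagged = {}
    for cell, scores in by_cell.items():
        spread = max(scores) - min(scores)
        if len(scores) > 1 and spread >= min_spread:
            flagged[cell] = spread
    return flagged

results = [
    ("model_a", "json_strict", 5), ("model_a", "json_strict", 0),
    ("model_b", "json_strict", 4), ("model_b", "json_strict", 4),
]
flaky = flag_flaky(results)
print(flaky)  # {('model_a', 'json_strict'): 5}
```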
Why temperature 0
Determinism where possible. Some providers ignore temperature for reasoning models (xAI's Grok reasoning models are nearly deterministic). For non-reasoning models, 0 gives the most reproducible comparison. Creative writing would be more interesting at higher temps but the comparison would be noisier.
Why the judge is never in the trial set
Self-bias is the silent killer of LLM benchmarks. A model is more likely to be generous to a response that "looks like itself." So the judge is minimax-m3 via skynet: a different model family from the trial models, and the same judge for every cell. This is the cross-judge pattern: always judge with a model from a different family than the ones you trial.
What the judge sees
The judge receives: (1) the task name, (2) the original task prompt, (3) the expected answer or criteria, and (4) a 4000-char slice of the model's response. The judge returns a JSON object with score (0–5), reason (1–2 sentences), and passes_key_criteria (boolean). The criteria are task-specific — a strict rubric per task, not a generic "is this good" check.
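The judge contract described above could be sketched roughly as follows. The payload key names and both function names are hypothetical; only the four inputs, the 4000-char slice, and the three output fields come from the text. The sketch also assumes the score is a plain number.

```python
import json

MAX_RESPONSE_CHARS = 4000  # slice size stated in the text

def build_judge_payload(task_name, task_prompt, expected, response):
    """Assemble the four items the judge sees (key names are hypothetical)."""
    return {
        "task": task_name,
        "prompt": task_prompt,
        "expected": expected,
        "response": response[:MAX_RESPONSE_CHARS],
    }

def parse_verdict(raw):
    """Parse the judge's JSON verdict and raise ValueError if it is malformed."""
    verdict = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(verdict, dict):
        raise ValueError("verdict must be a JSON object")
    score = verdict.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    if not isinstance(verdict.get("reason"), str):
        raise ValueError("reason must be a string")
    if not isinstance(verdict.get("passes_key_criteria"), bool):
        raise ValueError("passes_key_criteria must be a boolean")
    return verdict
```

Validating the verdict before it reaches the scorecard keeps a malformed judge reply from being silently recorded as a score.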
How cost is captured
Each provider returns cost on the wire in a different format. The harness normalizes them:
- xAI returns `cost_in_usd_ticks` in the response. Divide by 10^10 to get dollars. (Early versions of the harness divided by 10^8, which was 100× wrong.)
- Codex / OpenAI returns `total_cost_usd` directly in the synthesized envelope. ChatGPT Plus plans don't bill per call, so this is 0 for the subscription CLI path; we still capture it for cross-platform comparison.
- Anthropic returns `costUSD` in the JSON envelope from `claude -p --output-format json`. This is API-equivalent cost, not actual billing (Max plans don't bill per call).
- skynet routes through LiteLLM, which returns standard `usage` with token counts. Cost per token comes from the per-model `pricing` in the response, then we sum prompt × input_price + completion × output_price. For the cheap tier, this rounds to ~$0.
When cost is $0.0000 in the table, that means the model is on a subscription (Anthropic Max, ChatGPT Plus) or routed through a flat-rate tier (skynet's litellm). The "free / sub" badge marks these.
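The per-provider normalization might look something like the sketch below. The field names `cost_in_usd_ticks`, `total_cost_usd`, `costUSD`, `usage`, and `pricing` come from the text; where they sit in each response, the `input_price`/`output_price` keys, and all function names are assumptions.

```python
TICKS_PER_USD = 10**10  # per the text: ticks / 10^10 = dollars

def xai_cost_usd(response):
    """xAI: convert integer ticks to dollars."""
    return response["cost_in_usd_ticks"] / TICKS_PER_USD

def codex_cost_usd(envelope):
    """Codex / OpenAI: already in dollars; 0 on the subscription path."""
    return float(envelope.get("total_cost_usd", 0.0))

def anthropic_cost_usd(envelope):
    """Anthropic: API-equivalent dollars from the CLI's JSON envelope."""
    return float(envelope.get("costUSD", 0.0))

def litellm_cost_usd(response):
    """LiteLLM: token counts times per-token prices (assumed key layout)."""
    usage = response["usage"]
    pricing = response["pricing"]
    return (usage["prompt_tokens"] * pricing["input_price"]
            + usage["completion_tokens"] * pricing["output_price"])

# 25,000,000 ticks is a quarter of a cent.
print(xai_cost_usd({"cost_in_usd_ticks": 25_000_000}))  # 0.0025
```

Dividing by 10^8 instead would report the same call as $0.25, which is the 100× error mentioned above.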
Task selection criteria
Each task was chosen to test a specific capability dimension:
| TASK | CAPABILITY | JUDGE LOOKS FOR |
|---|---|---|
| `code_debug` | Read code, find a real bug, propose a fix | Points at the right line + explains root cause + fix is correct |
| `code_gen_long` | Write a complete, type-hinted, documented Python class | Has all required methods, type hints, `threading.Lock`, `__main__` block |
| `json_strict` | Produce strictly valid JSON from a complex spec | Parses, all required keys, correct types, all 4 objects |
| `summarize` | Compress a 1500-word technical document into 5 bullets | Exactly 5 bullets, ≤20 words each, captures 5 required points |
| `creative_write` | Write a 300-word short-short with named character, scene, payoff | 270-330 words, first-person present, named, ends on dialogue, has 3 keywords |
| `reasoning_multistep` | Multi-step word problem with 4 reasoning hops | All 4 steps shown, final answer is 11:00 AM |
| `agentic_prompt` | Given a task, produce a plan + first concrete action | Valid JSON, plan has 4-6 items, `first_action` is runnable code/command |
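Some of these criteria are countable, so they could in principle be checked mechanically alongside the judge. The sketch below covers only the format half of the summarize rubric (exactly 5 bullets, at most 20 words each). It is an illustration, not something the harness is known to do; the function name and the bullet-marker convention are assumptions.

```python
import re

# Assumed bullet markers: '-', '*', or '•' followed by whitespace.
BULLET = re.compile(r"^\s*[-*•]\s+(.*)$")

def check_summary_format(text, n_bullets=5, max_words=20):
    """Return a list of format problems; an empty list means the format passes.

    Whether the bullets capture the required points still needs the judge.
    """
    bullets = []
    for line in text.splitlines():
        match = BULLET.match(line)
        if match:
            bullets.append(match.group(1))
    problems = []
    if len(bullets) != n_bullets:
        problems.append(f"expected {n_bullets} bullets, found {len(bullets)}")
    for i, bullet in enumerate(bullets, start=1):
        words = len(bullet.split())
        if words > max_words:
            problems.append(f"bullet {i} has {words} words (max {max_words})")
    return problems
```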
Known limitations
- 2 reps is not enough for tight confidence intervals. Variance estimates are rough. A model scoring 3.0 vs 3.2 across 14 trials is not a meaningful difference; across 280 trials it would be.
- The judge is a single model. A second cross-judge (e.g. a Claude plus minimax-m3 consensus) would catch judge-bias artifacts. A future run should do this.
- Tasks are all single-turn. Multi-turn agent evaluation (tool use, memory, follow-up questions) is a different problem space entirely.
- No vision / image tasks. All trials are text-only. Image and video model evaluation needs a different harness.
- Score 0 ≠ model is bad. Several "0" results were models whose 1800-token reasoning trace ran out of room before producing the answer. The right interpretation is: this model burned its budget thinking and didn't deliver. That's a real capability dimension but it's not "the model is dumb."
- Spec ambiguity is a real failure mode. The first three iterations of the `code_debug` task had spec bugs that I caught only because a model flagged them. A future harness should include a spec sanity-check pass before the full sweep.
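To make the first limitation above concrete, here is a rough back-of-envelope sketch of the margin of error on the gap between two models' mean scores. It assumes independent trials and a per-trial standard deviation of 1.0 on the 0-5 scale; that value is hypothetical, not measured from this sweep, and trials from different tasks are not really identically distributed.

```python
import math

def diff_margin(sd, n_per_model, z=1.96):
    """Approximate 95% margin for the difference between two mean scores.

    Assumes independent trials and the same per-trial standard deviation
    `sd` for both models.
    """
    return z * sd * math.sqrt(2 / n_per_model)

ASSUMED_SD = 1.0  # hypothetical per-trial spread, not a measured value

for n in (14, 280):
    print(n, round(diff_margin(ASSUMED_SD, n), 2))
```

Under that assumed spread the margin is about 0.74 at 14 trials and about 0.17 at 280, so a 0.2-point gap only clears the noise at the larger sample. A larger real spread would widen both margins.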
Reproducing this
The full pipeline is in this site's GitLab repo and at /var/lib/hermes/model-benchmark/ on the host. The harness is a single Python file (harness.py); the task definitions are in tasks.json; the model registry is in models.json. To reproduce:
- Set `SKYNET_API_KEY` in your env, ensure OAuth creds for xAI/Claude/Codex are in `~/.hermes/auth.json`
- Run `python3 harness.py --models <subset> --tasks <subset> --reps N --parallel 6`
- Run `python3 scorecard.py` to generate the scorecard
- Run `python3 sync_to_site.py` to copy data into the Hugo site
- Run `hugo --minify` and deploy the `public/` directory to GitLab Pages
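Before starting a sweep, the prerequisites from the first step could be checked with a small script like the one below. `preflight` is a hypothetical helper, not part of the harness, and it only checks that the auth file exists, not that the credentials inside it are valid.

```python
import os
from pathlib import Path

def preflight(env=None, auth_path="~/.hermes/auth.json"):
    """Return a list of missing prerequisites; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("SKYNET_API_KEY"):
        problems.append("SKYNET_API_KEY is not set")
    if not Path(auth_path).expanduser().is_file():
        problems.append(f"{auth_path} not found")
    return problems

# Example with an empty environment and a path that does not exist.
print(preflight(env={}, auth_path="/nonexistent/auth.json"))
```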
Each sweep takes ~30-60 minutes depending on model count and parallelism. Cost is dominated by Claude and Codex subscription usage (no per-call cost) and the xAI direct OAuth (SuperGrok Heavy quota).