$ methodology

Methodology

The design choices

Why 7 tasks, not 50 or 200

Each (model × task × rep) cell costs money and time. This sweep runs 23 models × 7 tasks × 2 reps = 322 trials, plus 322 judge calls. Anything much larger either overruns the overnight window or runs up real money. Seven tasks were enough to cover the major capability axes (code, structured output, comprehension, prose, reasoning, planning) without spreading too thin.
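The budget is plain multiplication, so it is easy to see how fast the grid grows when one axis changes. A minimal sketch using only the counts stated in this section; `sweep_size` is an illustrative name, not a function in the harness.

```python
def sweep_size(models: int, tasks: int, reps: int) -> dict:
    """Count trial calls and judge calls for a full (model x task x rep) grid."""
    trials = models * tasks * reps
    # Every trial is scored exactly once, so judge calls equal trial calls.
    return {"trials": trials, "judge_calls": trials, "total_calls": 2 * trials}


# The sweep described here: 23 models, 7 tasks, 2 reps.
current = sweep_size(models=23, tasks=7, reps=2)

# The same model set with a 50-task suite, for comparison.
larger = sweep_size(models=23, tasks=50, reps=2)

print(current)
print(larger)
```

Going from 7 to 50 tasks multiplies every count by roughly seven, which is the pressure on the overnight window described above.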

Why 2 reps, not 1 or 5

Variance estimation matters. Reasoning models in particular can produce different outputs at temperature 0 because of the way they format their reasoning trace. Two reps is enough to catch obvious flakes (a model that gets 5/5 once and 0/5 once is high variance, which is interesting). Five reps would tighten the signal but cost 2.5× as much.
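With only two reps per cell, the practical flake check is the spread between them. A rough sketch of that idea, assuming scores arrive as (model, task, score) tuples; the tuple layout, the `flag_flaky` name, and the threshold of 3 points are illustrative assumptions, not the harness's actual data model.

```python
from collections import defaultdict


def flag_flaky(scores, threshold=3):
    """Return cells whose rep scores differ by at least `threshold` points.

    `scores` is an iterable of (model, task, score) tuples, one per rep.
    The result maps (model, task) to its (lowest, highest) score.
    """
    by_cell = defaultdict(list)
    for model, task, score in scores:
        by_cell[(model, task)].append(score)

    flaky = {}
    for cell, values in by_cell.items():
        # A spread needs at least two reps to be meaningful.
        if len(values) >= 2 and max(values) - min(values) >= threshold:
            flaky[cell] = (min(values), max(values))
    return flaky


# Hypothetical scores: model-a swings from 5 to 0, model-b is stable.
example_scores = [
    ("model-a", "json_strict", 5),
    ("model-a", "json_strict", 0),
    ("model-b", "json_strict", 4),
    ("model-b", "json_strict", 5),
]
flaky = flag_flaky(example_scores)
print(flaky)
```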

Why temperature 0

Determinism where possible. Some providers ignore the temperature setting for reasoning models (xAI's Grok reasoning models are nearly deterministic). For non-reasoning models, 0 gives the most reproducible comparison. Creative writing would be more interesting at higher temperatures, but the comparison would be noisier.

Why the judge is never in the trial set

Self-bias is the silent killer of LLM benchmarks. A judge model tends to be more generous to a response that "looks like itself." So the judge is minimax-m3 via skynet: a different model family from the trial models, and the same judge for every cell. This is the cross-judge pattern: always judge with a different model family than the ones you trial.

What the judge sees

The judge receives: (1) the task name, (2) the original task prompt, (3) the expected answer or criteria, and (4) a 4000-character slice of the model's response. The judge returns a JSON object with score (0–5), reason (1–2 sentences), and passes_key_criteria (boolean). The criteria are task-specific: a strict rubric per task, not a generic "is this good" check.
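The judge exchange can be sketched as a payload builder plus a strict parser for the verdict. The verdict keys (score, reason, passes_key_criteria) and the 4000-character limit come from the description above. The input field names and both function names are assumptions made for illustration, and the sketch assumes the slice is taken from the start of the response.

```python
import json

MAX_RESPONSE_CHARS = 4000  # slice length stated in the text


def build_judge_input(task_name, task_prompt, expected, response):
    """Assemble the four pieces the judge sees (field names are hypothetical)."""
    return {
        "task": task_name,
        "prompt": task_prompt,
        "expected": expected,
        # Assumption: the slice is the first 4000 characters.
        "response": response[:MAX_RESPONSE_CHARS],
    }


def parse_verdict(raw):
    """Parse the judge's JSON reply and reject anything off-schema."""
    verdict = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(verdict, dict):
        raise ValueError("verdict must be a JSON object")
    score = verdict.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 5:
        raise ValueError("score must be between 0 and 5")
    if not isinstance(verdict.get("reason"), str):
        raise ValueError("reason must be a string")
    if not isinstance(verdict.get("passes_key_criteria"), bool):
        raise ValueError("passes_key_criteria must be a boolean")
    return verdict


payload = build_judge_input("summarize", "Summarize the doc.", "5 bullets", "x" * 5000)
verdict = parse_verdict('{"score": 4, "reason": "Covers the points.", "passes_key_criteria": true}')
print(len(payload["response"]), verdict["score"])
```

Rejecting off-schema verdicts up front keeps a malformed judge reply from silently turning into a score.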

How cost is captured

Each provider returns cost on the wire in a different format. The harness normalizes them:

xAI returns cost_in_usd_ticks in the response. Divide by 10^10 to get dollars. (Early versions of the harness divided by 10^8, which overstated cost by 100×.)
Codex / OpenAI returns total_cost_usd directly in the synthesized envelope. ChatGPT Plus plans don't bill per call, so this is 0 for the subscription CLI path; we still capture it for cross-platform comparison.
Anthropic returns costUSD in the JSON envelope from claude -p --output-format json. This is API-equivalent cost, not actual billing (Max plans don't bill per call).
skynet routes through LiteLLM, which returns standard usage with token counts. Per-token prices come from the per-model pricing in the response; the harness then computes prompt tokens × input_price + completion tokens × output_price. For the cheap tier, this rounds to ~$0.

When cost is $0.0000 in the table, the model is either on a subscription (Anthropic Max, ChatGPT Plus) or routed through a flat-rate tier (skynet's LiteLLM). The "free / sub" badge marks these.
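The four normalization rules above reduce to one small dispatch function. This is a sketch under assumptions, not the harness's code: the provider labels and the `pricing` key holding per-token prices are hypothetical, while cost_in_usd_ticks, total_cost_usd, costUSD, and the 10^10 divisor are taken from the text. The prompt_tokens and completion_tokens names follow the usual OpenAI-style usage object.

```python
XAI_TICKS_PER_USD = 10**10  # divisor stated in the text


def normalize_cost(provider, payload):
    """Convert a provider-specific cost field into US dollars."""
    if provider == "xai":
        return payload["cost_in_usd_ticks"] / XAI_TICKS_PER_USD
    if provider == "codex":
        return float(payload["total_cost_usd"])
    if provider == "anthropic":
        return float(payload["costUSD"])
    if provider == "skynet":
        usage = payload["usage"]
        pricing = payload["pricing"]  # hypothetical location of per-token prices
        return (
            usage["prompt_tokens"] * pricing["input_price"]
            + usage["completion_tokens"] * pricing["output_price"]
        )
    raise ValueError(f"unknown provider: {provider}")


# Hypothetical payloads, for illustration only.
xai_cost = normalize_cost("xai", {"cost_in_usd_ticks": 25_000_000})
skynet_cost = normalize_cost(
    "skynet",
    {
        "usage": {"prompt_tokens": 1000, "completion_tokens": 500},
        "pricing": {"input_price": 1e-7, "output_price": 2e-7},
    },
)
print(f"{xai_cost:.4f} {skynet_cost:.4f}")
```

At four decimal places the skynet example already prints as a near-zero cost, which is how cheap-tier rows end up looking free in the table.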

Task selection criteria

Each task was chosen to test a specific capability dimension:

TASK	CAPABILITY	JUDGE LOOKS FOR
`code_debug`	Read code, find a real bug, propose a fix	Points at the right line + explains root cause + fix is correct
`code_gen_long`	Write a complete, type-hinted, documented Python class	Has all required methods, type hints, threading.Lock, __main__ block
`json_strict`	Produce strictly valid JSON from a complex spec	Parses, all required keys, correct types, all 4 objects
`summarize`	Compress a 1500-word technical document into 5 bullets	Exactly 5 bullets, ≤20 words each, captures 5 required points
`creative_write`	Write a 300-word short-short with named character, scene, payoff	270-330 words, first-person present, named, ends on dialogue, has 3 keywords
`reasoning_multistep`	Multi-step word problem with 4 reasoning hops	All 4 steps shown, final answer is 11:00 AM
`agentic_prompt`	Given a task, produce a plan + first concrete action	Valid JSON, plan has 4-6 items, first_action is runnable code/command
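Some rubric items in the table are mechanical enough to verify without a model. As an illustration only (the benchmark scores with an LLM judge, and nothing here is claimed to be in the harness), this hypothetical helper checks the two shape rules for `summarize`. It assumes bullets start with "-", "*", or "•", so a line opening with markdown bold would be miscounted.

```python
def check_summary_shape(text, bullets_required=5, max_words=20):
    """Check bullet count and per-bullet word limit for a summary."""
    bullets = []
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: a bullet is any line starting with one of these markers.
        if stripped.startswith(("-", "*", "•")):
            bullets.append(stripped[1:].strip())
    return {
        "bullet_count_ok": len(bullets) == bullets_required,
        "length_ok": all(len(b.split()) <= max_words for b in bullets),
    }


good = "\n".join(f"- point {i} stated in a few words" for i in range(5))
too_few = "\n".join(f"- point {i}" for i in range(4))
too_long = "\n".join(["- " + "word " * 25] + [f"- point {i}" for i in range(4)])

print(check_summary_shape(good))
print(check_summary_shape(too_few))
print(check_summary_shape(too_long))
```

Whether the bullets capture the five required points still needs the judge. A check like this only covers the countable part of the rubric.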

Known limitations

2 reps is not enough for tight confidence intervals. Variance estimates are rough. A gap of 3.0 vs 3.2 across 14 trials per model (7 tasks × 2 reps) is not a meaningful difference; across 280 trials it could be.
The judge is a single model. A second cross-judge (e.g. a Claude vs minimax-m3 consensus) would catch judge-bias artifacts. A future run should do this.
Tasks are all single-turn. Multi-turn agent evaluation (tool use, memory, follow-up questions) is a different problem space entirely.
No vision / image tasks. All trials are text-only. Image and video model evaluation needs a different harness.
Score 0 ≠ model is bad. Several "0" results came from models whose reasoning trace exhausted the 1800-token budget before producing an answer. The right interpretation is: this model burned its budget thinking and didn't deliver. That's a real capability dimension, but it's not "the model is dumb."
Spec ambiguity is a real failure mode. The first three iterations of the code_debug task had spec bugs that I caught only because a model flagged them. A future harness should include a spec sanity-check pass before the full sweep.
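The first limitation above (3.0 vs 3.2 across 14 trials) can be put in rough numbers with a standard-error calculation. This is a back-of-envelope sketch: it treats per-trial scores as independent with a shared standard deviation, and the value of 1.0 for that deviation is an assumption, not a measurement from this benchmark.

```python
import math


def diff_in_standard_errors(mean_a, mean_b, sd, n_per_model):
    """Express the gap between two mean scores in standard errors.

    Assumes independent trials and the same per-trial SD for both models,
    so the standard error of the difference is sd * sqrt(2 / n).
    """
    se = sd * math.sqrt(2 / n_per_model)
    return abs(mean_a - mean_b) / se


ASSUMED_SD = 1.0  # assumption: per-trial score SD on the 0-5 scale

small = diff_in_standard_errors(3.0, 3.2, ASSUMED_SD, n_per_model=14)
large = diff_in_standard_errors(3.0, 3.2, ASSUMED_SD, n_per_model=280)

print(f"14 trials:  {small:.2f} standard errors")
print(f"280 trials: {large:.2f} standard errors")
```

Under this assumption the 0.2 gap sits well under one standard error at 14 trials and clears the conventional 1.96 threshold at 280. The conclusion is sensitive to the spread: with a per-trial SD of 1.5, the same gap would fall short of that threshold even at 280 trials.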

Reproducing this

The full pipeline is in this site's GitLab repo and at /var/lib/hermes/model-benchmark/ on the host. The harness is a single Python file (harness.py); the task definitions are in tasks.json; the model registry is in models.json. To reproduce:

Set SKYNET_API_KEY in your env and ensure OAuth creds for xAI/Claude/Codex are in ~/.hermes/auth.json
Run python3 harness.py --models <subset> --tasks <subset> --reps N --parallel 6
Run python3 scorecard.py to generate the scorecard
Run python3 sync_to_site.py to copy data into the Hugo site
Run hugo --minify and deploy the public/ directory to GitLab Pages
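The steps above can be chained from a small driver. This sketch only builds the command lists and checks the environment variable; it does not run anything. The commands and flags are the ones listed above, but the source does not specify how `--models` and `--tasks` subsets are written, so the comma-separated values below are placeholders, and the function names are hypothetical.

```python
import shlex


def missing_prereqs(env):
    """Return the names of required environment variables that are unset."""
    return [name for name in ("SKYNET_API_KEY",) if not env.get(name)]


def pipeline_commands(models, tasks, reps, parallel=6):
    """Build the four pipeline commands in the order given in the text."""
    return [
        ["python3", "harness.py", "--models", models, "--tasks", tasks,
         "--reps", str(reps), "--parallel", str(parallel)],
        ["python3", "scorecard.py"],
        ["python3", "sync_to_site.py"],
        ["hugo", "--minify"],
    ]


# Placeholder subset values; the real syntax is an assumption.
commands = pipeline_commands("model-a,model-b", "summarize,json_strict", reps=2)
for cmd in commands:
    print(shlex.join(cmd))
```

To execute the steps, each list could be passed to `subprocess.run(cmd, check=True)` from the pipeline directory, after confirming that `missing_prereqs(os.environ)` is empty. Deploying the public/ directory remains a separate step.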

Each sweep takes ~30-60 minutes depending on model count and parallelism. Most of the usage lands on the Claude and Codex subscriptions (no per-call cost) and the xAI direct OAuth path (SuperGrok Heavy quota), so a sweep mainly draws down quota rather than per-call spend.