
Raw Data

Every trial, raw

This table lists all 973 trials. Use the filters to narrow by model, task, or score. The full response text for each trial is in the per-trial files under /raw/trials/. The COST (wire) column is the wire-level cost reported by the provider (e.g. $0.0000 for skynet routes that bill on a subscription tier rather than per call); the equivalent apples-to-apples public-API cost is shown in the 1P-EST column of the /results/ leaderboard.
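For readers who want to apply the same narrowing offline, here is a minimal Python sketch of how one tab-separated row of the table below could be parsed and filtered by model, task, minimum score, and pass status. It rests on several assumptions: the `Trial` class and the `parse_row` and `filter_trials` helpers are hypothetical names, not part of the site's tooling. It also assumes that WALL is always written as seconds with a trailing `s`, and that a wire cost, when present, is written like `$0.0000`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Trial:
    """One row of the raw table (hypothetical record type for illustration)."""
    ts: str
    model: str
    task: str
    rep: int
    score: int
    passed: bool
    wall_s: float
    cost_wire: Optional[float]  # None when the table shows a dash instead of a price
    judge: str


def parse_row(line: str) -> Trial:
    """Parse one tab-separated data row (not the header line)."""
    # Split on the first 8 tabs only, so any tab inside the judge text is preserved.
    ts, model, task, rep, score, passed, wall, cost, judge = (
        line.rstrip("\n").split("\t", 8)
    )
    return Trial(
        ts=ts,
        model=model,
        task=task.strip("`"),  # task names are wrapped in backticks in the table
        rep=int(rep),
        score=int(score),
        passed=(passed == "✓"),
        wall_s=float(wall.rstrip("s")),  # assumed format: "7.1s"
        cost_wire=float(cost.lstrip("$")) if cost.startswith("$") else None,
        judge=judge,
    )


def filter_trials(
    trials: Iterable[Trial],
    model: Optional[str] = None,
    task: Optional[str] = None,
    min_score: int = 0,
    pass_only: bool = False,
) -> List[Trial]:
    """Keep trials matching every filter that is set."""
    return [
        t
        for t in trials
        if (model is None or t.model == model)
        and (task is None or t.task == task)
        and t.score >= min_score
        and (t.passed or not pass_only)
    ]


if __name__ == "__main__":
    # Two rows copied from the table; the second judge note is shortened here.
    sample = [
        "2026-07-05T05:05:57+00:00\tminimax-m3\t`reasoning_multistep`\t1\t5\t✓\t7.1s\t—\t"
        "All four steps are clearly shown and correct.",
        "2026-07-05T05:15:34+00:00\tmistral/mistral-large\t`code_gen_long`\t1\t0\t—\t5.2s\t—\t"
        "no_response",
    ]
    trials = [parse_row(row) for row in sample]
    for t in filter_trials(trials, min_score=5, pass_only=True):
        print(t.model, t.task, t.score)
```

Rows whose JUDGE text starts with `judge_parse_error` or equals `no_response` carry a score of 0 in the table, so a caller may want to handle them separately from genuine low scores; that choice is left to the reader.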

Model: Task: Min score: Pass only:

TS	MODEL	TASK	REP	SCORE	PASS	WALL	COST (wire)	JUDGE
2026-07-05T05:05:28+00:00	minimax-m3	`json_strict`	2	5	✓	5.6s	—	The response is valid JSON containing exactly 4 objects, all with the required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6-character codes (CMG001, AUR042, TKL100,
2026-07-05T05:05:28+00:00	minimax-m3	`json_strict`	1	5	✓	7.8s	—	Response is valid JSON, contains exactly 4 objects, all required keys present (sku, name, price_usd, in_stock, tags, added_on), all SKUs are 6 uppercase characters (CMG001, AUR042, TKL100, BOK512), al
2026-07-05T05:05:28+00:00	minimax-m3	`code_debug`	1	3	—	15.0s	—	Root cause correctly identifies that the sort key is the message string instead of timestamp, and the fix correctly sorts by timestamp. However, the buggy_line is reported as 10 instead of the expecte
2026-07-05T05:05:40+00:00	minimax-m3	`summarize`	2	4	✓	7.6s	—	Response delivers exactly 5 bullets, all within 20-word limits (17, 16, 16, 13, 19 words), with no fluff. Covers distinct dimensions (decision, rationale, trade-off, metrics, open question) appropriat
2026-07-05T05:05:28+00:00	minimax-m3	`code_gen_long`	2	1	—	23.2s	—	The response is truncated mid-code at 'd' (likely starting 'def _refill'), so the actual deliverable class is not visible. While the planning demonstrates correct understanding (threading.Lock, type h
2026-07-05T05:05:28+00:00	minimax-m3	`code_gen_long`	1	3	—	23.3s	—	The class implementation is well-designed: it uses threading.Lock, has full type hints (including Final), includes all 3 expected methods (__init__, try_acquire, time_until_available) with correct sig
2026-07-05T05:05:38+00:00	minimax-m3	`summarize`	1	0	—	8.7s	—	judge_parse_error: ```json { "score": 3, "reason": "Format is correct (exactly 5 bullets, each ≤20 words), but the 5th bullet is
2026-07-05T05:05:57+00:00	minimax-m3	`reasoning_multistep`	1	5	✓	7.1s	—	All four steps are clearly shown and correct: head start distance (30 km), remaining gap (210 km), combined speed (140 km/h), and meeting time (11:00 AM). Final answer matches expected.
2026-07-05T05:05:59+00:00	minimax-m3	`reasoning_multistep`	2	5	✓	5.2s	—	All four steps are clearly shown with correct calculations, and the final answer of 11:00 AM matches the expected answer. Verification step also confirms correctness.
2026-07-05T05:05:28+00:00	minimax-m3	`code_debug`	2	3	—	33.1s	—	The root_cause correctly identifies that the sort key uses the message string instead of the timestamp, and the fix correctly sorts by timestamp. However, the buggy_line is reported as 8 instead of th
2026-07-05T05:05:47+00:00	minimax-m3	`creative_write`	1	0	—	26.1s	—	judge_parse_error: Let me carefully evaluate the model's response against the criteria: 1. Word count 270-330: The actual story content starts at "I count the wick trim by lamp..." and ends at ".
2026-07-05T05:05:55+00:00	minimax-m3	`creative_write`	2	1	—	26.5s	—	The response is a working draft filled with meta-commentary, planning notes, and multiple unfinished versions rather than a complete story. The protagonist is never named (only the deceased wife 'Marg
2026-07-05T05:06:05+00:00	minimax-m3	`agentic_prompt`	1	5	✓	20.5s	—	Valid JSON with plan (5 items, within 4-6), risks (4 items, within 2-4), a concrete and immediately runnable first_action (ls/find/version checks), and a dry_run_command that is a single shell line (u
2026-07-05T05:06:05+00:00	minimax-m3	`agentic_prompt`	2	3	—	22.8s	—	Plan has 5 items, risks has 4 items, and first_action is concrete and runnable — all good. However, the dry_run_command contains a literal newline (`\n` in the JSON string resolves to an actual newlin
2026-07-05T05:06:20+00:00	minimax-m2.7	`json_strict`	1	5	✓	25.9s	—	Response is valid JSON with 4 objects, all containing required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6-char (CMG001, AUR042, TKL100, BOK512), all dates are YYYY
2026-07-05T05:06:26+00:00	minimax-m2.7	`json_strict`	2	5	✓	34.3s	—	The response is valid JSON containing exactly 4 objects. All keys (sku, name, price_usd, in_stock, tags, added_on) are present in each object. All SKUs are uppercase 6-character strings (CMG001, AUR04
2026-07-05T05:06:55+00:00	minimax-m2.7	`summarize`	1	4	✓	17.6s	—	Response provides exactly 5 bullets, each well under 20 words, with clear, terse phrasing and no fluff. Structure (recommendation, benefit, tradeoff, KPIs, open question) is coherent and decision-orie
2026-07-05T05:06:05+00:00	minimax-m2.7	`code_debug`	1	3	—	79.4s	—	The model correctly identifies the buggy line (`errors.sort(key=lambda m: m)`) and accurately explains the root cause (sort key is the message string, losing timestamp info) and provides valid fixes (
2026-07-05T05:07:08+00:00	minimax-m2.7	`summarize`	2	5	✓	18.6s	—	Exactly 5 bullets provided, each under 20 words (ranging from 10-18 words), with no fluff and clear capture of 5 distinct points: recommendation, benefits, trade-offs, metrics, and open question. Form
2026-07-05T05:06:06+00:00	minimax-m2.7	`code_debug`	2	1	—	87.9s	—	The response correctly identifies the conceptual bug (sort key is message string instead of timestamp) but fails to produce a clean JSON output. It claims the buggy line is 8 rather than the correct l
2026-07-05T05:06:30+00:00	minimax-m2.7	`code_gen_long`	1	3	—	76.7s	—	The RateLimiter class is well-implemented with correct signatures, threading.Lock usage, and comprehensive type hints. However, the __main__ demo is truncated mid-line ('Exhaust tokens: try to acquire
2026-07-05T05:06:43+00:00	minimax-m2.7	`code_gen_long`	2	5	✓	69.4s	—	The code defines a RateLimiter class with proper __init__, try_acquire, and time_until_available methods (with helper _refill). It uses threading.Lock for thread safety (both public methods acquire th
2026-07-05T05:07:22+00:00	minimax-m2.7	`creative_write`	1	0	—	34.1s	—	judge_parse_error: Let me evaluate the response against the criteria: 1. 270-330 words: Let me count... "The lamp hums its steady vigil as I climb the spiral stairs, each step worn smooth by dec
2026-07-05T05:07:31+00:00	minimax-m2.7	`reasoning_multistep`	1	5	✓	34.4s	—	All 4 steps are correctly shown: head start distance (30 km), separation (210 km), combined speed (140 km/h), time to meet (1.5 h), and final answer 11:00 AM matches expected.
2026-07-05T05:07:38+00:00	minimax-m2.7	`reasoning_multistep`	2	5	✓	33.1s	—	All four steps are shown correctly with proper reasoning: distance calculation, gap calculation, combined closing speed, and time-to-meet computation. Final answer of 11:00 AM matches the expected ans
2026-07-05T05:08:08+00:00	minimax-m2.7-highspeed	`code_debug`	1	4	✓	11.6s	—	Correctly identifies line 7 as the buggy line, accurately explains that the sort key is the message string instead of the timestamp (causing alphabetical instead of chronological ordering), and propos
2026-07-05T05:07:54+00:00	minimax-m2.7	`agentic_prompt`	1	5	✓	29.4s	—	Response is valid JSON with plan (6 items, within 4-6 range), risks (4 items, within 2-4 range), a concrete and immediately runnable first_action ('which exiftool && python3 --version'), and a single-
2026-07-05T05:08:00+00:00	minimax-m2.7	`agentic_prompt`	2	5	✓	25.7s	—	Response is valid JSON. Plan has 5 items (within 4-6 range), risks has 4 items (within 2-4 range), first_action ('exiftool -ver && python3 ...') is concrete and immediately runnable, and dry_run_comma
2026-07-05T05:08:14+00:00	minimax-m2.7-highspeed	`json_strict`	1	5	✓	16.0s	—	Valid JSON with exactly 4 objects, all 6 keys present in each, SKUs are uppercase 6-character (CMG001, AUR042, TKL100, BOK512), dates are in YYYY-MM-DD format, and all tags are lowercase including hyp
2026-07-05T05:08:33+00:00	minimax-m2.7-highspeed	`summarize`	1	3	—	8.5s	—	Meets format criteria (5 bullets, each ≤20 words) and content is technically coherent on modular monolith vs microservices. However, the 5th bullet is phrased as an open question rather than a conclus
2026-07-05T05:07:30+00:00	minimax-m2.7	`creative_write`	2	0	—	80.0s	—	The response contains only planning notes and outline reasoning about how to write the story, but never actually produces the creative writing piece itself. None of the criteria (word count, first-per
2026-07-05T05:08:27+00:00	minimax-m2.7-highspeed	`code_gen_long`	1	5	✓	27.5s	—	The code meets all stated criteria: class compiles, has 3 methods (__init__, try_acquire, time_until_available) with correct signatures, uses threading.Lock via threading.Lock() with context manager,
2026-07-05T05:08:24+00:00	minimax-m2.7-highspeed	`json_strict`	2	5	✓	34.9s	—	All criteria are satisfied: the response parses as a valid JSON array of exactly 4 objects, all objects share consistent keys (sku, name, price_usd, in_stock, tags, added_on), all SKUs are uppercase 6
2026-07-05T05:08:49+00:00	minimax-m2.7-highspeed	`summarize`	2	4	—	7.1s	—	Format is perfect: exactly 5 bullets, each under 20 words (longest is 12 words), no fluff, concise. Content covers decision, benefits/cost trade-off, scalability trade-off, KPIs, and open question—app
2026-07-05T05:08:09+00:00	minimax-m2.7-highspeed	`code_debug`	2	2	—	51.2s	—	The model correctly identifies the buggy code as the sort line using message string instead of timestamp, and provides valid fix options (storing tuples or entries). However, it fails to return the re
2026-07-05T05:08:31+00:00	minimax-m2.7-highspeed	`code_gen_long`	2	5	✓	40.1s	—	Class is syntactically valid Python, contains all 3 methods (__init__, try_acquire, time_until_available) with proper type hints on parameters and return types, uses threading.Lock with context manage
2026-07-05T05:09:06+00:00	minimax-m2.7-highspeed	`reasoning_multistep`	2	5	✓	17.1s	—	All 4 steps are shown correctly: head start distance (30 km), remaining gap (210 km), closing speed (140 km/h), and time to meet (1.5 hours from 9:30 AM = 11:00 AM). Final answer matches expected.
2026-07-05T05:09:04+00:00	minimax-m2.7-highspeed	`reasoning_multistep`	1	5	✓	20.0s	—	All 4 steps are shown correctly: head start distance (30 km), remaining gap (210 km), closing speed (140 km/h), and meeting time (1.5 h from 9:30 AM). The final answer of 11:00 AM matches the expected
2026-07-05T05:08:54+00:00	minimax-m2.7-highspeed	`creative_write`	1	0	—	28.7s	—	judge_parse_error: Let me evaluate the response against the criteria: 1. Word count 270-330 words: Let me count. "The Wavelength" - title, then the text starts with "My name is Ellis Vane..." Let
2026-07-05T05:09:07+00:00	minimax-m2.7-highspeed	`agentic_prompt`	1	5	✓	27.4s	—	Valid JSON with all required fields. Plan has exactly 6 items (within 4-6 range), risks has 3 items (within 2-4 range), first_action is a concrete executable command checking tool availability, and dr
2026-07-05T05:09:33+00:00	zai-coding/glm-5.2	`json_strict`	1	5	✓	6.0s	—	The response is valid JSON with exactly 4 objects, all containing the required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are 6 uppercase chars (CMG001, AUR042, TKL100, BOK512), a
2026-07-05T05:08:58+00:00	minimax-m2.7-highspeed	`creative_write`	2	1	—	46.1s	—	The response massively exceeds the 270-330 word limit (appears 500+ words), the protagonist is never named (only the dead wife 'Elaine' is named), the piece degenerates into repetitive loops and is tr
2026-07-05T05:09:38+00:00	zai-coding/glm-5.2	`json_strict`	2	5	✓	8.1s	—	The response is valid JSON with exactly 4 objects, all containing the required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are 6 uppercase characters (CMG001, AUR042, TKL100, BOK51
2026-07-05T05:09:42+00:00	zai-coding/glm-5.2	`code_gen_long`	1	5	✓	19.4s	—	The code defines a RateLimiter class that compiles, has all required methods (__init__, try_acquire, time_until_available) with proper signatures and type hints, uses threading.Lock for thread safety
2026-07-05T05:09:15+00:00	minimax-m2.7-highspeed	`agentic_prompt`	2	0	—	49.0s	—	The response is entirely internal deliberation/planning and never produces the required valid JSON object with the specified keys (plan, risks, first_action, dry_run_command). None of the criteria — v
2026-07-05T05:09:49+00:00	zai-coding/glm-5.2	`code_gen_long`	2	5	✓	14.6s	—	The class compiles, provides all three methods (__init__, try_acquire, time_until_available) with correct signatures and type hints, uses threading.Lock for thread-safety in both public methods, and i
2026-07-05T05:09:50+00:00	zai-coding/glm-5.2	`summarize`	1	4	—	26.9s	—	Response meets the structural criteria: exactly 5 bullets, each ≤20 words (ranging from 11-14 words), with clear categorical labels (Decision, Reason, Trade-off, Metric, Open Question) and no fluff. C
2026-07-05T05:09:25+00:00	zai-coding/glm-5.2	`code_debug`	1	2	—	55.0s	—	The root cause is correctly identified (sort key is message string, not timestamp, causing alphabetical instead of chronological order). However, the model incorrectly identifies the buggy line as lin
2026-07-05T05:09:26+00:00	zai-coding/glm-5.2	`code_debug`	2	1	—	62.6s	—	The model identifies the sort-by-message issue as the root cause, but incorrectly labels the buggy line as 8 instead of 7 as specified in the criteria. More critically, the response is truncated and n
2026-07-05T05:10:22+00:00	zai-coding/glm-5.2	`reasoning_multistep`	1	5	✓	12.0s	—	All 4 steps are shown and correctly computed: head start distance (30 km), remaining distance (210 km), combined speed (140 km/h), and meeting time (1.5 hours). Final answer is 11:00 AM as expected.
2026-07-05T05:10:29+00:00	zai-coding/glm-5.2	`reasoning_multistep`	2	5	✓	14.4s	—	All 4 steps are correctly shown: (1) head start distance = 30 km, (2) remaining distance = 210 km, (3) closing speed = 140 km/h, (4) time to meet = 1.5 hours, giving final answer of 11:00 AM which mat
2026-07-05T05:10:05+00:00	zai-coding/glm-5.2	`summarize`	2	5	✓	52.4s	—	Exactly 5 bullets provided, each under 20 words (longest is 14 words), covering all 5 required points (Decision, Reason, Trade-off, Metrics, Open Question) with concrete, specific details and no fluff
2026-07-05T05:10:37+00:00	zai-coding/glm-5.2	`agentic_prompt`	1	5	✓	27.1s	—	All criteria are met: valid JSON, plan has 5 items (within 4-6), risks has 4 items (within 2-4), first_action is a concrete, immediately runnable shell pipeline for env probing and sampling, and dry_r
2026-07-05T05:10:10+00:00	zai-coding/glm-5.2	`creative_write`	2	5	✓	55.1s	—	Story meets all criteria: approximately 290 words (within 270-330), first-person present tense throughout ('I climb', 'I set down', etc.), named protagonist 'Ellen Voss', ends on single line of dialog
2026-07-05T05:10:10+00:00	zai-coding/glm-5.2	`creative_write`	1	0	—	53.6s	—	judge_parse_error: The user wants me to judge a creative writing task. Let me evaluate the response against the criteria: 1. 270-330 words: The response got cut off with the model doing word coun
2026-07-05T05:11:12+00:00	zai-coding/glm-5	`json_strict`	1	5	✓	15.1s	—	The response is a valid JSON array with exactly 4 objects, each containing all 6 required keys. All SKUs are uppercase 6-character strings (CMG001, AUR042, TKL100, BOK512), all dates follow YYYY-MM-DD
2026-07-05T05:10:37+00:00	zai-coding/glm-5.2	`agentic_prompt`	2	4	—	45.4s	—	Plan has 6 items (within 4-6 range), risks has 4 items (within 2-4 range), first_action is a concrete bash heredoc that is immediately runnable, and the JSON is valid. However, the 'dry_run_command' f
2026-07-05T05:11:16+00:00	zai-coding/glm-5	`json_strict`	2	5	✓	21.4s	—	Valid JSON with 4 objects, all required keys present, SKUs are uppercase 6 chars (CMG001, AUR042, TKL100, BOK512), dates in YYYY-MM-DD format, and all tags are lowercase. All criteria fully satisfied.
2026-07-05T05:11:17+00:00	zai-coding/glm-5	`code_gen_long`	1	5	✓	19.4s	—	The code compiles cleanly, defines a `RateLimiter` class with three well-typed methods (`__init__`, `try_acquire`, `time_until_available`), uses `threading.Lock` correctly via context managers in all
2026-07-05T05:10:47+00:00	zai-coding/glm-5	`code_debug`	1	4	—	52.0s	—	The root_cause correctly identifies that the sort key uses the message string instead of the timestamp, and the fix correctly sorts by `e['timestamp']`. However, the buggy_line is reported as 8 instea
2026-07-05T05:11:30+00:00	zai-coding/glm-5	`code_gen_long`	2	5	✓	18.5s	—	The code compiles, defines a RateLimiter class with all 3 required methods (__init__, try_acquire, time_until_available) with proper type hints, uses threading.Lock with context managers for thread-sa
2026-07-05T05:11:41+00:00	zai-coding/glm-5	`summarize`	2	5	✓	21.8s	—	Exactly 5 bullets, each ≤20 words (all 10-12 words), cleanly maps to the 5 required points (Core Decision, Main Reason, Main Trade-off, Evaluation Metric, Open Question) with no fluff. Structure and c
2026-07-05T05:11:53+00:00	zai-coding/glm-5	`reasoning_multistep`	1	5	✓	15.0s	—	All four steps are shown correctly: head start distance (30 km), remaining gap (210 km), combined closing speed (140 km/h), and meeting time calculation yielding 11:00 AM exactly.
2026-07-05T05:11:02+00:00	zai-coding/glm-5	`code_debug`	2	1	—	65.3s	—	The model never returned a proper JSON object as required, instead providing lengthy stream-of-consciousness analysis. It identified the buggy line as 8 rather than the expected 7, and while the root
2026-07-05T05:11:33+00:00	zai-coding/glm-5	`summarize`	1	5	✓	47.7s	—	Response delivers exactly 5 bullets, each well under 20 words (range 10-15), covering the five required architecture-decision points (Decision, Reason, Trade-off, Metric, Question) with no fluff or fa
2026-07-05T05:12:09+00:00	zai-coding/glm-5	`reasoning_multistep`	2	5	✓	17.6s	—	All 4 steps are shown correctly with accurate calculations: head start distance (30 km), remaining distance (210 km), combined closing speed (140 km/h), and meeting time (1.5 hours after 9:30 AM = 11:
2026-07-05T05:11:42+00:00	zai-coding/glm-5	`creative_write`	1	0	—	44.5s	—	judge_parse_error: Let me carefully evaluate this response against all the criteria: 1. 270-330 words: The response includes the planning/drafting notes AND the story. The story itself, counting
2026-07-05T05:12:10+00:00	zai-coding/glm-5	`agentic_prompt`	1	5	✓	21.8s	—	Valid JSON output. Plan has exactly 5 items (within 4-6 range). Risks has exactly 4 items (within 2-4 range). first_action is a concrete, immediately runnable shell command with which/find/exiftool in
2026-07-05T05:11:49+00:00	zai-coding/glm-5	`creative_write`	2	5	✓	61.1s	—	The story meets all stated criteria: first-person present tense throughout, named protagonist (Sargeant), contains 'lamp', 'fog', and 'circuits', ends on the single dialogue line 'COME HOME', and word
2026-07-05T05:12:35+00:00	zai-coding/glm-4.7	`json_strict`	2	5	✓	34.8s	—	Response is valid JSON with exactly 4 objects, all containing the required 6 keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs (CMG001, AUR042, TKL100, BOK512) are uppercase 6-character
2026-07-05T05:12:35+00:00	zai-coding/glm-4.7	`json_strict`	1	5	✓	34.5s	—	Response is valid JSON with exactly 4 objects, all containing consistent keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6 characters (CMG001, AUR042, TKL100, BOK512), al
2026-07-05T05:12:14+00:00	zai-coding/glm-5	`agentic_prompt`	2	0	—	86.4s	—	The response never produces a JSON object — it is only truncated stream-of-consciousness deliberation ending mid-word ('pytho'). None of the required keys (plan, risks, first_action, dry_run_command)
2026-07-05T05:12:57+00:00	zai-coding/glm-4.7	`code_gen_long`	1	5	✓	43.1s	—	All criteria are met: the RateLimiter class compiles, has 3 methods (__init__, try_acquire, time_until_available) with proper type-annotated signatures, uses threading.Lock with context managers for t
2026-07-05T05:13:13+00:00	zai-coding/glm-4.7	`summarize`	1	4	—	32.8s	—	Format is correct: exactly 5 bullets, each well under 20 words (6-9 words each), concise with no fluff. Without the source material, full verification of factual accuracy and completeness of all 5 req
2026-07-05T05:13:13+00:00	zai-coding/glm-4.7	`code_gen_long`	2	5	✓	40.8s	—	All criteria are met: the RateLimiter class compiles, contains all three expected methods (__init__, try_acquire, time_until_available) with correct signatures, uses threading.Lock for thread safety,
2026-07-05T05:13:43+00:00	zai-coding/glm-4.7	`summarize`	2	4	✓	30.9s	—	Meets all format criteria perfectly: exactly 5 bullets, each well under 20 words, concise with no fluff. Content appears to cover modular monolith, deploy friction, microservices trade-off, monitoring
2026-07-05T05:13:58+00:00	zai-coding/glm-4.7	`reasoning_multistep`	1	5	✓	24.6s	—	All 4 steps are shown correctly: (1) Train A's head start distance = 30 km, (2) remaining distance = 210 km, (3) combined speed = 140 km/h, (4) time to meet = 1.5 hours, yielding 11:00 AM. Final answe
2026-07-05T05:14:19+00:00	zai-coding/glm-4.7	`reasoning_multistep`	2	5	✓	27.9s	—	All 4 steps are shown correctly with proper calculations: head start distance (30 km), remaining distance (210 km), combined speed (140 km/h), and meeting time (1.5 hours after 9:30 AM = 11:00 AM). Fi
2026-07-05T05:12:27+00:00	zai-coding/glm-4.7	`code_debug`	2	3	—	157.0s	—	The fix correctly sorts by timestamp before extracting messages, and the root cause captures the alphabetical-sorting issue. However, the buggy_line is reported as 8 instead of the expected 7, which i
2026-07-05T05:15:10+00:00	mistral/mistral-large	`code_debug`	1	5	✓	1.5s	—	Correctly identifies buggy_line as 7, accurately explains root cause (sort key is message not timestamp), and provides a valid fix using `next()` with a generator expression to look up the timestamp f
2026-07-05T05:15:13+00:00	mistral/mistral-large	`code_debug`	2	5	✓	1.5s	—	All three criteria are met: buggy_line is correctly identified as 7, root_cause correctly explains the sort key is the message string instead of the timestamp, and the fix correctly sorts by the times
2026-07-05T05:15:17+00:00	mistral/mistral-large	`json_strict`	1	5	✓	3.7s	—	Valid JSON array with 4 complete objects. All SKUs (CMG001, AUR042, TKL100, BOK512) are 6 uppercase characters. All dates are in YYYY-MM-DD format. All tags are lowercase. All required keys (sku, name
2026-07-05T05:15:26+00:00	mistral/mistral-large	`json_strict`	2	5	✓	3.7s	—	All criteria are met: response is valid JSON, contains exactly 4 objects, all keys (sku, name, price_usd, in_stock, tags, added_on) are present consistently, SKUs are all 6 uppercase alphanumeric char
2026-07-05T05:15:34+00:00	mistral/mistral-large	`code_gen_long`	1	0	—	5.2s	—	no_response
2026-07-05T05:15:39+00:00	mistral/mistral-large	`code_gen_long`	2	0	—	4.9s	—	no_response
2026-07-05T05:15:44+00:00	mistral/mistral-large	`summarize`	1	0	—	5.9s	—	no_response
2026-07-05T05:13:45+00:00	zai-coding/glm-4.7	`creative_write`	1	0	—	119.6s	—	judge_parse_error: Let me evaluate this creative writing response against the criteria: 1. 270-330 words: Let me count the words in the response. "The fog presses against the glass panes, a solid
2026-07-05T05:15:50+00:00	mistral/mistral-large	`summarize`	2	0	—	5.2s	—	no_response
2026-07-05T05:15:53+00:00	mistral/mistral-large	`creative_write`	1	0	—	5.3s	—	no_response
2026-07-05T05:15:55+00:00	mistral/mistral-large	`creative_write`	2	0	—	5.4s	—	no_response
2026-07-05T05:15:59+00:00	mistral/mistral-large	`reasoning_multistep`	1	5	✓	5.5s	—	All 4 steps are shown correctly: distance calculation (30 km), remaining distance (210 km), combined speed (140 km/h), and meeting time (11:00 AM). Final answer matches expected.
2026-07-05T05:16:00+00:00	mistral/mistral-large	`reasoning_multistep`	2	5	✓	4.9s	—	All 4 steps are shown correctly: head start distance (30 km), remaining distance (210 km), combined closing speed (140 km/h), and meeting time calculation yielding 11:00 AM.
2026-07-05T05:16:06+00:00	mistral/mistral-large	`agentic_prompt`	2	4	✓	8.2s	—	Valid JSON with 6 plan items (meets 4-6 range), 4 risks (meets 2-4 range), and a single-line dry_run_command. The first_action is concrete and runnable but is excessively verbose—it embeds an entire b
2026-07-05T05:16:06+00:00	mistral/mistral-large	`agentic_prompt`	1	5	✓	9.9s	—	All criteria are met: the response is valid JSON, the plan has 6 items (within 4-6 range), risks has 4 items (within 2-4 range), the first_action is a concrete runnable command creating the test direc
2026-07-05T05:12:27+00:00	zai-coding/glm-4.7	`code_debug`	1	0	—	240.2s	—	no_response
2026-07-05T05:16:22+00:00	mistral/mistral-medium	`code_debug`	2	5	✓	0.7s	—	The response correctly identifies line 7 as the buggy line, explains the root cause as sorting by message string instead of timestamp, and provides a fix that sorts by timestamp (the 'restructure to k
2026-07-05T05:16:20+00:00	mistral/mistral-medium	`code_debug`	1	4	✓	0.8s	—	buggy_line correctly identifies line 7, root_cause correctly identifies sort by message string instead of timestamp. The fix `errors.sort(key=lambda e: e['timestamp'])` assumes `errors` contains dicts
2026-07-05T05:16:29+00:00	mistral/mistral-medium	`json_strict`	2	5	✓	1.4s	—	Response is valid JSON containing exactly 4 objects. All objects have the required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are 6 uppercase characters (CMG001, AUR042, TKL100, B
2026-07-05T05:16:27+00:00	mistral/mistral-medium	`json_strict`	1	5	✓	4.2s	—	Response is valid JSON with exactly 4 objects, all keys present (sku, name, price_usd, in_stock, tags, added_on), all SKUs are 6 uppercase chars (CMG001, AUR042, TKL100, BOK512), all dates are in YYYY
2026-07-05T05:16:35+00:00	mistral/mistral-medium	`summarize`	1	5	✓	1.2s	—	Exactly 5 bullets, each well under 20 words, covering Decision, Reason, Trade-off, Metric, and Open question with no fluff or factual drift.
2026-07-05T05:16:30+00:00	mistral/mistral-medium	`code_gen_long`	1	5	✓	9.9s	—	The code compiles correctly, defines a RateLimiter class with __init__, try_acquire, and time_until_available methods with proper type hints, uses threading.Lock for thread safety, and includes a __ma
2026-07-05T05:14:48+00:00	zai-coding/glm-4.7	`agentic_prompt`	2	3	—	110.3s	—	Plan has 4 items and risks has 4 items (both within range). first_action is a concrete, immediately runnable heredoc. However, the dry_run_command is clearly truncated mid-line ('python3 /tmp/organize
2026-07-05T05:14:24+00:00	zai-coding/glm-4.7	`agentic_prompt`	1	2	—	133.2s	—	The JSON appears truncated (no closing brace visible) and the required 'dry_run_command' field is entirely missing, violating a key criterion. While plan (6 items) and risks (4 items) meet the count r
2026-07-05T05:13:52+00:00	zai-coding/glm-4.7	`creative_write`	2	0	—	164.4s	—	judge_parse_error: Let me analyze the model's response against the criteria: 1. Word count: 270-330 words Let me count... "I wipe the condensation from the windowpane, but the glass remains opaq
2026-07-05T05:16:42+00:00	mistral/mistral-medium	`summarize`	2	5	✓	0.9s	—	Response delivers exactly 5 bullets, each under 20 words, with clear structure (Decision, Reason, Trade-off, Metric, Open question). No fluff, concise, and each bullet is factually grounded with no dr
2026-07-05T05:16:47+00:00	mistral/mistral-medium	`reasoning_multistep`	2	5	✓	2.2s	—	All 4 steps are correctly shown: head start distance (30 km), remaining gap (210 km), combined speed (140 km/h), and meeting time (1.5 hours after 9:30 AM = 11:00 AM). Final answer matches expected.
2026-07-05T05:16:44+00:00	mistral/mistral-medium	`reasoning_multistep`	1	5	✓	5.7s	—	All 4 steps are shown correctly: head start distance (30 km), remaining gap (210 km), combined speed (140 km/h), and meeting time (1.5 hours after 9:30 AM = 11:00 AM). Final answer matches expected.
2026-07-05T05:16:35+00:00	mistral/mistral-medium	`code_gen_long`	2	5	✓	3.8s	—	The code meets all criteria: it compiles, contains 3 methods (__init__, try_acquire, time_until_available) with proper type hints, uses threading.Lock with 'with self.lock:' blocks for thread safety,
2026-07-05T05:16:43+00:00	mistral/mistral-medium	`creative_write`	1	0	—	3.6s	—	judge_parse_error: Let me evaluate this response against the criteria: 1. Word count: 270-330 words: Let me count. "I wake to the lamp's steady pulse, the fog pressed against the glass like a br
2026-07-05T05:16:48+00:00	mistral/mistral-medium	`agentic_prompt`	1	5	✓	5.3s	—	Valid JSON with all required fields. Plan has 5 items (within 4-6), risks has 4 items (within 2-4), first_action is a concrete heredoc command that immediately writes a runnable Python script, and dry
2026-07-05T05:16:52+00:00	mistral/devstral	`code_debug`	1	3	—	1.2s	—	Correctly identifies buggy_line as 7 and accurately states the root cause (sort key is message string instead of timestamp). However, the fix `errors.sort(key=lambda e: e['timestamp'])` is broken as w
2026-07-05T05:16:44+00:00	mistral/mistral-medium	`creative_write`	2	0	—	5.0s	—	judge_parse_error: Let me evaluate this response against the criteria: 1. Word count 270-330: Let me count the words in the response. "I wake to the lamp's steady pulse, the fog pressed against t
2026-07-05T05:16:54+00:00	mistral/devstral	`code_debug`	2	2	—	1.1s	—	root_cause correctly identifies that the sort key is the message string instead of the timestamp (matches expected). However, buggy_line is reported as 8 instead of 7. The fix `errors.sort(key=lambda
2026-07-05T05:16:51+00:00	mistral/mistral-medium	`agentic_prompt`	2	5	✓	5.0s	—	All criteria are met: valid JSON, plan has exactly 5 items (within 4-6 range), risks has exactly 4 items (within 2-4 range), first_action is a concrete heredoc that creates an immediately runnable Pyt
2026-07-05T05:16:57+00:00	mistral/devstral	`json_strict`	2	5	✓	3.4s	—	Valid JSON with exactly 4 objects, all containing sku/name/price_usd/in_stock/tags/added_on keys. All SKUs are uppercase 6 chars (CMG001, AUR042, TKL100, BOK512). All dates are YYYY-MM-DD format. All
2026-07-05T05:17:00+00:00	mistral/devstral	`summarize`	1	5	✓	1.9s	—	Exactly 5 bullets, each well under 20 words (ranging 9-13 words). Captures all five standard architecture decision summary points (Decision, Reason, Trade-off, Metric, Open question) with concrete spe
2026-07-05T05:17:00+00:00	mistral/devstral	`summarize`	2	4	✓	2.2s	—	Format is perfect: exactly 5 bullets, all well under 20 words, no fluff. Covers a logical decision/reason/trade-off/metric/open-question structure. Cannot fully verify 'captures all 5 required points'
2026-07-05T05:16:54+00:00	mistral/devstral	`json_strict`	1	5	✓	10.1s	—	All criteria are met: valid JSON, exactly 4 objects, all SKUs are 6 uppercase characters (CMG001, AUR042, TKL100, BOK512), all dates follow YYYY-MM-DD format, and all tags are lowercase.
2026-07-05T05:17:00+00:00	mistral/devstral	`code_gen_long`	2	5	✓	11.0s	—	All criteria are met: the class compiles, has 3 main methods (__init__, try_acquire, time_until_available) with proper signatures, uses threading.Lock for thread safety (self._lock with 'with' context
2026-07-05T05:17:07+00:00	mistral/devstral	`reasoning_multistep`	1	5	✓	4.9s	—	All 4 steps are correctly shown: head start distance (30 km), distance between trains at 9:30 AM (210 km), combined closing speed (140 km/h), and meeting time (1.5 hours after 9:30 AM = 11:00 AM). The
2026-07-05T05:17:08+00:00	mistral/devstral	`reasoning_multistep`	2	5	✓	4.7s	—	All 4 steps are clearly shown with correct calculations: head start distance (30 km), remaining gap (210 km), combined speed (140 km/h), and time to meet (1.5 hours after 9:30 AM = 11:00 AM). Final an
2026-07-05T05:16:59+00:00	mistral/devstral	`code_gen_long`	1	5	✓	10.9s	—	The code meets all stated criteria: the class compiles, includes all 3 required methods (__init__, try_acquire, time_until_available) with correct signatures, uses threading.Lock for thread safety, ha
2026-07-05T05:17:05+00:00	mistral/devstral	`creative_write`	1	0	—	6.6s	—	judge_parse_error: Let me evaluate this response against the criteria: 1. 270-330 words: Let me count... The response appears to be around 230-250 words. Let me recount carefully. "I wake to the
2026-07-05T05:17:14+00:00	mistral/devstral	`agentic_prompt`	1	5	✓	2.9s	—	All criteria are met: valid JSON output, plan contains exactly 5 items (within 4-6 range), risks contains exactly 4 items (within 2-4 range), first_action is a concrete immediately runnable shell comm
2026-07-05T05:17:15+00:00	mistral/devstral	`agentic_prompt`	2	5	✓	2.9s	—	Valid JSON with plan (5 items, within 4-6 range), risks (4 items, within 2-4 range), first_action is a concrete and immediately runnable cp command, and dry_run_command is a single shell line.
2026-07-05T05:17:07+00:00	mistral/devstral	`creative_write`	2	0	—	3.9s	—	judge_parse_error: Let me evaluate this creative writing piece against the criteria: 1. Word count: 270-330 words Let me count: "I wake to the lamp's steady pulse, the fog pressed against the gla
2026-07-05T05:17:15+00:00	ollama-cl/kimi-k2.5	`code_debug`	1	2	—	20.9s	—	The model correctly identifies the conceptual root cause (sorting alphabetically by message string instead of by timestamp), but is confused about line numbers, wavering between line 6, 8, and 9 when
2026-07-05T05:17:19+00:00	ollama-cl/kimi-k2.5	`json_strict`	2	0	—	22.8s	—	The response contains only the model's internal reasoning/planning text and never actually emits the required JSON array. It is truncated mid-construction ('name': 'Bl...') with no parseable JSON, no
2026-07-05T05:17:16+00:00	ollama-cl/kimi-k2.5	`code_debug`	2	3	✓	19.8s	—	The model correctly identifies the root cause (sort key is the message string instead of the timestamp) and proposes a valid fix (store tuples of (timestamp, message) then sort). However, it states th
2026-07-05T05:17:19+00:00	ollama-cl/kimi-k2.5	`json_strict`	1	2	—	22.4s	—	Response is truncated mid-stream after only 2 complete objects and a partial third ('sku' alone). Cannot verify the required 4 objects. Visible portions show correct SKU format (CMG001, AUR042 - 6 upp
2026-07-05T05:17:20+00:00	ollama-cl/kimi-k2.5	`code_gen_long`	2	5	✓	30.4s	—	The code meets all stated criteria: the RateLimiter class compiles cleanly, includes __init__, try_acquire, and time_until_available (plus a helper _add_tokens) with proper signatures, uses threading.
2026-07-05T05:17:20+00:00	ollama-cl/kimi-k2.5	`code_gen_long`	1	1	—	35.7s	—	The response is truncated mid-implementation, showing only a design discussion and the start of __init__/try_acquire methods. The class is incomplete: time_until_available method is missing, __main__
2026-07-05T05:17:43+00:00	ollama-cl/kimi-k2.5	`summarize`	1	4	—	11.0s	—	Meets the 5-bullet and ≤20-word constraints cleanly, but the final bullet is phrased as an open question ('...or implement immediately?') rather than a conclusive summary point, which is a minor devia
2026-07-05T05:17:45+00:00	ollama-cl/kimi-k2.5	`summarize`	2	5	✓	15.5s	—	Exactly 5 bullets, each under 20 words (all 9-11 words), captures concrete technical/strategic points (modular monolith, timeline/cost tradeoff, specific SLOs, service mesh deferral), no fluff or fill
2026-07-05T05:18:00+00:00	ollama-cl/kimi-k2.5	`reasoning_multistep`	2	5	✓	13.4s	—	All four steps are clearly shown and correct: head start distance (30 km), remaining distance (210 km), combined speed (140 km/h), and meeting time (11:00 AM). The verification confirms the answer.
2026-07-05T05:17:56+00:00	ollama-cl/kimi-k2.5	`reasoning_multistep`	1	5	✓	17.6s	—	All 4 steps are shown correctly with accurate calculations: Train A's 30 km head start, 210 km remaining distance, 140 km/h combined speed, and 1.5 hours to meet resulting in 11:00 AM. The final answe
2026-07-05T05:17:47+00:00	ollama-cl/kimi-k2.5	`creative_write`	2	0	—	27.6s	—	The response contains only the model's internal chain-of-thought and drafting notes, not the requested creative story. The actual output cuts off mid-word-count check and never delivers a complete 270
2026-07-05T05:17:46+00:00	ollama-cl/kimi-k2.5	`creative_write`	1	1	—	27.2s	—	The response is fatally compromised by visible meta-commentary and drafting notes ('Drafting:', 'Wait, that's 236 words...', 'Word count check: 1. The 2. lamp 3. tur') that should not appear in the fi
2026-07-05T05:18:00+00:00	ollama-cl/kimi-k2.5	`agentic_prompt`	1	1	—	42.7s	—	The response is truncated mid-sentence at 'Execute' in the plan array, making the JSON invalid and incomplete. Cannot verify risks (2-4 items), first_action (concrete/runnable), or dry_run_command (si
2026-07-05T05:18:03+00:00	ollama-cl/kimi-k2.5	`agentic_prompt`	2	0	—	38.5s	—	The model never produced the required JSON object. Instead, it output meta-reasoning/brainstorming text that was truncated mid-sentence. None of the four required keys (plan, risks, first_action, dry_
2026-07-05T05:18:14+00:00	ollama-cl/nemotron-3-ultra	`code_debug`	1	5	✓	40.5s	—	The response correctly identifies line 7 (`errors.sort(key=lambda m: m)`) as the buggy line, accurately explains that the sort key uses the message string instead of the timestamp (causing alphabetica
2026-07-05T05:18:15+00:00	ollama-cl/nemotron-3-ultra	`code_debug`	2	5	✓	68.4s	—	The response correctly identifies the buggy line (line 7: `errors.sort(key=lambda m: m)`), explains the root cause (sort key is the message string instead of timestamp, causing alphabetical rather tha
2026-07-05T05:18:47+00:00	ollama-cl/nemotron-3-ultra	`code_gen_long`	2	5	✓	58.9s	—	The code compiles, defines a RateLimiter class with all 3 required methods (__init__, try_acquire, time_until_available) having proper signatures with type hints, uses threading.Lock correctly (acquir
2026-07-05T05:18:20+00:00	ollama-cl/nemotron-3-ultra	`json_strict`	1	5	✓	102.7s	—	Response is valid JSON with exactly 4 objects. All required keys present. SKUs (CMG001, AUR042, TKL100, BOK512) are 6 uppercase chars. Dates are in YYYY-MM-DD format. All tags are lowercase.
2026-07-05T05:18:25+00:00	ollama-cl/nemotron-3-ultra	`json_strict`	2	5	✓	118.2s	—	Valid JSON array with 4 objects, all having consistent keys (sku, name, price_usd, in_stock, tags, added_on). SKUs are all 6 uppercase chars (CMG001, AUR042, TKL100, BOK512), dates are in YYYY-MM-DD f
2026-07-05T05:18:57+00:00	ollama-cl/nemotron-3-ultra	`summarize`	1	5	✓	94.6s	—	Exactly 5 bullets, each under 20 words, all information-dense with no fluff. Bullets coherently cover architecture choice, problem diagnosis, cost/timeline, target metrics, and deferred decision. Inte
2026-07-05T05:19:26+00:00	ollama-cl/nemotron-3-ultra	`summarize`	2	5	—	77.9s	—	Exactly 5 bullets, each well under 20 words, concise and information-dense with no fluff. Captures decision rationale, problem quantification, trade-offs, specific targets/metrics, and open question —
2026-07-05T05:20:28+00:00	ollama-cl/nemotron-3-ultra	`reasoning_multistep`	1	5	✓	33.6s	—	All 4 steps are shown correctly with proper calculations: 30 km head start, 210 km remaining distance, 140 km/h closing speed, and 1.5 hours to meet resulting in 11:00 AM. Final answer matches expecte
2026-07-05T05:18:46+00:00	ollama-cl/nemotron-3-ultra	`code_gen_long`	1	5	✓	152.8s	—	The code compiles correctly, defines a RateLimiter class with __init__, try_acquire, and time_until_available methods with proper signatures, uses threading.Lock for thread safety, has comprehensive t
2026-07-05T05:20:38+00:00	ollama-cl/nemotron-3-ultra	`reasoning_multistep`	2	5	✓	77.1s	—	All four steps are shown correctly: head start distance (30 km), remaining distance (210 km), combined closing speed (140 km/h), and meeting time calculation yielding 11:00 AM. Final answer matches ex
2026-07-05T05:21:23+00:00	qwen3.6:27b	`code_debug`	1	0	—	105.5s	—	no_response
2026-07-05T05:20:49+00:00	ollama-cl/nemotron-3-ultra	`agentic_prompt`	1	5	✓	138.6s	—	Valid JSON with plan (6 items, within 4-6 range), risks (4 items, within 2-4 range), a concrete and immediately runnable first_action (heredoc creating the script), and a single-line dry_run_command.
2026-07-05T05:21:03+00:00	ollama-cl/nemotron-3-ultra	`agentic_prompt`	2	5	✓	145.0s	—	Response is valid JSON, plan has exactly 6 items (within 4-6), risks has exactly 4 items (within 2-4), first_action 'ls -la /var/lib/chaos/ | head -20' is concrete and immediately runnable, and dry_ru
2026-07-05T05:19:52+00:00	ollama-cl/nemotron-3-ultra	`creative_write`	1	0	—	240.1s	—	no_response
2026-07-05T05:20:07+00:00	ollama-cl/nemotron-3-ultra	`creative_write`	2	0	—	240.1s	—	no_response
2026-07-05T05:21:57+00:00	qwen3.6:27b	`code_debug`	2	0	—	138.8s	—	no_response
2026-07-05T05:23:08+00:00	qwen3.6:27b	`json_strict`	1	0	—	135.8s	—	no_response
2026-07-05T05:23:13+00:00	qwen3.6:27b	`json_strict`	2	0	—	198.7s	—	no_response
2026-07-05T05:23:32+00:00	qwen3.6:27b	`code_gen_long`	1	0	—	240.1s	—	no_response
2026-07-05T05:23:52+00:00	qwen3.6:27b	`code_gen_long`	2	0	—	240.1s	—	no_response
2026-07-05T05:24:07+00:00	qwen3.6:27b	`summarize`	1	0	—	240.2s	—	no_response
2026-07-05T05:24:16+00:00	qwen3.6:27b	`summarize`	2	0	—	240.2s	—	no_response
2026-07-05T05:25:24+00:00	qwen3.6:27b	`creative_write`	1	0	—	240.1s	—	no_response
2026-07-05T05:26:32+00:00	qwen3.6:27b	`creative_write`	2	0	—	240.2s	—	no_response
2026-07-05T05:27:32+00:00	qwen3.6:27b	`reasoning_multistep`	1	0	—	240.1s	—	no_response
2026-07-05T05:27:52+00:00	qwen3.6:27b	`reasoning_multistep`	2	0	—	240.2s	—	no_response
2026-07-05T05:28:07+00:00	qwen3.6:27b	`agentic_prompt`	1	0	—	240.1s	—	no_response
2026-07-05T05:28:16+00:00	qwen3.6:27b	`agentic_prompt`	2	0	—	240.1s	—	no_response
2026-07-05T05:23:44+00:00	claude-fable	`code_debug`	2	3	—	18.6s	$0.0012	The root_cause correctly identifies that the sort key is the message string instead of the timestamp, and the fix correctly sorts by timestamp. However, the buggy_line is reported as 8 instead of the
2026-07-05T05:23:44+00:00	claude-fable	`code_debug`	1	4	✓	22.4s	$0.0012	The root_cause correctly identifies that the sort key is the message string rather than the timestamp, resulting in alphabetical instead of chronological ordering. The fix correctly sorts by the corre
2026-07-05T05:24:09+00:00	claude-fable	`json_strict`	1	5	✓	9.6s	$0.0010	Response is valid JSON with exactly 4 objects, all containing required keys. SKUs (CMG001, AUR042, TKL100, BOK512) are uppercase 6-character codes, dates follow YYYY-MM-DD format, and all tags are low
2026-07-05T05:24:12+00:00	claude-fable	`json_strict`	2	5	✓	12.3s	$0.0010	Response is valid JSON containing exactly 4 objects. All objects have consistent keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6-character codes (CMG001, AUR042, TKL100
2026-07-05T05:24:23+00:00	claude-fable	`code_gen_long`	1	5	✓	14.1s	$0.0008	All criteria met: the class compiles, has 3 methods (__init__, try_acquire, time_until_available) with proper type hints and signatures, uses threading.Lock with context manager, has type hints throug
2026-07-05T05:24:29+00:00	claude-fable	`code_gen_long`	2	5	✓	15.7s	$0.0008	Class compiles correctly, includes __init__, try_acquire, and time_until_available (3 methods) with proper type hints and signatures, uses threading.Lock with context manager in both public methods, a
2026-07-05T05:24:42+00:00	claude-fable	`summarize`	1	5	✓	7.0s	$0.0011	Exactly 5 bullets provided, each well under 20 words (max ~14), no fluff, captures Decision/Reason/Trade-off/Metrics/Open question with concrete specifics.
2026-07-05T05:24:50+00:00	claude-fable	`summarize`	2	5	✓	12.2s	$0.0011	Exactly 5 bullets, each within 20 words (10–15 words each), covering 5 distinct required points (Decision, Reason, Trade-off, Metrics, Open question) with concise, non-redundant content and no factual
2026-07-05T05:24:53+00:00	claude-fable	`creative_write`	1	0	—	34.2s	$0.0008	judge_parse_error: Let me evaluate the response against the criteria: 1. Word count: 270-330 words Let me count... "The fog has been on the water for three days when the receiver starts its chat
2026-07-05T05:25:37+00:00	claude-fable	`reasoning_multistep`	1	5	✓	7.4s	$0.0008	All 4 steps are shown correctly with clear arithmetic: head start (30 km), gap (210 km), combined speed (140 km/h), and meeting time (1.5 h). Final answer 11:00 AM matches the expected answer, and the
2026-07-05T05:25:47+00:00	claude-fable	`reasoning_multistep`	2	5	✓	10.0s	$0.0008	All 4 steps are shown correctly: Step 1 calculates 30 km head start, Step 2 computes 210 km gap, Step 3 uses combined speed 140 km/h, Step 4 divides to get 1.5 hours arriving at 11:00 AM. Final answer
2026-07-05T05:25:06+00:00	claude-fable	`creative_write`	2	5	✓	44.1s	$0.0008	Meets all criteria: ~283 words (within 270-330), first-person present tense throughout, named protagonist Aldous Finn, ends on a single line of dialogue ('Still here, love. Send again.'), and contains
2026-07-05T05:26:00+00:00	claude-fable	`agentic_prompt`	2	5	✓	18.6s	$0.0009	All criteria are met: valid JSON, plan has exactly 6 items (within 4-6), risks has exactly 4 items (within 2-4), first_action is a concrete single-shell-pipeline that is immediately runnable for EXIF
2026-07-05T05:25:58+00:00	claude-fable	`agentic_prompt`	1	5	✓	25.6s	$0.0009	Valid JSON object with all required keys. Plan has exactly 6 items (within 4-6 range), risks has exactly 4 items (within 2-4 range), first_action is a concrete, immediately runnable shell pipeline usi
2026-07-05T05:06:24+00:00	codex-gpt-5.5	`json_strict`	2	5	✓	10.0s	—	All criteria met: valid JSON, exactly 4 objects with consistent keys (sku, name, price_usd, in_stock, tags, added_on), all SKUs are uppercase 6 chars (CMG001, AUR042, TKL100, BOK512), all dates in YYY
2026-07-05T05:06:24+00:00	codex-gpt-5.5	`code_debug`	2	3	—	12.5s	—	The root_cause correctly identifies that messages are sorted alphabetically instead of by timestamp, and the fix correctly sorts qualifying entries by timestamp before extracting messages. However, bu
2026-07-05T05:06:24+00:00	codex-gpt-5.5	`json_strict`	1	5	✓	13.5s	—	Response is valid JSON with 4 objects, each containing all required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6-character format (CMG001, AUR042, TKL100, BOK512), a
2026-07-05T05:06:24+00:00	codex-gpt-5.5	`code_debug`	1	2	—	18.8s	—	The buggy line is identified as 8 instead of the expected line 7. The root cause explanation is too vague—it says timestamps were 'discarded' but doesn't specifically identify that the sort key is the
2026-07-05T05:06:40+00:00	codex-gpt-5.5	`summarize`	1	5	✓	4.6s	—	Exactly 5 bullets, each well under 20 words (longest is ~15 words). Each bullet captures a distinct required point (decision, reason, trade-off, evaluation criteria, open question) with no fluff or ex
2026-07-05T05:06:40+00:00	codex-gpt-5.5	`code_gen_long`	2	5	✓	14.7s	—	The code compiles, defines a RateLimiter class with __init__, try_acquire, and time_until_available methods having proper type hints and correct signatures, uses threading.Lock for thread safety via '
2026-07-05T05:06:49+00:00	codex-gpt-5.5	`summarize`	2	2	—	4.0s	—	Format compliance is correct (exactly 5 bullets, each ≤20 words), but the response appears to fabricate content about an unspecified topic rather than summarize given material. The last bullet is an o
2026-07-05T05:07:01+00:00	codex-gpt-5.5	`reasoning_multistep`	1	5	✓	6.8s	—	All 4 steps are shown correctly with proper calculations: head start distance (30 km), remaining gap (210 km), combined closing speed (140 km/h), and meeting time (1.5 hours after 9:30 AM = 11:00 AM).
2026-07-05T05:06:49+00:00	codex-gpt-5.5	`creative_write`	1	0	—	19.5s	—	judge_parse_error: Let me evaluate the model's response against the criteria: 1. Word count: 270-330 words Let me count the words: "My name is Elias Ward, and I keep the light on Saint Oran, a bl
2026-07-05T05:07:09+00:00	codex-gpt-5.5	`reasoning_multistep`	2	5	✓	6.8s	—	All 4 steps are shown correctly: head start distance (30 km), remaining distance (210 km), combined speed (140 km/h), and meeting time calculation yielding 11:00 AM. Final answer matches expected.
2026-07-05T05:06:39+00:00	codex-gpt-5.5	`code_gen_long`	1	2	—	34.0s	—	The response describes implementing a RateLimiter class but never actually shows the code itself. While it claims verification with py_compile and demonstrates output, the actual source code is not vi
2026-07-05T05:07:00+00:00	codex-gpt-5.5	`creative_write`	2	5	✓	18.8s	—	Meets all criteria: ~314 words (within 270-330), first-person present throughout ('I keep', 'I trim', 'I climb'), named protagonist 'Elias Ward', ends on single dialogue line ('Tell me which one, Mara
2026-07-05T05:07:18+00:00	codex-gpt-5.5	`agentic_prompt`	2	5	✓	12.2s	—	Valid JSON, plan has 6 items (within 4-6 range), risks has 4 items (within 2-4 range), first_action is a concrete immediately runnable shell command checking exiftool availability and counting JPG fil
2026-07-05T05:07:19+00:00	codex-gpt-5.4	`code_debug`	1	3	—	12.1s	—	The root_cause correctly identifies that messages are sorted alphabetically (by message string) instead of chronologically by timestamp. The fix correctly sorts log_entries by timestamp before extract
2026-07-05T05:07:35+00:00	codex-gpt-5.4	`json_strict`	1	5	✓	5.8s	—	Valid JSON array with exactly 4 objects, all containing the required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6 chars (CMG001, AUR042, TKL100, BOK512), all dates a
2026-07-05T05:07:37+00:00	codex-gpt-5.4	`json_strict`	2	5	✓	7.6s	—	Response is valid JSON with exactly 4 objects, all sharing consistent keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6 chars (CMG001, AUR042, TKL100, BOK512), all dates
2026-07-05T05:07:29+00:00	codex-gpt-5.4	`code_debug`	2	4	✓	10.0s	—	Root cause correctly identifies sorting by message strings alphabetically instead of by timestamp, and the fix is valid (restructures to sort entries by timestamp then extract messages). However, the
2026-07-05T05:07:17+00:00	codex-gpt-5.5	`agentic_prompt`	1	2	—	33.1s	—	The response is missing the required 'dry_run_command' field entirely; only 'plan', 'risks', and 'first_action' are provided. While the plan (5 items) and risks (4 items) meet the count criteria and f

Showing the first 200 of 973 trials. Full data: data/all_trials.json in the repo. Download as CSV (973 rows, includes the 1P-EST cost per trial). Per-trial text files: /raw/trials/ on the deployed site.
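To apply the page's filters (model, task, minimum score, pass only) offline, something like the following Python sketch may help. It assumes data/all_trials.json is a JSON array of objects with keys named `model`, `task`, `rep`, `score`, and `pass`; those key names are guesses based on the table columns, not a documented schema, so adjust them to match the actual file. `filter_trials` and `load_trials` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any, Optional

Trial = dict[str, Any]


def load_trials(path: str = "data/all_trials.json") -> list[Trial]:
    """Load all trials. Assumes the file is a JSON array of objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def filter_trials(
    trials: list[Trial],
    model: Optional[str] = None,
    task: Optional[str] = None,
    min_score: int = 0,
    pass_only: bool = False,
) -> list[Trial]:
    """Keep trials that satisfy every filter that is set."""
    kept = []
    for trial in trials:
        if model is not None and trial.get("model") != model:
            continue
        if task is not None and trial.get("task") != task:
            continue
        if trial.get("score", 0) < min_score:
            continue
        # Assumed: "pass" is stored as a boolean, not as the "✓" glyph.
        if pass_only and not trial.get("pass", False):
            continue
        kept.append(trial)
    return kept


# Three rows copied from the table on this page, using the assumed key names.
SAMPLE: list[Trial] = [
    {"model": "mistral/devstral", "task": "json_strict", "rep": 2, "score": 5, "pass": True},
    {"model": "mistral/devstral", "task": "creative_write", "rep": 1, "score": 0, "pass": False},
    {"model": "claude-fable", "task": "code_debug", "rep": 2, "score": 3, "pass": False},
]

if __name__ == "__main__":
    for trial in filter_trials(SAMPLE, model="mistral/devstral", pass_only=True):
        print(trial["model"], trial["task"], trial["score"])
```

With the real file, `filter_trials(load_trials(), task="code_debug", min_score=3)` would be the equivalent of setting the Task and Min score filters on the page, provided the key names match.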