Raw Data
Every trial, raw
This table lists all 973 trials, one row per trial. Use the filters to narrow by model, task, or score. The full response text for each trial is in the per-trial files at /raw/trials/. The COST (wire) column is the wire-level cost from the provider (for example, $0.0000 for skynet routes that bill on a subscription tier rather than per call); the comparable public-API cost is shown in the 1P-EST column of the /results/ leaderboard.
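For readers who want to apply the same model, task, and score filters offline, here is a minimal Python sketch that parses rows in the pipe-delimited layout of the table. The names `Trial`, `parse_rows`, `select`, and `judge_failed` are hypothetical helpers introduced for illustration and are not part of the site's tooling. The sketch assumes that `—` marks an empty cell and that a JUDGE cell equal to `no_response` or starting with `judge_parse_error` marks a trial without a usable judge verdict.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Trial:
    ts: str
    model: str
    task: str
    rep: int
    score: int
    passed: bool
    wall_s: float
    cost_usd: Optional[float]  # None when the cell holds the "—" placeholder
    judge: str


def parse_rows(lines: Iterable[str]) -> Iterator[Trial]:
    """Yield one Trial per data row; skip the header and separator rows."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("|"):
            continue
        # Split into at most 9 cells so any "|" inside JUDGE text survives.
        cells = [c.strip() for c in line.strip("|").split("|", 8)]
        if len(cells) < 9 or cells[0] == "TS" or set(cells[0]) <= set("-:"):
            continue
        yield Trial(
            ts=cells[0],
            model=cells[1],
            task=cells[2],
            rep=int(cells[3]),
            score=int(cells[4]),
            passed=cells[5] == "✓",
            wall_s=float(cells[6].rstrip("s")),
            cost_usd=None if cells[7] == "—" else float(cells[7].lstrip("$")),
            judge=cells[8],
        )


def select(trials: Iterable[Trial], model: Optional[str] = None,
           task: Optional[str] = None,
           min_score: Optional[int] = None) -> Iterator[Trial]:
    """Filter by exact model, exact task, and/or minimum score."""
    for t in trials:
        if model is not None and t.model != model:
            continue
        if task is not None and t.task != task:
            continue
        if min_score is not None and t.score < min_score:
            continue
        yield t


def judge_failed(t: Trial) -> bool:
    """True when the JUDGE cell records a failure marker, not a verdict."""
    return t.judge == "no_response" or t.judge.startswith("judge_parse_error")


# Illustrative rows in the table's layout (JUDGE text shortened by hand).
SAMPLE = [
    "| TS | MODEL | TASK | REP | SCORE | PASS | WALL | COST (wire) | JUDGE |",
    "|---|---|---|---|---|---|---|---|---|",
    "| 2026-07-05T05:05:28+00:00 | minimax-m3 | json_strict | 2 | 5 | ✓ | 5.6s | — | The response is valid JSON |",
    "| 2026-07-05T05:15:34+00:00 | mistral/mistral-large | code_gen_long | 1 | 0 | — | 5.2s | — | no_response |",
    "| 2026-07-05T05:05:38+00:00 | minimax-m3 | summarize | 1 | 0 | — | 8.7s | — | judge_parse_error: x | y |",
]

if __name__ == "__main__":
    for trial in select(parse_rows(SAMPLE), model="minimax-m3"):
        print(trial.task, trial.score, judge_failed(trial))
```

Keeping judge failures separate from scored trials may be worthwhile when aggregating, since a score of 0 on such a row reflects a missing verdict rather than a judged response.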
| TS | MODEL | TASK | REP | SCORE | PASS | WALL | COST (wire) | JUDGE |
|---|---|---|---|---|---|---|---|---|
| 2026-07-05T05:05:28+00:00 | minimax-m3 | json_strict | 2 | 5 | ✓ | 5.6s | — | The response is valid JSON containing exactly 4 objects, all with the required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6-character codes (CMG001, AUR042, TKL100, |
| 2026-07-05T05:05:28+00:00 | minimax-m3 | json_strict | 1 | 5 | ✓ | 7.8s | — | Response is valid JSON, contains exactly 4 objects, all required keys present (sku, name, price_usd, in_stock, tags, added_on), all SKUs are 6 uppercase characters (CMG001, AUR042, TKL100, BOK512), al |
| 2026-07-05T05:05:28+00:00 | minimax-m3 | code_debug | 1 | 3 | — | 15.0s | — | Root cause correctly identifies that the sort key is the message string instead of timestamp, and the fix correctly sorts by timestamp. However, the buggy_line is reported as 10 instead of the expecte |
| 2026-07-05T05:05:40+00:00 | minimax-m3 | summarize | 2 | 4 | ✓ | 7.6s | — | Response delivers exactly 5 bullets, all within 20-word limits (17, 16, 16, 13, 19 words), with no fluff. Covers distinct dimensions (decision, rationale, trade-off, metrics, open question) appropriat |
| 2026-07-05T05:05:28+00:00 | minimax-m3 | code_gen_long | 2 | 1 | — | 23.2s | — | The response is truncated mid-code at 'd' (likely starting 'def _refill'), so the actual deliverable class is not visible. While the planning demonstrates correct understanding (threading.Lock, type h |
| 2026-07-05T05:05:28+00:00 | minimax-m3 | code_gen_long | 1 | 3 | — | 23.3s | — | The class implementation is well-designed: it uses threading.Lock, has full type hints (including Final), includes all 3 expected methods (__init__, try_acquire, time_until_available) with correct sig |
| 2026-07-05T05:05:38+00:00 | minimax-m3 | summarize | 1 | 0 | — | 8.7s | — | judge_parse_error: ```json { "score": 3, "reason": "Format is correct (exactly 5 bullets, each ≤20 words), but the 5th bullet is |
| 2026-07-05T05:05:57+00:00 | minimax-m3 | reasoning_multistep | 1 | 5 | ✓ | 7.1s | — | All four steps are clearly shown and correct: head start distance (30 km), remaining gap (210 km), combined speed (140 km/h), and meeting time (11:00 AM). Final answer matches expected. |
| 2026-07-05T05:05:59+00:00 | minimax-m3 | reasoning_multistep | 2 | 5 | ✓ | 5.2s | — | All four steps are clearly shown with correct calculations, and the final answer of 11:00 AM matches the expected answer. Verification step also confirms correctness. |
| 2026-07-05T05:05:28+00:00 | minimax-m3 | code_debug | 2 | 3 | — | 33.1s | — | The root_cause correctly identifies that the sort key uses the message string instead of the timestamp, and the fix correctly sorts by timestamp. However, the buggy_line is reported as 8 instead of th |
| 2026-07-05T05:05:47+00:00 | minimax-m3 | creative_write | 1 | 0 | — | 26.1s | — | judge_parse_error: Let me carefully evaluate the model's response against the criteria: 1. **Word count 270-330**: The actual story content starts at "I count the wick trim by lamp..." and ends at ". |
| 2026-07-05T05:05:55+00:00 | minimax-m3 | creative_write | 2 | 1 | — | 26.5s | — | The response is a working draft filled with meta-commentary, planning notes, and multiple unfinished versions rather than a complete story. The protagonist is never named (only the deceased wife 'Marg |
| 2026-07-05T05:06:05+00:00 | minimax-m3 | agentic_prompt | 1 | 5 | ✓ | 20.5s | — | Valid JSON with plan (5 items, within 4-6), risks (4 items, within 2-4), a concrete and immediately runnable first_action (ls/find/version checks), and a dry_run_command that is a single shell line (u |
| 2026-07-05T05:06:05+00:00 | minimax-m3 | agentic_prompt | 2 | 3 | — | 22.8s | — | Plan has 5 items, risks has 4 items, and first_action is concrete and runnable — all good. However, the dry_run_command contains a literal newline (`\n` in the JSON string resolves to an actual newlin |
| 2026-07-05T05:06:20+00:00 | minimax-m2.7 | json_strict | 1 | 5 | ✓ | 25.9s | — | Response is valid JSON with 4 objects, all containing required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6-char (CMG001, AUR042, TKL100, BOK512), all dates are YYYY |
| 2026-07-05T05:06:26+00:00 | minimax-m2.7 | json_strict | 2 | 5 | ✓ | 34.3s | — | The response is valid JSON containing exactly 4 objects. All keys (sku, name, price_usd, in_stock, tags, added_on) are present in each object. All SKUs are uppercase 6-character strings (CMG001, AUR04 |
| 2026-07-05T05:06:55+00:00 | minimax-m2.7 | summarize | 1 | 4 | ✓ | 17.6s | — | Response provides exactly 5 bullets, each well under 20 words, with clear, terse phrasing and no fluff. Structure (recommendation, benefit, tradeoff, KPIs, open question) is coherent and decision-orie |
| 2026-07-05T05:06:05+00:00 | minimax-m2.7 | code_debug | 1 | 3 | — | 79.4s | — | The model correctly identifies the buggy line (`errors.sort(key=lambda m: m)`) and accurately explains the root cause (sort key is the message string, losing timestamp info) and provides valid fixes ( |
| 2026-07-05T05:07:08+00:00 | minimax-m2.7 | summarize | 2 | 5 | ✓ | 18.6s | — | Exactly 5 bullets provided, each under 20 words (ranging from 10-18 words), with no fluff and clear capture of 5 distinct points: recommendation, benefits, trade-offs, metrics, and open question. Form |
| 2026-07-05T05:06:06+00:00 | minimax-m2.7 | code_debug | 2 | 1 | — | 87.9s | — | The response correctly identifies the conceptual bug (sort key is message string instead of timestamp) but fails to produce a clean JSON output. It claims the buggy line is 8 rather than the correct l |
| 2026-07-05T05:06:30+00:00 | minimax-m2.7 | code_gen_long | 1 | 3 | — | 76.7s | — | The RateLimiter class is well-implemented with correct signatures, threading.Lock usage, and comprehensive type hints. However, the __main__ demo is truncated mid-line ('Exhaust tokens: try to acquire |
| 2026-07-05T05:06:43+00:00 | minimax-m2.7 | code_gen_long | 2 | 5 | ✓ | 69.4s | — | The code defines a RateLimiter class with proper __init__, try_acquire, and time_until_available methods (with helper _refill). It uses threading.Lock for thread safety (both public methods acquire th |
| 2026-07-05T05:07:22+00:00 | minimax-m2.7 | creative_write | 1 | 0 | — | 34.1s | — | judge_parse_error: Let me evaluate the response against the criteria: 1. **270-330 words**: Let me count... "The lamp hums its steady vigil as I climb the spiral stairs, each step worn smooth by dec |
| 2026-07-05T05:07:31+00:00 | minimax-m2.7 | reasoning_multistep | 1 | 5 | ✓ | 34.4s | — | All 4 steps are correctly shown: head start distance (30 km), separation (210 km), combined speed (140 km/h), time to meet (1.5 h), and final answer 11:00 AM matches expected. |
| 2026-07-05T05:07:38+00:00 | minimax-m2.7 | reasoning_multistep | 2 | 5 | ✓ | 33.1s | — | All four steps are shown correctly with proper reasoning: distance calculation, gap calculation, combined closing speed, and time-to-meet computation. Final answer of 11:00 AM matches the expected ans |
| 2026-07-05T05:08:08+00:00 | minimax-m2.7-highspeed | code_debug | 1 | 4 | ✓ | 11.6s | — | Correctly identifies line 7 as the buggy line, accurately explains that the sort key is the message string instead of the timestamp (causing alphabetical instead of chronological ordering), and propos |
| 2026-07-05T05:07:54+00:00 | minimax-m2.7 | agentic_prompt | 1 | 5 | ✓ | 29.4s | — | Response is valid JSON with plan (6 items, within 4-6 range), risks (4 items, within 2-4 range), a concrete and immediately runnable first_action ('which exiftool && python3 --version'), and a single- |
| 2026-07-05T05:08:00+00:00 | minimax-m2.7 | agentic_prompt | 2 | 5 | ✓ | 25.7s | — | Response is valid JSON. Plan has 5 items (within 4-6 range), risks has 4 items (within 2-4 range), first_action ('exiftool -ver && python3 ...') is concrete and immediately runnable, and dry_run_comma |
| 2026-07-05T05:08:14+00:00 | minimax-m2.7-highspeed | json_strict | 1 | 5 | ✓ | 16.0s | — | Valid JSON with exactly 4 objects, all 6 keys present in each, SKUs are uppercase 6-character (CMG001, AUR042, TKL100, BOK512), dates are in YYYY-MM-DD format, and all tags are lowercase including hyp |
| 2026-07-05T05:08:33+00:00 | minimax-m2.7-highspeed | summarize | 1 | 3 | — | 8.5s | — | Meets format criteria (5 bullets, each ≤20 words) and content is technically coherent on modular monolith vs microservices. However, the 5th bullet is phrased as an open question rather than a conclus |
| 2026-07-05T05:07:30+00:00 | minimax-m2.7 | creative_write | 2 | 0 | — | 80.0s | — | The response contains only planning notes and outline reasoning about how to write the story, but never actually produces the creative writing piece itself. None of the criteria (word count, first-per |
| 2026-07-05T05:08:27+00:00 | minimax-m2.7-highspeed | code_gen_long | 1 | 5 | ✓ | 27.5s | — | The code meets all stated criteria: class compiles, has 3 methods (__init__, try_acquire, time_until_available) with correct signatures, uses threading.Lock via threading.Lock() with context manager, |
| 2026-07-05T05:08:24+00:00 | minimax-m2.7-highspeed | json_strict | 2 | 5 | ✓ | 34.9s | — | All criteria are satisfied: the response parses as a valid JSON array of exactly 4 objects, all objects share consistent keys (sku, name, price_usd, in_stock, tags, added_on), all SKUs are uppercase 6 |
| 2026-07-05T05:08:49+00:00 | minimax-m2.7-highspeed | summarize | 2 | 4 | — | 7.1s | — | Format is perfect: exactly 5 bullets, each under 20 words (longest is 12 words), no fluff, concise. Content covers decision, benefits/cost trade-off, scalability trade-off, KPIs, and open question—app |
| 2026-07-05T05:08:09+00:00 | minimax-m2.7-highspeed | code_debug | 2 | 2 | — | 51.2s | — | The model correctly identifies the buggy code as the sort line using message string instead of timestamp, and provides valid fix options (storing tuples or entries). However, it fails to return the re |
| 2026-07-05T05:08:31+00:00 | minimax-m2.7-highspeed | code_gen_long | 2 | 5 | ✓ | 40.1s | — | Class is syntactically valid Python, contains all 3 methods (__init__, try_acquire, time_until_available) with proper type hints on parameters and return types, uses threading.Lock with context manage |
| 2026-07-05T05:09:06+00:00 | minimax-m2.7-highspeed | reasoning_multistep | 2 | 5 | ✓ | 17.1s | — | All 4 steps are shown correctly: head start distance (30 km), remaining gap (210 km), closing speed (140 km/h), and time to meet (1.5 hours from 9:30 AM = 11:00 AM). Final answer matches expected. |
| 2026-07-05T05:09:04+00:00 | minimax-m2.7-highspeed | reasoning_multistep | 1 | 5 | ✓ | 20.0s | — | All 4 steps are shown correctly: head start distance (30 km), remaining gap (210 km), closing speed (140 km/h), and meeting time (1.5 h from 9:30 AM). The final answer of 11:00 AM matches the expected |
| 2026-07-05T05:08:54+00:00 | minimax-m2.7-highspeed | creative_write | 1 | 0 | — | 28.7s | — | judge_parse_error: Let me evaluate the response against the criteria: 1. **Word count 270-330 words**: Let me count. "The Wavelength" - title, then the text starts with "My name is Ellis Vane..." Let |
| 2026-07-05T05:09:07+00:00 | minimax-m2.7-highspeed | agentic_prompt | 1 | 5 | ✓ | 27.4s | — | Valid JSON with all required fields. Plan has exactly 6 items (within 4-6 range), risks has 3 items (within 2-4 range), first_action is a concrete executable command checking tool availability, and dr |
| 2026-07-05T05:09:33+00:00 | zai-coding/glm-5.2 | json_strict | 1 | 5 | ✓ | 6.0s | — | The response is valid JSON with exactly 4 objects, all containing the required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are 6 uppercase chars (CMG001, AUR042, TKL100, BOK512), a |
| 2026-07-05T05:08:58+00:00 | minimax-m2.7-highspeed | creative_write | 2 | 1 | — | 46.1s | — | The response massively exceeds the 270-330 word limit (appears 500+ words), the protagonist is never named (only the dead wife 'Elaine' is named), the piece degenerates into repetitive loops and is tr |
| 2026-07-05T05:09:38+00:00 | zai-coding/glm-5.2 | json_strict | 2 | 5 | ✓ | 8.1s | — | The response is valid JSON with exactly 4 objects, all containing the required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are 6 uppercase characters (CMG001, AUR042, TKL100, BOK51 |
| 2026-07-05T05:09:42+00:00 | zai-coding/glm-5.2 | code_gen_long | 1 | 5 | ✓ | 19.4s | — | The code defines a RateLimiter class that compiles, has all required methods (__init__, try_acquire, time_until_available) with proper signatures and type hints, uses threading.Lock for thread safety |
| 2026-07-05T05:09:15+00:00 | minimax-m2.7-highspeed | agentic_prompt | 2 | 0 | — | 49.0s | — | The response is entirely internal deliberation/planning and never produces the required valid JSON object with the specified keys (plan, risks, first_action, dry_run_command). None of the criteria — v |
| 2026-07-05T05:09:49+00:00 | zai-coding/glm-5.2 | code_gen_long | 2 | 5 | ✓ | 14.6s | — | The class compiles, provides all three methods (__init__, try_acquire, time_until_available) with correct signatures and type hints, uses threading.Lock for thread-safety in both public methods, and i |
| 2026-07-05T05:09:50+00:00 | zai-coding/glm-5.2 | summarize | 1 | 4 | — | 26.9s | — | Response meets the structural criteria: exactly 5 bullets, each ≤20 words (ranging from 11-14 words), with clear categorical labels (Decision, Reason, Trade-off, Metric, Open Question) and no fluff. C |
| 2026-07-05T05:09:25+00:00 | zai-coding/glm-5.2 | code_debug | 1 | 2 | — | 55.0s | — | The root cause is correctly identified (sort key is message string, not timestamp, causing alphabetical instead of chronological order). However, the model incorrectly identifies the buggy line as lin |
| 2026-07-05T05:09:26+00:00 | zai-coding/glm-5.2 | code_debug | 2 | 1 | — | 62.6s | — | The model identifies the sort-by-message issue as the root cause, but incorrectly labels the buggy line as 8 instead of 7 as specified in the criteria. More critically, the response is truncated and n |
| 2026-07-05T05:10:22+00:00 | zai-coding/glm-5.2 | reasoning_multistep | 1 | 5 | ✓ | 12.0s | — | All 4 steps are shown and correctly computed: head start distance (30 km), remaining distance (210 km), combined speed (140 km/h), and meeting time (1.5 hours). Final answer is 11:00 AM as expected. |
| 2026-07-05T05:10:29+00:00 | zai-coding/glm-5.2 | reasoning_multistep | 2 | 5 | ✓ | 14.4s | — | All 4 steps are correctly shown: (1) head start distance = 30 km, (2) remaining distance = 210 km, (3) closing speed = 140 km/h, (4) time to meet = 1.5 hours, giving final answer of 11:00 AM which mat |
| 2026-07-05T05:10:05+00:00 | zai-coding/glm-5.2 | summarize | 2 | 5 | ✓ | 52.4s | — | Exactly 5 bullets provided, each under 20 words (longest is 14 words), covering all 5 required points (Decision, Reason, Trade-off, Metrics, Open Question) with concrete, specific details and no fluff |
| 2026-07-05T05:10:37+00:00 | zai-coding/glm-5.2 | agentic_prompt | 1 | 5 | ✓ | 27.1s | — | All criteria are met: valid JSON, plan has 5 items (within 4-6), risks has 4 items (within 2-4), first_action is a concrete, immediately runnable shell pipeline for env probing and sampling, and dry_r |
| 2026-07-05T05:10:10+00:00 | zai-coding/glm-5.2 | creative_write | 2 | 5 | ✓ | 55.1s | — | Story meets all criteria: approximately 290 words (within 270-330), first-person present tense throughout ('I climb', 'I set down', etc.), named protagonist 'Ellen Voss', ends on single line of dialog |
| 2026-07-05T05:10:10+00:00 | zai-coding/glm-5.2 | creative_write | 1 | 0 | — | 53.6s | — | judge_parse_error: The user wants me to judge a creative writing task. Let me evaluate the response against the criteria: 1. **270-330 words**: The response got cut off with the model doing word coun |
| 2026-07-05T05:11:12+00:00 | zai-coding/glm-5 | json_strict | 1 | 5 | ✓ | 15.1s | — | The response is a valid JSON array with exactly 4 objects, each containing all 6 required keys. All SKUs are uppercase 6-character strings (CMG001, AUR042, TKL100, BOK512), all dates follow YYYY-MM-DD |
| 2026-07-05T05:10:37+00:00 | zai-coding/glm-5.2 | agentic_prompt | 2 | 4 | — | 45.4s | — | Plan has 6 items (within 4-6 range), risks has 4 items (within 2-4 range), first_action is a concrete bash heredoc that is immediately runnable, and the JSON is valid. However, the 'dry_run_command' f |
| 2026-07-05T05:11:16+00:00 | zai-coding/glm-5 | json_strict | 2 | 5 | ✓ | 21.4s | — | Valid JSON with 4 objects, all required keys present, SKUs are uppercase 6 chars (CMG001, AUR042, TKL100, BOK512), dates in YYYY-MM-DD format, and all tags are lowercase. All criteria fully satisfied. |
| 2026-07-05T05:11:17+00:00 | zai-coding/glm-5 | code_gen_long | 1 | 5 | ✓ | 19.4s | — | The code compiles cleanly, defines a `RateLimiter` class with three well-typed methods (`__init__`, `try_acquire`, `time_until_available`), uses `threading.Lock` correctly via context managers in all |
| 2026-07-05T05:10:47+00:00 | zai-coding/glm-5 | code_debug | 1 | 4 | — | 52.0s | — | The root_cause correctly identifies that the sort key uses the message string instead of the timestamp, and the fix correctly sorts by `e['timestamp']`. However, the buggy_line is reported as 8 instea |
| 2026-07-05T05:11:30+00:00 | zai-coding/glm-5 | code_gen_long | 2 | 5 | ✓ | 18.5s | — | The code compiles, defines a RateLimiter class with all 3 required methods (__init__, try_acquire, time_until_available) with proper type hints, uses threading.Lock with context managers for thread-sa |
| 2026-07-05T05:11:41+00:00 | zai-coding/glm-5 | summarize | 2 | 5 | ✓ | 21.8s | — | Exactly 5 bullets, each ≤20 words (all 10-12 words), cleanly maps to the 5 required points (Core Decision, Main Reason, Main Trade-off, Evaluation Metric, Open Question) with no fluff. Structure and c |
| 2026-07-05T05:11:53+00:00 | zai-coding/glm-5 | reasoning_multistep | 1 | 5 | ✓ | 15.0s | — | All four steps are shown correctly: head start distance (30 km), remaining gap (210 km), combined closing speed (140 km/h), and meeting time calculation yielding 11:00 AM exactly. |
| 2026-07-05T05:11:02+00:00 | zai-coding/glm-5 | code_debug | 2 | 1 | — | 65.3s | — | The model never returned a proper JSON object as required, instead providing lengthy stream-of-consciousness analysis. It identified the buggy line as 8 rather than the expected 7, and while the root |
| 2026-07-05T05:11:33+00:00 | zai-coding/glm-5 | summarize | 1 | 5 | ✓ | 47.7s | — | Response delivers exactly 5 bullets, each well under 20 words (range 10-15), covering the five required architecture-decision points (Decision, Reason, Trade-off, Metric, Question) with no fluff or fa |
| 2026-07-05T05:12:09+00:00 | zai-coding/glm-5 | reasoning_multistep | 2 | 5 | ✓ | 17.6s | — | All 4 steps are shown correctly with accurate calculations: head start distance (30 km), remaining distance (210 km), combined closing speed (140 km/h), and meeting time (1.5 hours after 9:30 AM = 11: |
| 2026-07-05T05:11:42+00:00 | zai-coding/glm-5 | creative_write | 1 | 0 | — | 44.5s | — | judge_parse_error: Let me carefully evaluate this response against all the criteria: 1. **270-330 words**: The response includes the planning/drafting notes AND the story. The story itself, counting |
| 2026-07-05T05:12:10+00:00 | zai-coding/glm-5 | agentic_prompt | 1 | 5 | ✓ | 21.8s | — | Valid JSON output. Plan has exactly 5 items (within 4-6 range). Risks has exactly 4 items (within 2-4 range). first_action is a concrete, immediately runnable shell command with which/find/exiftool in |
| 2026-07-05T05:11:49+00:00 | zai-coding/glm-5 | creative_write | 2 | 5 | ✓ | 61.1s | — | The story meets all stated criteria: first-person present tense throughout, named protagonist (Sargeant), contains 'lamp', 'fog', and 'circuits', ends on the single dialogue line 'COME HOME', and word |
| 2026-07-05T05:12:35+00:00 | zai-coding/glm-4.7 | json_strict | 2 | 5 | ✓ | 34.8s | — | Response is valid JSON with exactly 4 objects, all containing the required 6 keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs (CMG001, AUR042, TKL100, BOK512) are uppercase 6-character |
| 2026-07-05T05:12:35+00:00 | zai-coding/glm-4.7 | json_strict | 1 | 5 | ✓ | 34.5s | — | Response is valid JSON with exactly 4 objects, all containing consistent keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6 characters (CMG001, AUR042, TKL100, BOK512), al |
| 2026-07-05T05:12:14+00:00 | zai-coding/glm-5 | agentic_prompt | 2 | 0 | — | 86.4s | — | The response never produces a JSON object — it is only truncated stream-of-consciousness deliberation ending mid-word ('pytho'). None of the required keys (plan, risks, first_action, dry_run_command) |
| 2026-07-05T05:12:57+00:00 | zai-coding/glm-4.7 | code_gen_long | 1 | 5 | ✓ | 43.1s | — | All criteria are met: the RateLimiter class compiles, has 3 methods (__init__, try_acquire, time_until_available) with proper type-annotated signatures, uses threading.Lock with context managers for t |
| 2026-07-05T05:13:13+00:00 | zai-coding/glm-4.7 | summarize | 1 | 4 | — | 32.8s | — | Format is correct: exactly 5 bullets, each well under 20 words (6-9 words each), concise with no fluff. Without the source material, full verification of factual accuracy and completeness of all 5 req |
| 2026-07-05T05:13:13+00:00 | zai-coding/glm-4.7 | code_gen_long | 2 | 5 | ✓ | 40.8s | — | All criteria are met: the RateLimiter class compiles, contains all three expected methods (__init__, try_acquire, time_until_available) with correct signatures, uses threading.Lock for thread safety, |
| 2026-07-05T05:13:43+00:00 | zai-coding/glm-4.7 | summarize | 2 | 4 | ✓ | 30.9s | — | Meets all format criteria perfectly: exactly 5 bullets, each well under 20 words, concise with no fluff. Content appears to cover modular monolith, deploy friction, microservices trade-off, monitoring |
| 2026-07-05T05:13:58+00:00 | zai-coding/glm-4.7 | reasoning_multistep | 1 | 5 | ✓ | 24.6s | — | All 4 steps are shown correctly: (1) Train A's head start distance = 30 km, (2) remaining distance = 210 km, (3) combined speed = 140 km/h, (4) time to meet = 1.5 hours, yielding 11:00 AM. Final answe |
| 2026-07-05T05:14:19+00:00 | zai-coding/glm-4.7 | reasoning_multistep | 2 | 5 | ✓ | 27.9s | — | All 4 steps are shown correctly with proper calculations: head start distance (30 km), remaining distance (210 km), combined speed (140 km/h), and meeting time (1.5 hours after 9:30 AM = 11:00 AM). Fi |
| 2026-07-05T05:12:27+00:00 | zai-coding/glm-4.7 | code_debug | 2 | 3 | — | 157.0s | — | The fix correctly sorts by timestamp before extracting messages, and the root cause captures the alphabetical-sorting issue. However, the buggy_line is reported as 8 instead of the expected 7, which i |
| 2026-07-05T05:15:10+00:00 | mistral/mistral-large | code_debug | 1 | 5 | ✓ | 1.5s | — | Correctly identifies buggy_line as 7, accurately explains root cause (sort key is message not timestamp), and provides a valid fix using `next()` with a generator expression to look up the timestamp f |
| 2026-07-05T05:15:13+00:00 | mistral/mistral-large | code_debug | 2 | 5 | ✓ | 1.5s | — | All three criteria are met: buggy_line is correctly identified as 7, root_cause correctly explains the sort key is the message string instead of the timestamp, and the fix correctly sorts by the times |
| 2026-07-05T05:15:17+00:00 | mistral/mistral-large | json_strict | 1 | 5 | ✓ | 3.7s | — | Valid JSON array with 4 complete objects. All SKUs (CMG001, AUR042, TKL100, BOK512) are 6 uppercase characters. All dates are in YYYY-MM-DD format. All tags are lowercase. All required keys (sku, name |
| 2026-07-05T05:15:26+00:00 | mistral/mistral-large | json_strict | 2 | 5 | ✓ | 3.7s | — | All criteria are met: response is valid JSON, contains exactly 4 objects, all keys (sku, name, price_usd, in_stock, tags, added_on) are present consistently, SKUs are all 6 uppercase alphanumeric char |
| 2026-07-05T05:15:34+00:00 | mistral/mistral-large | code_gen_long | 1 | 0 | — | 5.2s | — | no_response |
| 2026-07-05T05:15:39+00:00 | mistral/mistral-large | code_gen_long | 2 | 0 | — | 4.9s | — | no_response |
| 2026-07-05T05:15:44+00:00 | mistral/mistral-large | summarize | 1 | 0 | — | 5.9s | — | no_response |
| 2026-07-05T05:13:45+00:00 | zai-coding/glm-4.7 | creative_write | 1 | 0 | — | 119.6s | — | judge_parse_error: Let me evaluate this creative writing response against the criteria: 1. **270-330 words**: Let me count the words in the response. "The fog presses against the glass panes, a solid |
| 2026-07-05T05:15:50+00:00 | mistral/mistral-large | summarize | 2 | 0 | — | 5.2s | — | no_response |
| 2026-07-05T05:15:53+00:00 | mistral/mistral-large | creative_write | 1 | 0 | — | 5.3s | — | no_response |
| 2026-07-05T05:15:55+00:00 | mistral/mistral-large | creative_write | 2 | 0 | — | 5.4s | — | no_response |
| 2026-07-05T05:15:59+00:00 | mistral/mistral-large | reasoning_multistep | 1 | 5 | ✓ | 5.5s | — | All 4 steps are shown correctly: distance calculation (30 km), remaining distance (210 km), combined speed (140 km/h), and meeting time (11:00 AM). Final answer matches expected. |
| 2026-07-05T05:16:00+00:00 | mistral/mistral-large | reasoning_multistep | 2 | 5 | ✓ | 4.9s | — | All 4 steps are shown correctly: head start distance (30 km), remaining distance (210 km), combined closing speed (140 km/h), and meeting time calculation yielding 11:00 AM. |
| 2026-07-05T05:16:06+00:00 | mistral/mistral-large | agentic_prompt | 2 | 4 | ✓ | 8.2s | — | Valid JSON with 6 plan items (meets 4-6 range), 4 risks (meets 2-4 range), and a single-line dry_run_command. The first_action is concrete and runnable but is excessively verbose—it embeds an entire b |
| 2026-07-05T05:16:06+00:00 | mistral/mistral-large | agentic_prompt | 1 | 5 | ✓ | 9.9s | — | All criteria are met: the response is valid JSON, the plan has 6 items (within 4-6 range), risks has 4 items (within 2-4 range), the first_action is a concrete runnable command creating the test direc |
| 2026-07-05T05:12:27+00:00 | zai-coding/glm-4.7 | code_debug | 1 | 0 | — | 240.2s | — | no_response |
| 2026-07-05T05:16:22+00:00 | mistral/mistral-medium | code_debug | 2 | 5 | ✓ | 0.7s | — | The response correctly identifies line 7 as the buggy line, explains the root cause as sorting by message string instead of timestamp, and provides a fix that sorts by timestamp (the 'restructure to k |
| 2026-07-05T05:16:20+00:00 | mistral/mistral-medium | code_debug | 1 | 4 | ✓ | 0.8s | — | buggy_line correctly identifies line 7, root_cause correctly identifies sort by message string instead of timestamp. The fix `errors.sort(key=lambda e: e['timestamp'])` assumes `errors` contains dicts |
| 2026-07-05T05:16:29+00:00 | mistral/mistral-medium | json_strict | 2 | 5 | ✓ | 1.4s | — | Response is valid JSON containing exactly 4 objects. All objects have the required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are 6 uppercase characters (CMG001, AUR042, TKL100, B |
| 2026-07-05T05:16:27+00:00 | mistral/mistral-medium | json_strict | 1 | 5 | ✓ | 4.2s | — | Response is valid JSON with exactly 4 objects, all keys present (sku, name, price_usd, in_stock, tags, added_on), all SKUs are 6 uppercase chars (CMG001, AUR042, TKL100, BOK512), all dates are in YYYY |
| 2026-07-05T05:16:35+00:00 | mistral/mistral-medium | summarize | 1 | 5 | ✓ | 1.2s | — | Exactly 5 bullets, each well under 20 words, covering Decision, Reason, Trade-off, Metric, and Open question with no fluff or factual drift. |
| 2026-07-05T05:16:30+00:00 | mistral/mistral-medium | code_gen_long | 1 | 5 | ✓ | 9.9s | — | The code compiles correctly, defines a RateLimiter class with __init__, try_acquire, and time_until_available methods with proper type hints, uses threading.Lock for thread safety, and includes a __ma |
| 2026-07-05T05:14:48+00:00 | zai-coding/glm-4.7 | agentic_prompt | 2 | 3 | — | 110.3s | — | Plan has 4 items and risks has 4 items (both within range). first_action is a concrete, immediately runnable heredoc. However, the dry_run_command is clearly truncated mid-line ('python3 /tmp/organize |
| 2026-07-05T05:14:24+00:00 | zai-coding/glm-4.7 | agentic_prompt | 1 | 2 | — | 133.2s | — | The JSON appears truncated (no closing brace visible) and the required 'dry_run_command' field is entirely missing, violating a key criterion. While plan (6 items) and risks (4 items) meet the count r |
| 2026-07-05T05:13:52+00:00 | zai-coding/glm-4.7 | creative_write | 2 | 0 | — | 164.4s | — | judge_parse_error: Let me analyze the model's response against the criteria: 1. **Word count: 270-330 words** Let me count... "I wipe the condensation from the windowpane, but the glass remains opaq |
| 2026-07-05T05:16:42+00:00 | mistral/mistral-medium | summarize | 2 | 5 | ✓ | 0.9s | — | Response delivers exactly 5 bullets, each under 20 words, with clear structure (Decision, Reason, Trade-off, Metric, Open question). No fluff, concise, and each bullet is factually grounded with no dr |
| 2026-07-05T05:16:47+00:00 | mistral/mistral-medium | reasoning_multistep | 2 | 5 | ✓ | 2.2s | — | All 4 steps are correctly shown: head start distance (30 km), remaining gap (210 km), combined speed (140 km/h), and meeting time (1.5 hours after 9:30 AM = 11:00 AM). Final answer matches expected. |
| 2026-07-05T05:16:44+00:00 | mistral/mistral-medium | reasoning_multistep | 1 | 5 | ✓ | 5.7s | — | All 4 steps are shown correctly: head start distance (30 km), remaining gap (210 km), combined speed (140 km/h), and meeting time (1.5 hours after 9:30 AM = 11:00 AM). Final answer matches expected. |
| 2026-07-05T05:16:35+00:00 | mistral/mistral-medium | code_gen_long | 2 | 5 | ✓ | 3.8s | — | The code meets all criteria: it compiles, contains 3 methods (__init__, try_acquire, time_until_available) with proper type hints, uses threading.Lock with 'with self.lock:' blocks for thread safety, |
| 2026-07-05T05:16:43+00:00 | mistral/mistral-medium | creative_write | 1 | 0 | — | 3.6s | — | judge_parse_error: Let me evaluate this response against the criteria: 1. **Word count: 270-330 words**: Let me count. "I wake to the lamp's steady pulse, the fog pressed against the glass like a br |
| 2026-07-05T05:16:48+00:00 | mistral/mistral-medium | agentic_prompt | 1 | 5 | ✓ | 5.3s | — | Valid JSON with all required fields. Plan has 5 items (within 4-6), risks has 4 items (within 2-4), first_action is a concrete heredoc command that immediately writes a runnable Python script, and dry |
| 2026-07-05T05:16:52+00:00 | mistral/devstral | code_debug | 1 | 3 | — | 1.2s | — | Correctly identifies buggy_line as 7 and accurately states the root cause (sort key is message string instead of timestamp). However, the fix `errors.sort(key=lambda e: e['timestamp'])` is broken as w |
| 2026-07-05T05:16:44+00:00 | mistral/mistral-medium | creative_write | 2 | 0 | — | 5.0s | — | judge_parse_error: Let me evaluate this response against the criteria: 1. **Word count 270-330**: Let me count the words in the response. "I wake to the lamp's steady pulse, the fog pressed against t |
| 2026-07-05T05:16:54+00:00 | mistral/devstral | code_debug | 2 | 2 | — | 1.1s | — | root_cause correctly identifies that the sort key is the message string instead of the timestamp (matches expected). However, buggy_line is reported as 8 instead of 7. The fix `errors.sort(key=lambda |
| 2026-07-05T05:16:51+00:00 | mistral/mistral-medium | agentic_prompt | 2 | 5 | ✓ | 5.0s | — | All criteria are met: valid JSON, plan has exactly 5 items (within 4-6 range), risks has exactly 4 items (within 2-4 range), first_action is a concrete heredoc that creates an immediately runnable Pyt |
| 2026-07-05T05:16:57+00:00 | mistral/devstral | json_strict | 2 | 5 | ✓ | 3.4s | — | Valid JSON with exactly 4 objects, all containing sku/name/price_usd/in_stock/tags/added_on keys. All SKUs are uppercase 6 chars (CMG001, AUR042, TKL100, BOK512). All dates are YYYY-MM-DD format. All |
| 2026-07-05T05:17:00+00:00 | mistral/devstral | summarize | 1 | 5 | ✓ | 1.9s | — | Exactly 5 bullets, each well under 20 words (ranging 9-13 words). Captures all five standard architecture decision summary points (Decision, Reason, Trade-off, Metric, Open question) with concrete spe |
| 2026-07-05T05:17:00+00:00 | mistral/devstral | summarize | 2 | 4 | ✓ | 2.2s | — | Format is perfect: exactly 5 bullets, all well under 20 words, no fluff. Covers a logical decision/reason/trade-off/metric/open-question structure. Cannot fully verify 'captures all 5 required points' |
| 2026-07-05T05:16:54+00:00 | mistral/devstral | json_strict | 1 | 5 | ✓ | 10.1s | — | All criteria are met: valid JSON, exactly 4 objects, all SKUs are 6 uppercase characters (CMG001, AUR042, TKL100, BOK512), all dates follow YYYY-MM-DD format, and all tags are lowercase. |
| 2026-07-05T05:17:00+00:00 | mistral/devstral | code_gen_long | 2 | 5 | ✓ | 11.0s | — | All criteria are met: the class compiles, has 3 main methods (__init__, try_acquire, time_until_available) with proper signatures, uses threading.Lock for thread safety (self._lock with 'with' context |
| 2026-07-05T05:17:07+00:00 | mistral/devstral | reasoning_multistep | 1 | 5 | ✓ | 4.9s | — | All 4 steps are correctly shown: head start distance (30 km), distance between trains at 9:30 AM (210 km), combined closing speed (140 km/h), and meeting time (1.5 hours after 9:30 AM = 11:00 AM). The |
| 2026-07-05T05:17:08+00:00 | mistral/devstral | reasoning_multistep | 2 | 5 | ✓ | 4.7s | — | All 4 steps are clearly shown with correct calculations: head start distance (30 km), remaining gap (210 km), combined speed (140 km/h), and time to meet (1.5 hours after 9:30 AM = 11:00 AM). Final an |
| 2026-07-05T05:16:59+00:00 | mistral/devstral | code_gen_long | 1 | 5 | ✓ | 10.9s | — | The code meets all stated criteria: the class compiles, includes all 3 required methods (__init__, try_acquire, time_until_available) with correct signatures, uses threading.Lock for thread safety, ha |
| 2026-07-05T05:17:05+00:00 | mistral/devstral | creative_write | 1 | 0 | — | 6.6s | — | judge_parse_error: Let me evaluate this response against the criteria: 1. **270-330 words**: Let me count... The response appears to be around 230-250 words. Let me recount carefully. "I wake to the |
| 2026-07-05T05:17:14+00:00 | mistral/devstral | agentic_prompt | 1 | 5 | ✓ | 2.9s | — | All criteria are met: valid JSON output, plan contains exactly 5 items (within 4-6 range), risks contains exactly 4 items (within 2-4 range), first_action is a concrete immediately runnable shell comm |
| 2026-07-05T05:17:15+00:00 | mistral/devstral | agentic_prompt | 2 | 5 | ✓ | 2.9s | — | Valid JSON with plan (5 items, within 4-6 range), risks (4 items, within 2-4 range), first_action is a concrete and immediately runnable cp command, and dry_run_command is a single shell line. |
| 2026-07-05T05:17:07+00:00 | mistral/devstral | creative_write | 2 | 0 | — | 3.9s | — | judge_parse_error: Let me evaluate this creative writing piece against the criteria: 1. **Word count: 270-330 words** Let me count: "I wake to the lamp's steady pulse, the fog pressed against the gla |
| 2026-07-05T05:17:15+00:00 | ollama-cl/kimi-k2.5 | code_debug | 1 | 2 | — | 20.9s | — | The model correctly identifies the conceptual root cause (sorting alphabetically by message string instead of by timestamp), but is confused about line numbers, wavering between line 6, 8, and 9 when |
| 2026-07-05T05:17:19+00:00 | ollama-cl/kimi-k2.5 | json_strict | 2 | 0 | — | 22.8s | — | The response contains only the model's internal reasoning/planning text and never actually emits the required JSON array. It is truncated mid-construction ('name': 'Bl...') with no parseable JSON, no |
| 2026-07-05T05:17:16+00:00 | ollama-cl/kimi-k2.5 | code_debug | 2 | 3 | ✓ | 19.8s | — | The model correctly identifies the root cause (sort key is the message string instead of the timestamp) and proposes a valid fix (store tuples of (timestamp, message) then sort). However, it states th |
| 2026-07-05T05:17:19+00:00 | ollama-cl/kimi-k2.5 | json_strict | 1 | 2 | — | 22.4s | — | Response is truncated mid-stream after only 2 complete objects and a partial third ('sku' alone). Cannot verify the required 4 objects. Visible portions show correct SKU format (CMG001, AUR042 - 6 upp |
| 2026-07-05T05:17:20+00:00 | ollama-cl/kimi-k2.5 | code_gen_long | 2 | 5 | ✓ | 30.4s | — | The code meets all stated criteria: the RateLimiter class compiles cleanly, includes __init__, try_acquire, and time_until_available (plus a helper _add_tokens) with proper signatures, uses threading. |
| 2026-07-05T05:17:20+00:00 | ollama-cl/kimi-k2.5 | code_gen_long | 1 | 1 | — | 35.7s | — | The response is truncated mid-implementation, showing only a design discussion and the start of __init__/try_acquire methods. The class is incomplete: time_until_available method is missing, __main__ |
| 2026-07-05T05:17:43+00:00 | ollama-cl/kimi-k2.5 | summarize | 1 | 4 | — | 11.0s | — | Meets the 5-bullet and ≤20-word constraints cleanly, but the final bullet is phrased as an open question ('...or implement immediately?') rather than a conclusive summary point, which is a minor devia |
| 2026-07-05T05:17:45+00:00 | ollama-cl/kimi-k2.5 | summarize | 2 | 5 | ✓ | 15.5s | — | Exactly 5 bullets, each under 20 words (all 9-11 words), captures concrete technical/strategic points (modular monolith, timeline/cost tradeoff, specific SLOs, service mesh deferral), no fluff or fill |
| 2026-07-05T05:18:00+00:00 | ollama-cl/kimi-k2.5 | reasoning_multistep | 2 | 5 | ✓ | 13.4s | — | All four steps are clearly shown and correct: head start distance (30 km), remaining distance (210 km), combined speed (140 km/h), and meeting time (11:00 AM). The verification confirms the answer. |
| 2026-07-05T05:17:56+00:00 | ollama-cl/kimi-k2.5 | reasoning_multistep | 1 | 5 | ✓ | 17.6s | — | All 4 steps are shown correctly with accurate calculations: Train A's 30 km head start, 210 km remaining distance, 140 km/h combined speed, and 1.5 hours to meet resulting in 11:00 AM. The final answe |
| 2026-07-05T05:17:47+00:00 | ollama-cl/kimi-k2.5 | creative_write | 2 | 0 | — | 27.6s | — | The response contains only the model's internal chain-of-thought and drafting notes, not the requested creative story. The actual output cuts off mid-word-count check and never delivers a complete 270 |
| 2026-07-05T05:17:46+00:00 | ollama-cl/kimi-k2.5 | creative_write | 1 | 1 | — | 27.2s | — | The response is fatally compromised by visible meta-commentary and drafting notes ('Drafting:', 'Wait, that's 236 words...', 'Word count check: 1. The 2. lamp 3. tur') that should not appear in the fi |
| 2026-07-05T05:18:00+00:00 | ollama-cl/kimi-k2.5 | agentic_prompt | 1 | 1 | — | 42.7s | — | The response is truncated mid-sentence at 'Execute' in the plan array, making the JSON invalid and incomplete. Cannot verify risks (2-4 items), first_action (concrete/runnable), or dry_run_command (si |
| 2026-07-05T05:18:03+00:00 | ollama-cl/kimi-k2.5 | agentic_prompt | 2 | 0 | — | 38.5s | — | The model never produced the required JSON object. Instead, it output meta-reasoning/brainstorming text that was truncated mid-sentence. None of the four required keys (plan, risks, first_action, dry_ |
| 2026-07-05T05:18:14+00:00 | ollama-cl/nemotron-3-ultra | code_debug | 1 | 5 | ✓ | 40.5s | — | The response correctly identifies line 7 (`errors.sort(key=lambda m: m)`) as the buggy line, accurately explains that the sort key uses the message string instead of the timestamp (causing alphabetica |
| 2026-07-05T05:18:15+00:00 | ollama-cl/nemotron-3-ultra | code_debug | 2 | 5 | ✓ | 68.4s | — | The response correctly identifies the buggy line (line 7: `errors.sort(key=lambda m: m)`), explains the root cause (sort key is the message string instead of timestamp, causing alphabetical rather tha |
| 2026-07-05T05:18:47+00:00 | ollama-cl/nemotron-3-ultra | code_gen_long | 2 | 5 | ✓ | 58.9s | — | The code compiles, defines a RateLimiter class with all 3 required methods (__init__, try_acquire, time_until_available) having proper signatures with type hints, uses threading.Lock correctly (acquir |
| 2026-07-05T05:18:20+00:00 | ollama-cl/nemotron-3-ultra | json_strict | 1 | 5 | ✓ | 102.7s | — | Response is valid JSON with exactly 4 objects. All required keys present. SKUs (CMG001, AUR042, TKL100, BOK512) are 6 uppercase chars. Dates are in YYYY-MM-DD format. All tags are lowercase. |
| 2026-07-05T05:18:25+00:00 | ollama-cl/nemotron-3-ultra | json_strict | 2 | 5 | ✓ | 118.2s | — | Valid JSON array with 4 objects, all having consistent keys (sku, name, price_usd, in_stock, tags, added_on). SKUs are all 6 uppercase chars (CMG001, AUR042, TKL100, BOK512), dates are in YYYY-MM-DD f |
| 2026-07-05T05:18:57+00:00 | ollama-cl/nemotron-3-ultra | summarize | 1 | 5 | ✓ | 94.6s | — | Exactly 5 bullets, each under 20 words, all information-dense with no fluff. Bullets coherently cover architecture choice, problem diagnosis, cost/timeline, target metrics, and deferred decision. Inte |
| 2026-07-05T05:19:26+00:00 | ollama-cl/nemotron-3-ultra | summarize | 2 | 5 | — | 77.9s | — | Exactly 5 bullets, each well under 20 words, concise and information-dense with no fluff. Captures decision rationale, problem quantification, trade-offs, specific targets/metrics, and open question — |
| 2026-07-05T05:20:28+00:00 | ollama-cl/nemotron-3-ultra | reasoning_multistep | 1 | 5 | ✓ | 33.6s | — | All 4 steps are shown correctly with proper calculations: 30 km head start, 210 km remaining distance, 140 km/h closing speed, and 1.5 hours to meet resulting in 11:00 AM. Final answer matches expecte |
| 2026-07-05T05:18:46+00:00 | ollama-cl/nemotron-3-ultra | code_gen_long | 1 | 5 | ✓ | 152.8s | — | The code compiles correctly, defines a RateLimiter class with __init__, try_acquire, and time_until_available methods with proper signatures, uses threading.Lock for thread safety, has comprehensive t |
| 2026-07-05T05:20:38+00:00 | ollama-cl/nemotron-3-ultra | reasoning_multistep | 2 | 5 | ✓ | 77.1s | — | All four steps are shown correctly: head start distance (30 km), remaining distance (210 km), combined closing speed (140 km/h), and meeting time calculation yielding 11:00 AM. Final answer matches ex |
| 2026-07-05T05:21:23+00:00 | qwen3.6:27b | code_debug | 1 | 0 | — | 105.5s | — | no_response |
| 2026-07-05T05:20:49+00:00 | ollama-cl/nemotron-3-ultra | agentic_prompt | 1 | 5 | ✓ | 138.6s | — | Valid JSON with plan (6 items, within 4-6 range), risks (4 items, within 2-4 range), a concrete and immediately runnable first_action (heredoc creating the script), and a single-line dry_run_command. |
| 2026-07-05T05:21:03+00:00 | ollama-cl/nemotron-3-ultra | agentic_prompt | 2 | 5 | ✓ | 145.0s | — | Response is valid JSON, plan has exactly 6 items (within 4-6), risks has exactly 4 items (within 2-4), first_action 'ls -la /var/lib/chaos/ \| head -20' is concrete and immediately runnable, and dry_ru |
| 2026-07-05T05:19:52+00:00 | ollama-cl/nemotron-3-ultra | creative_write | 1 | 0 | — | 240.1s | — | no_response |
| 2026-07-05T05:20:07+00:00 | ollama-cl/nemotron-3-ultra | creative_write | 2 | 0 | — | 240.1s | — | no_response |
| 2026-07-05T05:21:57+00:00 | qwen3.6:27b | code_debug | 2 | 0 | — | 138.8s | — | no_response |
| 2026-07-05T05:23:08+00:00 | qwen3.6:27b | json_strict | 1 | 0 | — | 135.8s | — | no_response |
| 2026-07-05T05:23:13+00:00 | qwen3.6:27b | json_strict | 2 | 0 | — | 198.7s | — | no_response |
| 2026-07-05T05:23:32+00:00 | qwen3.6:27b | code_gen_long | 1 | 0 | — | 240.1s | — | no_response |
| 2026-07-05T05:23:52+00:00 | qwen3.6:27b | code_gen_long | 2 | 0 | — | 240.1s | — | no_response |
| 2026-07-05T05:24:07+00:00 | qwen3.6:27b | summarize | 1 | 0 | — | 240.2s | — | no_response |
| 2026-07-05T05:24:16+00:00 | qwen3.6:27b | summarize | 2 | 0 | — | 240.2s | — | no_response |
| 2026-07-05T05:25:24+00:00 | qwen3.6:27b | creative_write | 1 | 0 | — | 240.1s | — | no_response |
| 2026-07-05T05:26:32+00:00 | qwen3.6:27b | creative_write | 2 | 0 | — | 240.2s | — | no_response |
| 2026-07-05T05:27:32+00:00 | qwen3.6:27b | reasoning_multistep | 1 | 0 | — | 240.1s | — | no_response |
| 2026-07-05T05:27:52+00:00 | qwen3.6:27b | reasoning_multistep | 2 | 0 | — | 240.2s | — | no_response |
| 2026-07-05T05:28:07+00:00 | qwen3.6:27b | agentic_prompt | 1 | 0 | — | 240.1s | — | no_response |
| 2026-07-05T05:28:16+00:00 | qwen3.6:27b | agentic_prompt | 2 | 0 | — | 240.1s | — | no_response |
| 2026-07-05T05:23:44+00:00 | claude-fable | code_debug | 2 | 3 | — | 18.6s | $0.0012 | The root_cause correctly identifies that the sort key is the message string instead of the timestamp, and the fix correctly sorts by timestamp. However, the buggy_line is reported as 8 instead of the |
| 2026-07-05T05:23:44+00:00 | claude-fable | code_debug | 1 | 4 | ✓ | 22.4s | $0.0012 | The root_cause correctly identifies that the sort key is the message string rather than the timestamp, resulting in alphabetical instead of chronological ordering. The fix correctly sorts by the corre |
| 2026-07-05T05:24:09+00:00 | claude-fable | json_strict | 1 | 5 | ✓ | 9.6s | $0.0010 | Response is valid JSON with exactly 4 objects, all containing required keys. SKUs (CMG001, AUR042, TKL100, BOK512) are uppercase 6-character codes, dates follow YYYY-MM-DD format, and all tags are low |
| 2026-07-05T05:24:12+00:00 | claude-fable | json_strict | 2 | 5 | ✓ | 12.3s | $0.0010 | Response is valid JSON containing exactly 4 objects. All objects have consistent keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6-character codes (CMG001, AUR042, TKL100 |
| 2026-07-05T05:24:23+00:00 | claude-fable | code_gen_long | 1 | 5 | ✓ | 14.1s | $0.0008 | All criteria met: the class compiles, has 3 methods (__init__, try_acquire, time_until_available) with proper type hints and signatures, uses threading.Lock with context manager, has type hints throug |
| 2026-07-05T05:24:29+00:00 | claude-fable | code_gen_long | 2 | 5 | ✓ | 15.7s | $0.0008 | Class compiles correctly, includes __init__, try_acquire, and time_until_available (3 methods) with proper type hints and signatures, uses threading.Lock with context manager in both public methods, a |
| 2026-07-05T05:24:42+00:00 | claude-fable | summarize | 1 | 5 | ✓ | 7.0s | $0.0011 | Exactly 5 bullets provided, each well under 20 words (max ~14), no fluff, captures Decision/Reason/Trade-off/Metrics/Open question with concrete specifics. |
| 2026-07-05T05:24:50+00:00 | claude-fable | summarize | 2 | 5 | ✓ | 12.2s | $0.0011 | Exactly 5 bullets, each within 20 words (10–15 words each), covering 5 distinct required points (Decision, Reason, Trade-off, Metrics, Open question) with concise, non-redundant content and no factual |
| 2026-07-05T05:24:53+00:00 | claude-fable | creative_write | 1 | 0 | — | 34.2s | $0.0008 | judge_parse_error: Let me evaluate the response against the criteria: 1. **Word count: 270-330 words** Let me count... "The fog has been on the water for three days when the receiver starts its chat |
| 2026-07-05T05:25:37+00:00 | claude-fable | reasoning_multistep | 1 | 5 | ✓ | 7.4s | $0.0008 | All 4 steps are shown correctly with clear arithmetic: head start (30 km), gap (210 km), combined speed (140 km/h), and meeting time (1.5 h). Final answer 11:00 AM matches the expected answer, and the |
| 2026-07-05T05:25:47+00:00 | claude-fable | reasoning_multistep | 2 | 5 | ✓ | 10.0s | $0.0008 | All 4 steps are shown correctly: Step 1 calculates 30 km head start, Step 2 computes 210 km gap, Step 3 uses combined speed 140 km/h, Step 4 divides to get 1.5 hours arriving at 11:00 AM. Final answer |
| 2026-07-05T05:25:06+00:00 | claude-fable | creative_write | 2 | 5 | ✓ | 44.1s | $0.0008 | Meets all criteria: ~283 words (within 270-330), first-person present tense throughout, named protagonist Aldous Finn, ends on a single line of dialogue ('Still here, love. Send again.'), and contains |
| 2026-07-05T05:26:00+00:00 | claude-fable | agentic_prompt | 2 | 5 | ✓ | 18.6s | $0.0009 | All criteria are met: valid JSON, plan has exactly 6 items (within 4-6), risks has exactly 4 items (within 2-4), first_action is a concrete single-shell-pipeline that is immediately runnable for EXIF |
| 2026-07-05T05:25:58+00:00 | claude-fable | agentic_prompt | 1 | 5 | ✓ | 25.6s | $0.0009 | Valid JSON object with all required keys. Plan has exactly 6 items (within 4-6 range), risks has exactly 4 items (within 2-4 range), first_action is a concrete, immediately runnable shell pipeline usi |
| 2026-07-05T05:06:24+00:00 | codex-gpt-5.5 | json_strict | 2 | 5 | ✓ | 10.0s | — | All criteria met: valid JSON, exactly 4 objects with consistent keys (sku, name, price_usd, in_stock, tags, added_on), all SKUs are uppercase 6 chars (CMG001, AUR042, TKL100, BOK512), all dates in YYY |
| 2026-07-05T05:06:24+00:00 | codex-gpt-5.5 | code_debug | 2 | 3 | — | 12.5s | — | The root_cause correctly identifies that messages are sorted alphabetically instead of by timestamp, and the fix correctly sorts qualifying entries by timestamp before extracting messages. However, bu |
| 2026-07-05T05:06:24+00:00 | codex-gpt-5.5 | json_strict | 1 | 5 | ✓ | 13.5s | — | Response is valid JSON with 4 objects, each containing all required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6-character format (CMG001, AUR042, TKL100, BOK512), a |
| 2026-07-05T05:06:24+00:00 | codex-gpt-5.5 | code_debug | 1 | 2 | — | 18.8s | — | The buggy line is identified as 8 instead of the expected line 7. The root cause explanation is too vague—it says timestamps were 'discarded' but doesn't specifically identify that the sort key is the |
| 2026-07-05T05:06:40+00:00 | codex-gpt-5.5 | summarize | 1 | 5 | ✓ | 4.6s | — | Exactly 5 bullets, each well under 20 words (longest is ~15 words). Each bullet captures a distinct required point (decision, reason, trade-off, evaluation criteria, open question) with no fluff or ex |
| 2026-07-05T05:06:40+00:00 | codex-gpt-5.5 | code_gen_long | 2 | 5 | ✓ | 14.7s | — | The code compiles, defines a RateLimiter class with __init__, try_acquire, and time_until_available methods having proper type hints and correct signatures, uses threading.Lock for thread safety via ' |
| 2026-07-05T05:06:49+00:00 | codex-gpt-5.5 | summarize | 2 | 2 | — | 4.0s | — | Format compliance is correct (exactly 5 bullets, each ≤20 words), but the response appears to fabricate content about an unspecified topic rather than summarize given material. The last bullet is an o |
| 2026-07-05T05:07:01+00:00 | codex-gpt-5.5 | reasoning_multistep | 1 | 5 | ✓ | 6.8s | — | All 4 steps are shown correctly with proper calculations: head start distance (30 km), remaining gap (210 km), combined closing speed (140 km/h), and meeting time (1.5 hours after 9:30 AM = 11:00 AM). |
| 2026-07-05T05:06:49+00:00 | codex-gpt-5.5 | creative_write | 1 | 0 | — | 19.5s | — | judge_parse_error: Let me evaluate the model's response against the criteria: 1. **Word count: 270-330 words** Let me count the words: "My name is Elias Ward, and I keep the light on Saint Oran, a bl |
| 2026-07-05T05:07:09+00:00 | codex-gpt-5.5 | reasoning_multistep | 2 | 5 | ✓ | 6.8s | — | All 4 steps are shown correctly: head start distance (30 km), remaining distance (210 km), combined speed (140 km/h), and meeting time calculation yielding 11:00 AM. Final answer matches expected. |
| 2026-07-05T05:06:39+00:00 | codex-gpt-5.5 | code_gen_long | 1 | 2 | — | 34.0s | — | The response describes implementing a RateLimiter class but never actually shows the code itself. While it claims verification with py_compile and demonstrates output, the actual source code is not vi |
| 2026-07-05T05:07:00+00:00 | codex-gpt-5.5 | creative_write | 2 | 5 | ✓ | 18.8s | — | Meets all criteria: ~314 words (within 270-330), first-person present throughout ('I keep', 'I trim', 'I climb'), named protagonist 'Elias Ward', ends on single dialogue line ('Tell me which one, Mara |
| 2026-07-05T05:07:18+00:00 | codex-gpt-5.5 | agentic_prompt | 2 | 5 | ✓ | 12.2s | — | Valid JSON, plan has 6 items (within 4-6 range), risks has 4 items (within 2-4 range), first_action is a concrete immediately runnable shell command checking exiftool availability and counting JPG fil |
| 2026-07-05T05:07:19+00:00 | codex-gpt-5.4 | code_debug | 1 | 3 | — | 12.1s | — | The root_cause correctly identifies that messages are sorted alphabetically (by message string) instead of chronologically by timestamp. The fix correctly sorts log_entries by timestamp before extract |
| 2026-07-05T05:07:35+00:00 | codex-gpt-5.4 | json_strict | 1 | 5 | ✓ | 5.8s | — | Valid JSON array with exactly 4 objects, all containing the required keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6 chars (CMG001, AUR042, TKL100, BOK512), all dates a |
| 2026-07-05T05:07:37+00:00 | codex-gpt-5.4 | json_strict | 2 | 5 | ✓ | 7.6s | — | Response is valid JSON with exactly 4 objects, all sharing consistent keys (sku, name, price_usd, in_stock, tags, added_on). All SKUs are uppercase 6 chars (CMG001, AUR042, TKL100, BOK512), all dates |
| 2026-07-05T05:07:29+00:00 | codex-gpt-5.4 | code_debug | 2 | 4 | ✓ | 10.0s | — | Root cause correctly identifies sorting by message strings alphabetically instead of by timestamp, and the fix is valid (restructures to sort entries by timestamp then extract messages). However, the |
| 2026-07-05T05:07:17+00:00 | codex-gpt-5.5 | agentic_prompt | 1 | 2 | — | 33.1s | — | The response is missing the required 'dry_run_command' field entirely; only 'plan', 'risks', and 'first_action' are provided. While the plan (5 items) and risks (4 items) meet the count criteria and f |
Showing the first 200 of 973 trials. The full data set is in data/all_trials.json in the repo and is also available as a CSV download (973 rows, including the 1P-est per trial). Per-trial text files are under /raw/trials/ on the deployed site.
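For working with these rows outside the page, the sketch below parses one table line into a typed record and applies the same kind of model/task/score filtering the page offers. It is a minimal illustration only: the `Trial` dataclass, `parse_row`, and `filter_trials` are hypothetical names, not part of the benchmark's tooling, and the code assumes the nine-column layout shown in the table header (TS, MODEL, TASK, REP, SCORE, PASS, WALL, COST (wire), JUDGE). Because JUDGE is the last column and its text can itself contain a pipe character, the split is capped at eight separators.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Trial:
    # Hypothetical record type mirroring the nine table columns.
    ts: str
    model: str
    task: str
    rep: int
    score: int
    passed: bool
    wall_s: float
    cost_usd: Optional[float]  # None when the table shows a dash (no per-call cost)
    judge: str


def parse_row(line: str) -> Trial:
    """Parse one pipe-delimited table row into a Trial."""
    body = line.strip()
    if body.startswith("|"):
        body = body[1:]
    if body.endswith("|"):
        body = body[:-1]
    # Split on the first 8 pipes only, so pipes inside JUDGE are preserved.
    cells = [c.strip() for c in body.split("|", 8)]
    if len(cells) != 9:
        raise ValueError(f"expected 9 columns, got {len(cells)}")
    ts, model, task, rep, score, passed, wall, cost, judge = cells
    return Trial(
        ts=ts,
        model=model,
        task=task,
        rep=int(rep),
        score=int(score),
        passed=(passed == "✓"),
        wall_s=float(wall.rstrip("s")),
        cost_usd=None if cost == "—" else float(cost.lstrip("$")),
        judge=judge.replace("\\|", "|"),  # undo markdown pipe escaping, if any
    )


def filter_trials(
    trials: Iterable[Trial],
    model: Optional[str] = None,
    task: Optional[str] = None,
    min_score: Optional[int] = None,
) -> List[Trial]:
    """Keep trials matching every filter that is not None."""
    return [
        t
        for t in trials
        if (model is None or t.model == model)
        and (task is None or t.task == task)
        and (min_score is None or t.score >= min_score)
    ]
```

For example, `filter_trials(map(parse_row, lines), task="creative_write", min_score=1)` would keep only the creative-writing trials that scored above zero, which drops the `judge_parse_error` and `no_response` rows.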