Tasks

Tasks in detail

Click a task to see every model's response. Each task card shows the top 3 models by mean score. The full set of trials for each task is in /raw/.
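
The "top 3 models by mean score" ranking on each card can be reproduced from the raw trials with a short aggregation. The sketch below is illustrative only: the record layout (`model` and `score` keys), the function name `top_models`, and the sample data are hypothetical, and breaking ties by model name is an assumption, since the page does not say how ties are ordered.

```python
from collections import defaultdict
from statistics import mean


def top_models(trials, n=3):
    """Return the n models with the highest mean score, best first.

    `trials` is an iterable of dicts with hypothetical keys
    "model" (str) and "score" (float); the real layout of the
    files in /raw/ may differ.
    """
    scores = defaultdict(list)
    for trial in trials:
        scores[trial["model"]].append(trial["score"])

    means = {model: mean(values) for model, values in scores.items()}

    # Sort by mean score descending; ties fall back to model name
    # (an assumption, the page does not document its tie-breaking).
    ranked = sorted(means.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


# Made-up sample data, not taken from the benchmark.
trials = [
    {"model": "model-a", "score": 5.0},
    {"model": "model-a", "score": 3.0},
    {"model": "model-b", "score": 5.0},
    {"model": "model-c", "score": 2.0},
    {"model": "model-d", "score": 1.0},
]

for rank, (model, score) in enumerate(top_models(trials), start=1):
    print(f"{rank}. {model} {score:.1f}")
```

Because the ranking uses the mean, a model with a single perfect trial would outrank one with many near-perfect trials; a minimum trial count per model could be added if that matters.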

code_debug
  1. mistral/mistral-large 5.0
  2. mistral/mistral-medium 4.0
  3. ollama-cl/nemotron-3-ultra 3.4
139 trials, 2.6 avg score, 26 pass
code_gen_long
  1. minimax-m2.7-highspeed 5.0
  2. zai-coding/glm-5.2 5.0
  3. zai-coding/glm-5 5.0
139 trials, 3.7 avg score, 95 pass
json_strict
  1. minimax-m3 5.0
  2. ollama-cl/minimax-m2.5 5.0
  3. minimax-m2.7 5.0
139 trials, 4.6 avg score, 127 pass
summarize
  1. claude-fable 5.0
  2. claude-haiku 5.0
  3. zai-coding/glm-5 4.8
139 trials, 4.2 avg score, 106 pass
creative_write
  1. zai-coding/glm-5 2.0
  2. claude-fable 2.0
  3. claude-haiku 2.0
139 trials, 0.7 avg score, 14 pass
reasoning_multistep
  1. minimax-m3 5.0
  2. ollama-cl/minimax-m2.5 5.0
  3. minimax-m2.7 5.0
139 trials, 4.8 avg score, 133 pass
agentic_prompt
  1. ollama-cl/minimax-m2.5 5.0
  2. minimax-m2.7 5.0
  3. mistral/devstral 5.0
139 trials, 3.8 avg score, 94 pass