Tasks
Tasks in detail
Click a task to see every model's response. Each task card shows the top 3 models by mean score. The full set of trials for each task is in /raw/.
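As a rough illustration of how a "top 3 models by mean score" ranking could be derived from per-trial records, here is a minimal sketch. The record fields (`task`, `model`, `score`), the `top_models` helper, and the sample data are all hypothetical; the actual layout of the files in /raw/ is not described here, and the tie-breaking rule is an assumption.

```python
from collections import defaultdict
from statistics import mean


def top_models(trials, task, k=3):
    """Return the k (model, mean score) pairs with the highest mean on one task."""
    scores = defaultdict(list)
    for trial in trials:
        if trial["task"] == task:
            scores[trial["model"]].append(trial["score"])
    means = {model: mean(values) for model, values in scores.items()}
    # Sort by mean score, highest first. Breaking ties by model name is an
    # assumption made for this sketch, not the dashboard's documented rule.
    return sorted(means.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Made-up trial records, for illustration only.
trials = [
    {"task": "demo_task", "model": "model-a", "score": 5.0},
    {"task": "demo_task", "model": "model-a", "score": 3.0},
    {"task": "demo_task", "model": "model-b", "score": 5.0},
    {"task": "demo_task", "model": "model-c", "score": 1.0},
    {"task": "demo_task", "model": "model-d", "score": 2.0},
    {"task": "other_task", "model": "model-d", "score": 5.0},
]

print(top_models(trials, "demo_task"))
```

Several models can share the same mean (many of the cards list three models at 5.0), so whichever tie-break is used decides which of them appear in the top 3.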
code_debug
- 1 mistral/mistral-large 5.0
- 2 mistral/mistral-medium 4.0
- 3 ollama-cl/nemotron-3-ultra 3.4
139 trials
2.6 avg score
26 pass
code_gen_long
- 1 minimax-m2.7-highspeed 5.0
- 2 zai-coding/glm-5.2 5.0
- 3 zai-coding/glm-5 5.0
139 trials
3.7 avg score
95 pass
json_strict
- 1 minimax-m3 5.0
- 2 ollama-cl/minimax-m2.5 5.0
- 3 minimax-m2.7 5.0
139 trials
4.6 avg score
127 pass
summarize
- 1 claude-fable 5.0
- 2 claude-haiku 5.0
- 3 zai-coding/glm-5 4.8
139 trials
4.2 avg score
106 pass
creative_write
- 1 zai-coding/glm-5 2.0
- 2 claude-fable 2.0
- 3 claude-haiku 2.0
139 trials
0.7 avg score
14 pass
reasoning_multistep
- 1 minimax-m3 5.0
- 2 ollama-cl/minimax-m2.5 5.0
- 3 minimax-m2.7 5.0
139 trials
4.8 avg score
133 pass
agentic_prompt
- 1 ollama-cl/minimax-m2.5 5.0
- 2 minimax-m2.7 5.0
- 3 mistral/devstral 5.0
139 trials
3.8 avg score
94 pass