Overview
Leaderboard
| Model | Source | Overall | w/o Math | Math | Physics | Counting | Game |
|---|---|---|---|---|---|---|---|
| Human Expert | / | 82.3 | 81.7 | 84.7 | 69.4 | 81.0 | 89.1 |
| DreamPRM-1.5 (GPT-5-mini)* 🥇 | Link | 31.3 | 26.0 | 50.0 | 38.9 | 24.6 | 19.6 |
| GPT-5-mini 🥈 | Link | 27.9 | 22.2 | 48.3 | 31.8 | 22.6 | 16.4 |
| OpenAI o3 🥉 | Link | 25.8 | 19.5 | 48.3 | 20.4 | 22.1 | 17.1 |
| OpenAI o4-mini | Link | 20.9 | 14.6 | 43.2 | 12.7 | 17.4 | 13.8 |
| Gemini 2.5 pro-preview-0506 | Link | 20.2 | 13.9 | 42.6 | 9.6 | 19.0 | 12.7 |
| Doubao-1.5-thinking-pro-m | Link | 17.1 | 11.0 | 38.6 | 13.4 | 9.7 | 10.5 |
| OpenAI o1 | Link | 16.2 | 11.0 | 34.7 | 5.7 | 12.3 | 13.1 |
| Doubao-1.5-vision-pro | Link | 15.6 | 11.5 | 30.1 | 8.9 | 12.8 | 12.0 |
| OpenAI GPT-4o-20250327 | Link | 14.1 | 11.2 | 24.4 | 3.2 | 13.3 | 14.2 |
| OpenAI GPT-4.1 | Link | 13.6 | 11.7 | 20.5 | 5.7 | 11.3 | 15.3 |
| Step-R1-V-Mini | Link | 13.2 | 8.8 | 29.0 | 6.4 | 10.3 | 9.1 |
| OpenAI GPT-4.5 | Link | 12.6 | 11.0 | 18.2 | 2.5 | 11.8 | 15.3 |
| Claude-3.7-sonnet | Link | 11.5 | 9.1 | 19.9 | 3.8 | 8.7 | 12.4 |
| JT-VL-Chat-Thinking-20251015 | Link | 11.1 | 8.3 | 21.6 | 1.9 | 9.2 | 10.9 |
| QVQ-Max | Link | 11.0 | 8.1 | 21.0 | 5.7 | 6.2 | 10.9 |
| Qwen2.5VL-72B | Link | 10.6 | 9.2 | 15.3 | 3.8 | 6.2 | 14.5 |
| InternVL-3-38B | Link | 10.0 | 7.2 | 20.5 | 0.6 | 5.1 | 12.4 |
| Qwen2.5VL-32B | Link | 10.0 | 6.4 | 22.7 | 2.5 | 4.1 | 10.2 |
| MiniCPM-2.6-o | Link | 9.7 | 7.5 | 17.6 | 1.3 | 3.6 | 13.8 |
| Llama4-Scout (109B MoE) | Link | 9.5 | 6.9 | 18.8 | 3.2 | 4.1 | 10.9 |
| MiniCPM-2.6-V | Link | 9.1 | 7.2 | 15.9 | 1.3 | 6.2 | 11.3 |
| LLaVA-OneVision-72B | Link | 9.0 | 8.9 | 9.1 | 4.5 | 4.6 | 14.5 |
| DeepSeek-VL2 | Link | 9.0 | 7.0 | 15.9 | 0.6 | 5.6 | 11.6 |
| LLaVA-OneVision-7B | Link | 8.5 | 6.8 | 14.2 | 2.5 | 4.6 | 10.9 |
| Qwen2.5VL-7B | Link | 8.3 | 7.0 | 13.1 | 2.5 | 3.6 | 12.0 |
| InternVL-3-8B | Link | 8.2 | 6.0 | 15.9 | 1.9 | 5.6 | 8.7 |
| InternVL-3-14B | Link | 8.0 | 7.0 | 11.4 | 1.3 | 5.1 | 11.6 |
| Qwen2.5-Omni-7B | Link | 7.7 | 4.5 | 11.4 | 1.9 | 2.1 | 7.7 |
* Results obtained under "Best-of-4 + PRM selection": for each test instance, four reasoning trajectories are generated, and the Process Reward Model (PRM) selects the most coherent one.
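
The selection procedure in the footnote can be sketched roughly as follows. This is a minimal illustration, not the DreamPRM implementation: `generate` and `score_step` are hypothetical callables standing in for the base model and the PRM, and averaging the per-step scores into one trajectory score is an assumption, since the footnote does not say how step scores are aggregated.

```python
from typing import Callable, List, Sequence


def select_best_of_n(
    question: str,
    generate: Callable[[str], List[str]],
    score_step: Callable[[str, Sequence[str]], float],
    n: int = 4,
) -> List[str]:
    """Sample n reasoning trajectories and return the one the PRM rates highest.

    generate(question) returns one trajectory as a list of reasoning steps
    (hypothetical stand-in for the base model).
    score_step(question, prefix) returns the PRM score for the last step of
    `prefix` given the earlier steps (hypothetical stand-in for the PRM).
    """
    candidates = [generate(question) for _ in range(n)]

    def trajectory_score(steps: List[str]) -> float:
        if not steps:
            return float("-inf")  # never prefer an empty trajectory
        # Assumption: aggregate per-step PRM scores by their mean.
        step_scores = [
            score_step(question, steps[: i + 1]) for i in range(len(steps))
        ]
        return sum(step_scores) / len(step_scores)

    return max(candidates, key=trajectory_score)
```

With `n=4` this corresponds to the "Best-of-4" setting: the base model is queried four times per test instance, and only the highest-scoring trajectory is kept as the final answer.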