RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs

Meng-Hao Guo1, Xuanyu Chu1, Qianrui Yang1, Zhe-Han Mo1, Yiqing Shen1
Pei-Lin Li1, Xinjie Lin1, Jinnian Zhang2, Xin-Sheng Chen1, Yi Zhang1
Kiyohiro Nakayama3, Zhengyang Geng4, Houwen Peng2, Han Hu2, Shi-Min Hu1


1 Tsinghua University, 2 Tencent Hunyuan X, 3 Stanford University, 4 Carnegie Mellon University
[Teaser figure: geometric reasoning example]

R denotes reasoning, and V denotes vision-indispensable.
RBench-V spans four categories: math, physics, counting, and games.
It features 803 questions centered on multi-modal outputs, which require image manipulation, such as generating novel images and constructing auxiliary lines, to support the reasoning process.

[Figure: benchmark comparison]

Comparison with existing benchmarks such as MMLU and MMMU. RBench-V assesses models' ability to generate multi-modal outputs during visual reasoning.
Solving problems in RBench-V requires producing outputs beyond text, such as drawing geometric figures (top-right example) or tracing a path through a maze (bottom-right example).

Introduction

The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini, and o3, with their ability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of intelligence. Systematic evaluation of their multi-modal output capabilities in the visual thinking process (a.k.a. multi-modal chain of thought, M-CoT) has therefore become critically important. However, existing benchmarks for multi-modal models primarily assess multi-modal inputs and text-only reasoning, neglecting the importance of reasoning through multi-modal outputs.

In this paper, we present a benchmark, dubbed RBench-V, designed to assess models' multi-modal output reasoning. To construct RBench-V, we carefully hand-picked 803 questions covering math, physics, counting, and games. Unlike problems in previous benchmarks, which typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs, which require image manipulation, such as generating novel images and constructing auxiliary lines, to support the reasoning process.
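
To make the evaluation setup concrete, here is a minimal scoring sketch in Python. The file name, the record fields (category, question, image_path, answer), and the query_model callable are hypothetical placeholders rather than the official RBench-V harness, and exact-match scoring is a simplification of the paper's grading protocol.

import json
from collections import defaultdict

def evaluate(questions_path, query_model):
    # Score a model on RBench-V-style questions: Top-1 accuracy overall and per category.
    # The file layout and field names below are assumptions for illustration only.
    with open(questions_path) as f:
        questions = json.load(f)  # assumed: a list of question records

    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        category = q["category"]  # "math", "physics", "counting", or "game"
        # The model sees the question text plus the input image; it may produce
        # intermediate images (auxiliary lines, traced paths, ...), but only the
        # final answer string is scored here.
        prediction = query_model(q["question"], q["image_path"])
        total[category] += 1
        correct[category] += int(prediction.strip() == q["answer"].strip())

    overall = sum(correct.values()) / max(sum(total.values()), 1)
    per_category = {c: correct[c] / total[c] for c in total}
    return overall, per_category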

We evaluate numerous open- and closed-source models on RBench-V, including o3, Gemini 2.5 Pro, and Qwen2.5-VL. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%, showing that current models struggle to leverage multi-modal reasoning.

Leaderboard for Open- and Closed-Source Models

Overview

[Figure: model comparison]

Leaderboard

Model Source Overall w/o Math Math Physics Counting Game
Human Expert 👑 / 82.3 81.7 84.7 69.4 81.0 89.1
OpenAI o3 🥇 Link 25.8 19.5 48.3 20.4 22.1 17.1
OpenAI o4-mini 🥈 Link 20.9 14.6 43.2 12.7 17.4 13.8
Gemini 2.5 pro-preview-0506 🥉 Link 20.2 13.9 42.6 9.6 19.0 12.7
Doubao-1.5-thinking-pro-m Link 17.1 11.0 38.6 13.4 9.7 10.5
OpenAI o1 Link 16.2 11.0 34.7 5.7 12.3 13.1
Doubao-1.5-vision-pro Link 15.6 11.5 30.1 8.9 12.8 12.0
OpenAI GPT-4o-20250327 Link 14.1 11.2 24.4 3.2 13.3 14.2
OpenAI GPT-4.1 Link 13.6 11.7 20.5 5.7 11.3 15.3
Step-R1-V-Mini Link 13.2 8.8 29.0 6.4 10.3 9.1
OpenAI GPT-4.5 Link 12.6 11.0 18.2 2.5 11.8 15.3
Claude-3.7-sonnet Link 11.5 9.1 19.9 3.8 8.7 12.4
QVQ-Max Link 11.0 8.1 21.0 5.7 6.2 10.9
Qwen2.5VL-72B Link 10.6 9.2 15.3 3.8 6.2 14.5
InternVL-3-38B Link 10.0 7.2 20.5 0.6 5.1 12.4
Qwen2.5VL-32B Link 10.0 6.4 22.7 2.5 4.1 10.2
MiniCPM-2.6-o Link 9.7 7.5 17.6 1.3 3.6 13.8
Llama4-Scout (109B MoE) Link 9.5 6.9 18.8 3.2 4.1 10.9
MiniCPM-2.6-V Link 9.1 7.2 15.9 1.3 6.2 11.3
LLaVA-OneVision-72B Link 9.0 8.9 9.1 4.5 4.6 14.5
DeepSeek-VL2 Link 9.0 7.0 15.9 0.6 5.6 11.6
LLaVA-OneVision-7B Link 8.5 6.8 14.2 2.5 4.6 10.9
Qwen2.5VL-7B Link 8.3 7.0 13.1 2.5 3.6 12.0
InternVL-3-8B Link 8.2 6.0 15.9 1.9 5.6 8.7
InternVL-3-14B Link 8.0 7.0 11.4 1.3 5.1 11.6
Qwen2.5-Omni-7B Link 7.7 4.5 11.4 1.9 2.1 7.7
Values in the table are Top-1 accuracy (%).
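
As a reading aid for the aggregate columns, the sketch below shows one way such numbers can be computed from per-question results, assuming "Overall" averages over all 803 questions while "w/o Math" averages over the physics, counting, and game questions only; the results format and function are illustrative assumptions, not the official scoring code.

def top1_accuracy(results, exclude=()):
    # Top-1 accuracy in %, optionally excluding some categories.
    # results is assumed to be a list of (category, is_correct) pairs,
    # e.g. [("math", True), ("game", False), ...].
    kept = [ok for cat, ok in results if cat not in exclude]
    return 100.0 * sum(kept) / len(kept) if kept else 0.0

# Hypothetical usage with one model's per-question results:
# overall = top1_accuracy(results)                    # "Overall" column
# wo_math = top1_accuracy(results, exclude={"math"})  # "w/o Math" column
# math    = top1_accuracy([r for r in results if r[0] == "math"])  # "Math" column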

RBench-V

Examples

[Figure: data overview]

Examples of o3's responses to math and game questions in RBench-V:
Left: o3 solves a math question in RBench-V by converting a geometry problem into algebra using coordinates,
unlike humans who use geometric reasoning.
Right: o3 fails a game question by not following instructions to draw required connections, as highlighted in blue.

BibTeX

        
@inproceedings{guo2025rbenchv,
  title={RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs},
  author={Meng-Hao Guo and Xuanyu Chu and Qianrui Yang and Zhe-Han Mo and Yiqing Shen and Pei-Lin Li and Xinjie Lin and Jinnian Zhang and Xin-Sheng Chen and Yi Zhang and Kiyohiro Nakayama and Zhengyang Geng and Houwen Peng and Han Hu and Shi-Min Hu},
  year={2025},
  eprint={2505.16770},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.16770},
}
        
      

Acknowledgement

We would like to thank MathVision and MathVista for this website template, which is adapted from Nerfies and licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.