RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs

Meng-Hao Guo1, Xuanyu Chu1, Qianrui Yang1, Zhe-Han Mo1, Yiqing Shen1
Pei-Lin Li1, Xinjie Lin1, Jinnian Zhang2, Xin-Sheng Chen1, Yi Zhang1
Kiyohiro Nakayama3, Zhengyang Geng4, Houwen Peng2, Han Hu2, Shi-Min Hu1


1 Tsinghua University, 2 Tencent Hunyuan X, 3 Stanford University, 4 Carnegie Mellon University
[Teaser figure: geometric reasoning example]

R denotes reasoning, and V denotes vision-indispensable.
RBench-V spans four categories: math, physics, counting, and games.
It features 803 questions centered on multi-modal outputs, which require image manipulation, such as generating novel images and constructing auxiliary lines, to support the reasoning process.

[Figure: benchmark comparison]

Comparison with existing benchmarks such as MMLU and MMMU. RBench-V assesses models' ability to generate multi-modal outputs during visual reasoning.
Solving problems in RBench-V requires producing outputs beyond text, such as drawing geometric figures (top-right example) or tracing a path through a maze (bottom-right example).

Introduction

The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini, and o3, with their ability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of intelligence. Systematic evaluation of their multi-modal output capabilities in the visual thinking process (a.k.a. multi-modal chain of thought, M-CoT) has therefore become critically important. However, existing benchmarks for multi-modal models primarily assess multi-modal inputs and text-only reasoning, neglecting the importance of reasoning through multi-modal outputs.

In this paper, we present a benchmark, dubbed RBench-V, designed to assess models' multi-modal output reasoning. To construct RBench-V, we carefully hand-picked 803 questions covering math, physics, counting, and games. Unlike problems in previous benchmarks, which typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs, which require image manipulation, such as generating novel images and constructing auxiliary lines, to support the reasoning process.
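
To make the evaluation setup concrete, here is a minimal scoring sketch in Python. The file name, the record fields (category, question, image_path, answer), and the query_model callable are hypothetical placeholders rather than the official RBench-V harness, and exact-match scoring is a simplification of the paper's grading protocol.

import json
from collections import defaultdict

def evaluate(questions_path, query_model):
    # Score a model on RBench-V-style questions: Top-1 accuracy overall and per category.
    # The file layout and field names below are assumptions for illustration only.
    with open(questions_path) as f:
        questions = json.load(f)  # assumed: a list of question records

    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        category = q["category"]  # "math", "physics", "counting", or "game"
        # The model sees the question text plus the input image; it may produce
        # intermediate images (auxiliary lines, traced paths, ...), but only the
        # final answer string is scored here.
        prediction = query_model(q["question"], q["image_path"])
        total[category] += 1
        correct[category] += int(prediction.strip() == q["answer"].strip())

    overall = sum(correct.values()) / max(sum(total.values()), 1)
    per_category = {c: correct[c] / total[c] for c in total}
    return overall, per_category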

We evaluate numerous open- and closed-source models on RBench-V, including o3, Gemini 2.5 Pro, and Qwen2.5-VL. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%, showing that current models struggle to leverage multi-modal reasoning.

Leaderboard for Open- and Closed-Source Models

Overview

[Figure: model comparison]

Leaderboard

Model Source Overall w/o Math Math Physics Counting Game
Human Expert 👑 / 82.3 81.7 84.7 69.4 81.0 89.1
OpenAI o3 🥇 Link 25.8 19.5 48.3 20.4 22.1 17.1
OpenAI o4-mini 🥈 Link 20.9 14.6 43.2 12.7 17.4 13.8
Gemini 2.5 pro-preview-0506 🥉 Link 20.2 13.9 42.6 9.6 19.0 12.7
Doubao-1.5-thinking-pro-m Link 17.1 11.0 38.6 13.4 9.7 10.5
OpenAI o1 Link 16.2 11.0 34.7 5.7 12.3 13.1
Doubao-1.5-vision-pro Link 15.6 11.5 30.1 8.9 12.8 12.0
OpenAI GPT-4o-20250327 Link 14.1 11.2 24.4 3.2 13.3 14.2
OpenAI GPT-4.1 Link 13.6 11.7 20.5 5.7 11.3 15.3
Step-R1-V-Mini Link 13.2 8.8 29.0 6.4 10.3 9.1
OpenAI GPT-4.5 Link 12.6 11.0 18.2 2.5 11.8 15.3
Claude-3.7-sonnet Link 11.5 9.1 19.9 3.8 8.7 12.4
QVQ-Max Link 11.0 8.1 21.0 5.7 6.2 10.9
Qwen2.5VL-72B Link 10.6 9.2 15.3 3.8 6.2 14.5
InternVL-3-38B Link 10.0 7.2 20.5 0.6 5.1 12.4
Qwen2.5VL-32B Link 10.0 6.4 22.7 2.5 4.1 10.2
MiniCPM-2.6-o Link 9.7 7.5 17.6 1.3 3.6 13.8
Llama4-Scout (109B MoE) Link 9.5 6.9 18.8 3.2 4.1 10.9
MiniCPM-2.6-V Link 9.1 7.2 15.9 1.3 6.2 11.3
LLaVA-OneVision-72B Link 9.0 8.9 9.1 4.5 4.6 14.5
DeepSeek-VL2 Link 9.0 7.0 15.9 0.6 5.6 11.6
LLaVA-OneVision-7B Link 8.5 6.8 14.2 2.5 4.6 10.9
Qwen2.5VL-7B Link 8.3 7.0 13.1 2.5 3.6 12.0
InternVL-3-8B Link 8.2 6.0 15.9 1.9 5.6 8.7
InternVL-3-14B Link 8.0 7.0 11.4 1.3 5.1 11.6
Qwen2.5-Omni-7B Link 7.7 4.5 11.4 1.9 2.1 7.7
Values in the table are Top-1 accuracy (%).
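
As a reading aid for the aggregate columns, the sketch below shows one way such numbers can be computed from per-question results, assuming "Overall" averages over all 803 questions while "w/o Math" averages over the physics, counting, and game questions only; the results format and function are illustrative assumptions, not the official scoring code.

def top1_accuracy(results, exclude=()):
    # Top-1 accuracy in %, optionally excluding some categories.
    # results is assumed to be a list of (category, is_correct) pairs,
    # e.g. [("math", True), ("game", False), ...].
    kept = [ok for cat, ok in results if cat not in exclude]
    return 100.0 * sum(kept) / len(kept) if kept else 0.0

# Hypothetical usage with one model's per-question results:
# overall = top1_accuracy(results)                    # "Overall" column
# wo_math = top1_accuracy(results, exclude={"math"})  # "w/o Math" column
# math    = top1_accuracy([r for r in results if r[0] == "math"])  # "Math" column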

RBench-V

Examples

[Figure: data overview]

Examples of o3's responses to math and game questions in RBench-V:
Left: o3 solves a math question in RBench-V by converting a geometry problem into algebra using coordinates,
unlike humans who use geometric reasoning.
Right: o3 fails a game question by not following instructions to draw required connections, as highlighted in blue.

BibTeX

        
@inproceedings{guo2025rbenchv,
  title={RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs},
  author={Meng-Hao Guo and Xuanyu Chu and Qianrui Yang and Zhe-Han Mo and Yiqing Shen and Pei-Lin Li and Xinjie Lin and Jinnian Zhang and Xin-Sheng Chen and Yi Zhang and Kiyohiro Nakayama and Zhengyang Geng and Houwen Peng and Han Hu and Shi-Min Hu},
  year={2025},
  eprint={2505.16770},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.16770},
}
        
      

Acknowledgement

We would like to thank MathVision and MathVista for this website template, which is adapted from Nerfies and licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.