The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini, and o3, with their capability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of artificial intelligence.
Systematic evaluation of their multi-modal output capabilities in visual thinking processes (a.k.a. multi-modal chain of thought, M-CoT) thus becomes critically important.
However, existing benchmarks for evaluating multi-modal models primarily focus on assessing multi-modal inputs and text-only reasoning, while neglecting the importance of reasoning through multi-modal outputs.
In this paper, we present a benchmark, dubbed RBench-V, designed to assess models' multi-modal reasoning. To construct RBench-V, we carefully hand-pick 803 questions covering math, physics, counting, and games. Unlike problems in previous benchmarks, which typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs, which require image manipulation, such as generating novel images and constructing auxiliary lines to support the reasoning process.
We evaluate numerous open- and closed-source models on RBench-V, including o3, Gemini 2.5 Pro, Qwen2.5-VL, etc. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%, which shows that current models struggle to leverage multi-modal reasoning.