RBench: Graduate-level Multi-disciplinary Benchmarks for
LLM & MLLM Complex Reasoning Evaluation

ICML 2025

Meng-Hao Guo1, Jiajun Xu1, Yi Zhang1, Jiaxi Song1, Haoyang Peng1, Yi-Xuan Deng1, Xinzhi Dong1,
Kiyohiro Nakayama3, Zhengyang Geng4, Chen Wang5, Bolin Ni2, Guo-Wei Yang6, Yongming Rao2, Houwen Peng2,
Han Hu2, Gordon Wetzstein3, Shi-Min Hu1

1 Tsinghua University, 2 Tencent Hunyuan X, 3 Stanford University,
4 Carnegie Mellon University, 5 University of Pennsylvania, 6 Fitten Tech
[Figure: geometric reasoning]

R denotes reasoning. RBench spans 19 departments, including mathematics, physics, biology, computer science, and chemistry, and covers over 100 subjects such as Inorganic Chemistry, Chemical Reaction Kinetics, and Electromagnetism. It contains 1,094 questions for testing language models and 665 questions specifically tailored for evaluating multimodal reasoning capabilities.
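For reference, the sketch below shows one way these statistics could be tallied from a released question file. It is a minimal sketch: the file name and the record fields ("department", "subject", "split") are hypothetical placeholders, not the official RBench schema.

    # Minimal sketch of tallying RBench questions by department, subject, and split.
    # File name and field names are assumptions; adapt them to the actual release.
    import json
    from collections import Counter

    with open("rbench_questions.json", encoding="utf-8") as f:
        questions = json.load(f)  # assumed: a list of question records

    departments = Counter(q["department"] for q in questions)
    subjects = Counter(q["subject"] for q in questions)
    splits = Counter(q["split"] for q in questions)  # e.g. "text" vs. "multimodal"

    print(f"{len(departments)} departments, {len(subjects)} subjects")
    print("questions per split:", dict(splits))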

[Figure: benchmark comparison]

Comparison with existing benchmarks such as MMLU and MMMU. RBench exhibits higher reasoning difficulty, with o1's performance saturation comparable to that on competition data such as AIME 2024.

Introduction

Reasoning stands as a cornerstone of intelligence, enabling the synthesis of existing knowledge to solve complex problems. Despite remarkable progress, existing reasoning benchmarks often fail to rigorously evaluate the nuanced reasoning capabilities required for complex, real-world problem-solving, particularly in multi-disciplinary and multimodal contexts.

In this paper, we introduce a graduate-level, multi-disciplinary, English-Chinese benchmark, dubbed Reasoning Bench (RBench), for assessing the reasoning capability of both language and multimodal models. It spans 1,094 questions across 108 subjects for language model evaluation and 665 questions across 83 subjects for multimodal model testing, in both English and Chinese. These questions are meticulously curated with rigorous difficulty calibration, subject balance, and cross-linguistic alignment, making RBench an Olympiad-level multi-disciplinary benchmark.

We evaluate widely used models, including OpenAI o1, GPT-4o, and DeepSeek-R1. Experimental results indicate that even advanced models perform poorly on complex reasoning, especially multimodal reasoning: the top-performing model, OpenAI o1, achieves only 53.2% accuracy on our multimodal evaluation.
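Since the leaderboards below report Top-1 accuracy, evaluation reduces to comparing each model's chosen option with the ground-truth answer. The following is a minimal sketch of such a loop; it assumes a multiple-choice record format and a hypothetical query_model callable, neither of which is specified on this page.

    # Minimal sketch of a Top-1 accuracy evaluation loop (assumed record format).
    from typing import Callable, Dict, List

    def evaluate(questions: List[Dict], query_model: Callable[[str, List[str]], str]) -> float:
        """Return Top-1 accuracy: the fraction of questions whose predicted
        option matches the ground-truth answer."""
        correct = 0
        for q in questions:
            # query_model is a hypothetical stand-in for an API or local
            # inference call that returns an option label such as "A" or "B".
            predicted = query_model(q["question"], q["options"])
            if predicted == q["answer"]:
                correct += 1
        return correct / len(questions)

    # Example usage: a trivial baseline that always answers "A".
    # accuracy = evaluate(questions, lambda text, options: "A")
    # print(f"Top-1 accuracy: {accuracy:.1%}")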

Leaderboard for Language Model

| Model | Source | Date | Average | RBench-T | RBench-T (zh) |
|---|---|---|---|---|---|
| OpenAI o1 🥇 | Link | 2024-12-17 | 69.6 | 69.0 | 70.1 |
| Gemini2.0-Flash-Thinking 🥈 | Link | 2025-01-21 | 68.0 | 68.4 | 67.5 |
| Doubao1.5Pro 🥉 | Link | 2025-01-21 | 62.7 | 62.0 | 63.4 |
| o1-Preview | Link | 2024-09-12 | 62.5 | 62.3 | 62.6 |
| o1-mini | Link | 2024-09-12 | 62.0 | 64.0 | 59.9 |
| Doubao-Pro | Link | 2024-12-15 | 60.8 | 60.7 | 60.8 |
| DeepSeek-R1 | Link | 2025-01-22 | 60.3 | 61.2 | 59.3 |
| DeepSeek-V3 | Link | 2024-12-26 | 58.1 | 59.6 | 56.6 |
| Claude3.5-sonnet | Link | 2024-06-20 | 57.4 | 57.5 | 57.3 |
| MiniMax-Text-01 | Link | 2025-01-14 | 53.7 | 53.8 | 53.6 |
| Qwen2.5-72B | Link | 2024-09-19 | 52.9 | 53.7 | 52.0 |
| GPT-4o | Link | 2024-11-20 | 52.6 | 53.6 | 51.6 |
| GLM-Zero-Preview | Link | 2024-12-31 | 51.1 | 53.6 | 48.6 |
| Qwen2.5-32B | Link | 2024-09-19 | 50.4 | 50.8 | 49.9 |
| Qwen2.5-7B | Link | 2024-09-19 | 44.1 | 43.6 | 44.5 |
| Llama-3.1-8B | Link | 2024-07-23 | 24.9 | 26.1 | 23.6 |
-T: text-only questions for evaluating language models; "zh" denotes the Chinese version.
Values are Top-1 accuracy (%).
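Judging from the numbers, the Average column appears to be the simple mean of the English and Chinese scores, rounded to one decimal place; for example, OpenAI o1: (69.0 + 70.1) / 2 = 69.55 ≈ 69.6. The same relationship appears to hold for the multimodal leaderboard below.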

Leaderboard for Multimodal Model

| Model | Source | Date | Average | RBench-M | RBench-M (zh) |
|---|---|---|---|---|---|
| OpenAI o1 🥇 | Link | 2024-12-17 | 53.1 | 53.2 | 53.0 |
| Doubao1.5Pro 🥈 | Link | 2025-01-21 | 40.2 | 37.9 | 42.4 |
| Claude-3-5-sonnet 🥉 | Link | 2025-04-10 | 39.0 | 39.7 | 38.3 |
| Gemini1.5-Pro | Link | 2024-02-15 | 35.7 | 35.9 | 35.5 |
| GPT-4o | Link | 2024-11-20 | 33.3 | 33.4 | 33.2 |
| Qwen2.5-72B | Link | 2024-09-19 | 25.4 | 25.1 | 25.7 |
| LLaVA-OneVision-7B | Link | 2024-08-06 | 23.7 | 23.8 | 23.5 |
| DeepSeek-VL2 | Link | 2024-12-13 | 23.1 | 21.8 | 24.4 |
| Qwen2.5-7B | Link | 2024-09-19 | 21.0 | 19.6 | 22.3 |
| Llama3.2V-11B-Instruct | Link | 2024-09-25 | 19.3 | 20.0 | 18.6 |
-M: multimodal questions for evaluating multimodal models; "zh" denotes the Chinese version.
Values are Top-1 accuracy (%).

R-Bench

Examples

[Figure: data overview]

Examples from RBench. These examples show that RBench is multidisciplinary, multimodal, and multilingual. As shown in the figure, the problems in RBench are complex and cannot be solved by quick, shallow thinking, indicating that RBench targets deep reasoning rather than knowledge recall, such as conceptual questions.

BibTeX

        @inproceedings{
          guo2025rbench,
          title={RBench: Graduate-level Multi-disciplinary Benchmarks for
            {LLM} \& {MLLM} Complex Reasoning Evaluation},
          author={Meng-Hao Guo and Jiajun Xu and Yi Zhang and Jiaxi Song and Haoyang Peng and
            Yi-Xuan Deng and Xinzhi Dong and Kiyohiro Nakayama and Zhengyang Geng and Chen Wang and
            Bolin Ni and Guo-Wei Yang and Yongming Rao and Houwen Peng and Han Hu and
            Gordon Wetzstein and Shi-Min Hu},
          booktitle={International Conference on Machine Learning (ICML)},
          year={2025},
          eprint={2505.02018},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2505.02018},
        }

Acknowledgement

We would like to thank MathVision and MathVista for this website template, which is adapted from Nerfies and licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.