RBench: Graduate-level Multi-disciplinary Benchmarks for
LLM & MLLM Complex Reasoning Evaluation

ICML 2025

Meng-Hao Guo1, Jiajun Xu1, Yi Zhang1, Jiaxi Song1, Haoyang Peng1, Yi-Xuan Deng1, Xinzhi Dong1,
Kiyohiro Nakayama3, Zhengyang Geng4, Chen Wang5, Bolin Ni2, Guo-Wei Yang6, Yongming Rao2, Houwen Peng2,
Han Hu2, Gordon Wetzstein3, Shi-Min Hu1

1 Tsinghua University, 2 Tencent Hunyuan X, 3 Stanford University,
4 Carnegie Mellon University, 5 University of Pennsylvania, 6 Fitten Tech
[Figure: geometric reasoning]

R denotes reasoning. RBench spans 19 departments, including mathematics, physics, biology, computer science, and chemistry, and covers over 100 subjects such as Inorganic Chemistry, Chemical Reaction Kinetics, and Electromagnetism. It contains 1,094 questions for testing language models and 665 questions specifically tailored for evaluating multimodal reasoning capabilities.
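For reference, the sketch below shows one way these statistics could be tallied from a released question file. It is a minimal sketch: the file name and the record fields ("department", "subject", "split") are hypothetical placeholders, not the official RBench schema.

    # Minimal sketch of tallying RBench questions by department, subject, and split.
    # File name and field names are assumptions; adapt them to the actual release.
    import json
    from collections import Counter

    with open("rbench_questions.json", encoding="utf-8") as f:
        questions = json.load(f)  # assumed: a list of question records

    departments = Counter(q["department"] for q in questions)
    subjects = Counter(q["subject"] for q in questions)
    splits = Counter(q["split"] for q in questions)  # e.g. "text" vs. "multimodal"

    print(f"{len(departments)} departments, {len(subjects)} subjects")
    print("questions per split:", dict(splits))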

[Figure: benchmark comparison]

Comparison with existing benchmarks such as MMLU and MMMU. RBench exhibits higher reasoning difficulty, with o1's performance saturation comparable to that on competition data such as AIME 2024.

Introduction

Reasoning stands as a cornerstone of intelligence, enabling the synthesis of existing knowledge to solve complex problems. Despite remarkable progress, existing reasoning benchmarks often fail to rigorously evaluate the nuanced reasoning capabilities required for complex, real-world problem-solving, particularly in multi-disciplinary and multimodal contexts.

In this paper, we introduce a graduate-level, multi-disciplinary, English-Chinese benchmark, dubbed Reasoning Bench (RBench), for assessing the reasoning capability of both language and multimodal models. It spans 1,094 questions across 108 subjects for language model evaluation and 665 questions across 83 subjects for multimodal model testing, in both English and Chinese. These questions are meticulously curated with rigorous difficulty calibration, subject balance, and cross-linguistic alignment, making RBench an Olympiad-level multi-disciplinary benchmark.

We evaluate widely used models, including OpenAI o1, GPT-4o, and DeepSeek-R1. Experimental results indicate that even advanced models perform poorly on complex reasoning, especially multimodal reasoning: the top-performing model, OpenAI o1, achieves only 53.2% accuracy on our multimodal evaluation.
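Since the leaderboards below report Top-1 accuracy, evaluation reduces to comparing each model's chosen option with the ground-truth answer. The following is a minimal sketch of such a loop; it assumes a multiple-choice record format and a hypothetical query_model callable, neither of which is specified on this page.

    # Minimal sketch of a Top-1 accuracy evaluation loop (assumed record format).
    from typing import Callable, Dict, List

    def evaluate(questions: List[Dict], query_model: Callable[[str, List[str]], str]) -> float:
        """Return Top-1 accuracy: the fraction of questions whose predicted
        option matches the ground-truth answer."""
        correct = 0
        for q in questions:
            # query_model is a hypothetical stand-in for an API or local
            # inference call that returns an option label such as "A" or "B".
            predicted = query_model(q["question"], q["options"])
            if predicted == q["answer"]:
                correct += 1
        return correct / len(questions)

    # Example usage: a trivial baseline that always answers "A".
    # accuracy = evaluate(questions, lambda text, options: "A")
    # print(f"Top-1 accuracy: {accuracy:.1%}")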

Leaderboard for Language Model

| Model | Source | Date | Average | RBench-T | RBench-T (zh) |
|---|---|---|---|---|---|
| OpenAI o1 🥇 | Link | 2024-12-17 | 69.6 | 69.0 | 70.1 |
| Gemini2.0-Flash-Thinking 🥈 | Link | 2025-01-21 | 68.0 | 68.4 | 67.5 |
| Doubao1.5Pro 🥉 | Link | 2025-01-21 | 62.7 | 62.0 | 63.4 |
| o1-Preview | Link | 2024-09-12 | 62.5 | 62.3 | 62.6 |
| o1-mini | Link | 2024-09-12 | 62.0 | 64.0 | 59.9 |
| Doubao-Pro | Link | 2024-12-15 | 60.8 | 60.7 | 60.8 |
| DeepSeek-R1 | Link | 2025-01-22 | 60.3 | 61.2 | 59.3 |
| DeepSeek-V3 | Link | 2024-12-26 | 58.1 | 59.6 | 56.6 |
| Claude3.5-sonnet | Link | 2024-06-20 | 57.4 | 57.5 | 57.3 |
| MiniMax-Text-01 | Link | 2025-01-14 | 53.7 | 53.8 | 53.6 |
| Qwen2.5-72B | Link | 2024-09-19 | 52.9 | 53.7 | 52.0 |
| GPT-4o | Link | 2024-11-20 | 52.6 | 53.6 | 51.6 |
| GLM-Zero-Preview | Link | 2024-12-31 | 51.1 | 53.6 | 48.6 |
| Qwen2.5-32B | Link | 2024-09-19 | 50.4 | 50.8 | 49.9 |
| Qwen2.5-7B | Link | 2024-09-19 | 44.1 | 43.6 | 44.5 |
| Llama-3.1-8B | Link | 2024-07-23 | 24.9 | 26.1 | 23.6 |
-T: text-only questions for evaluating language models; "zh" denotes the Chinese version.
Values are Top-1 accuracy (%).
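Judging from the numbers, the Average column appears to be the simple mean of the English and Chinese scores, rounded to one decimal place; for example, OpenAI o1: (69.0 + 70.1) / 2 = 69.55 ≈ 69.6. The same relationship appears to hold for the multimodal leaderboard below.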

Leaderboard for Multimodal Model

| Model | Source | Date | Average | RBench-M | RBench-M (zh) |
|---|---|---|---|---|---|
| OpenAI o1 🥇 | Link | 2024-12-17 | 53.1 | 53.2 | 53.0 |
| Doubao1.5Pro 🥈 | Link | 2025-01-21 | 40.2 | 37.9 | 42.4 |
| Claude-3-5-sonnet 🥉 | Link | 2025-04-10 | 39.0 | 39.7 | 38.3 |
| Gemini1.5-Pro | Link | 2024-02-15 | 35.7 | 35.9 | 35.5 |
| GPT-4o | Link | 2024-11-20 | 33.3 | 33.4 | 33.2 |
| Qwen2.5-72B | Link | 2024-09-19 | 25.4 | 25.1 | 25.7 |
| LLaVA-OneVision-7B | Link | 2024-08-06 | 23.7 | 23.8 | 23.5 |
| DeepSeek-VL2 | Link | 2024-12-13 | 23.1 | 21.8 | 24.4 |
| Qwen2.5-7B | Link | 2024-09-19 | 21.0 | 19.6 | 22.3 |
| Llama3.2V-11B-Instruct | Link | 2024-09-25 | 19.3 | 20.0 | 18.6 |
-M: multimodal questions for evaluating multimodal models; "zh" denotes the Chinese version.
Values are Top-1 accuracy (%).

R-Bench

Examples

[Figure: data overview]

Examples from RBench. These examples show that RBench is multidisciplinary, multimodal, and multilingual. As shown in the figure, the problems in RBench are complex and cannot be solved by quick, shallow thinking, indicating that RBench targets deep reasoning rather than knowledge recall, such as conceptual questions.

BibTeX

        @inproceedings{
          guo2025rbench,
          title={RBench: Graduate-level Multi-disciplinary Benchmarks for
            {LLM} \& {MLLM} Complex Reasoning Evaluation},
          author={Meng-Hao Guo and Jiajun Xu and Yi Zhang and Jiaxi Song and Haoyang Peng and
            Yi-Xuan Deng and Xinzhi Dong and Kiyohiro Nakayama and Zhengyang Geng and Chen Wang and
            Bolin Ni and Guo-Wei Yang and Yongming Rao and Houwen Peng and Han Hu and
            Gordon Wetzstein and Shi-Min Hu},
          booktitle={International Conference on Machine Learning (ICML)},
          year={2025},
          eprint={2505.02018},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2505.02018},
        }

Acknowledgement

We would like to thank MathVision and MathVista for this website template, which is adapted from Nerfies and licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.