Multimodal Large Language Models (MLLMs) rely on powerful LLMs to perform multimodal tasks, and recent studies have shown impressive emergent abilities, such as writing poems based on an image. However, such case studies can hardly reflect the full performance of an MLLM, and a comprehensive evaluation has been lacking. In this paper, we fill this gap by presenting MME, the first comprehensive MLLM evaluation benchmark. It measures both perception and cognition abilities across a total of 14 subtasks. To avoid the data leakage that may arise from directly using public datasets for evaluation, all instruction-answer pairs are manually designed. The concise instruction design allows us to compare MLLMs fairly, instead of struggling with prompt engineering, and also makes quantitative statistics straightforward. A total of 50+ advanced MLLMs are comprehensively evaluated on MME, which not only shows that existing MLLMs still have large room for improvement, but also reveals promising directions for subsequent model optimization.
🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Paper | Download | Eval Tool | ✒️ Citation
A representative evaluation benchmark for MLLMs. ✨
🔥🔥🔥 MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
🔥🔥🔥 Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
[2024.06.03] We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark of MLLMs in Video Analysis! 🌟
It applies to both image MLLMs, i.e., generalizing to multiple images, and video MLLMs. Our leaderboard involves SOTA models such as Gemini 1.5 Pro, GPT-4o, GPT-4V, LLaVA-NeXT-Video, InternVL-Chat-V1.5, and Qwen-VL-Max. 🌟
It includes short-term (< 2 min), medium-term (4 min~15 min), and long-term (30 min~60 min) videos, ranging from 11 seconds to 1 hour. ✨
All data are newly collected and annotated by humans, not from any existing video dataset. ✨
🔥🔥🔥 MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
🔥🔥🔥 A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise
Paper
The first technical report for Gemini vs GPT-4V. A total of 128 pages. Completed within one week of the Gemini API's release. 🌟
- [06-06] Thanks to CMRI, JT-VL-Chat-V1.0 is added to MME.
- [05-27] Thanks to Junbo Cui, MiniCPM-Llama3-V 2.5 joins MME.
- [05-18] Thanks to Chunyu Xie, 360VL is incorporated into MME.
- [04-27] Thanks to Zhe Chen, we welcome a new member InternVL-Chat-V1.5.
- [04-15] Thanks to Junbo Cui, MiniCPM-V-2 is added to MME.
- [04-10] Thanks to Wenqiao Zhang, HyperLLaVA joins our leaderboards.
- [03-14] Thanks to Muyang He, Bunny-3B takes part in MME.
- [02-23] Thanks to Jingyu Liu, ChatTruth-7B is added to MME.
- [02-07] Thanks to TsinghuaNLP, MiniCPM and OmniLMM are incorporated into our leaderboards.
- [02-05] Thanks to Haotian Liu, LLaVA-1.6 is added to MME.
- [02-05] Thanks to Bin Lin, MoE-LLaVA joins MME.
- [02-05] Thanks to Weihan Wang and Wenyi Hong, CogVLM and CogAgent take part in MME.
- [01-25] Thanks to Shijie Wang, we welcome a new member Qwen-VL-Max.
- [01-22] Thanks to Xiaoyi Dong, InternLM-XComposer2-VL joins our leaderboards.
2023
[2023-12]
- [12-31] Thanks to Dian Li, PureMM takes part in our leaderboards (updated on 2024-01-14 and 2024-01-21).
- [12-31] Thanks to Yilin Ma and Min Xu, RBDash is added to MME.
- [12-18] Thanks to Zihan Wang, our leaderboards usher in Gemini Pro.
- [12-18] Thanks to Jinze Bai, a new model, Qwen-VL-Plus, is added to MME.
- [12-18] Thanks to Junbum Cha, Honeybee joins our leaderboards.
- [12-12] Thanks to Yuliang Liu, Monkey-Chat takes part in MME.
- [12-12] Thanks to Junkun Yuan, we welcome a new member AGILMM.
- [12-01] Thanks to Cheng Wen, BELLE-VL is added to our leaderboards.
- [12-01] Thanks to PCI Research, TransCore-M joins MME.
[2023-11]
- [11-24] Thanks to Xiaoyi Dong, we add ShareGPT4V to our leaderboards.
- [11-24] Thanks to Muyang He, DataOptim joins MME.
- [11-24] Thanks to Zifei Shan, Kanva is added.
- [11-21] Thanks to Junke Wang, LVIS-INSTRUCT4V is added to MME.
- [11-18] Thanks to Zhenbo Luo, our leaderboards welcome a new member CVLM.
- [11-10] Thanks to Qinghao Ye, we add a new model, mPLUG-Owl2, to our leaderboards.
- [11-10] Thanks to Zhibin Wang, InfMLLM joins our leaderboards (updated on 2023-12-12).
[2023-10]
- [10-29] Thanks to Jiaming Han, SPHINX is added to our leaderboards.
- [10-23] Thanks to Zihan Wang, who manually evaluated the performance of GPT-4V on our benchmark. Note that GPT-4V refuses to answer questions involving individuals, resulting in a zero score on the Celebrity subtask.
- [10-13] Thanks to Yizhou Zhou, WeMM joins our leaderboards (results updated on 2023-11-10 with a newer model).
- [10-13] Thanks to Junbo Cui, we add Muffin to our leaderboards.
- [10-13] Thanks to Jiaming Han, the results of LLaMA-Adapter V2 have been updated.
- [10-04] Thanks to Haotian Liu, the results of LLaVA have been updated.
[2023-09]
- [09-28] Thanks to Huasong Zhong, Lion is added.
- [09-27] Thanks to Xiaoyi Dong, InternLM-XComposer-VL joins our leaderboards.
- [09-05] Thanks to Jinze Bai, our leaderboards usher in Qwen-VL-Chat.
- [09-01] Thanks to Skywork Multi-Modal Group, Skywork-MM takes part in our leaderboards.
[2023-08]
- [08-28] Thanks to UCSD MLPC, we welcome BLIVA to join our leaderboards.
- [08-28] Thanks to Jianfeng Wang, GIT2 is added to our leaderboards.
- [08-28] Thanks to Yike Yuan and Songyang Zhang, the results of MiniGPT4 have been revised.
- [08-21] Thanks to Haozhe Zhao, MMICL joins our leaderboards (results updated on 2023-09-17 with an upgraded checkpoint).
- [08-13] Thanks to Zhejiang University DCD Lab, our leaderboards incorporate a new member Cheetor.
- [08-08] Thanks to Fuxiao Liu, we add LRV-Instruction to our leaderboards.
[2023-07]
- [07-28] Thanks to Yingzi Ma, his work Octopus has been added to our leaderboards.
- [07-15] Thanks to Jiani Zheng, our leaderboards welcome a new member Lynx.
- [07-12] Thanks to Ao Zhang, his work VPGTrans has been added to our leaderboards.
- [07-09] Thanks to Bo Li, we have updated the evaluation of his work Otter. It uses the latest model OTTER-Image-MPT7B, which incorporates OpenFlamingo v2 and enhances instruction-following ability.
[2023-06]
- [06-30] Thanks to Renrui Zhang, we have updated the evaluation of his two works, LLaMA-Adapter V2 and ImageBind_LLM: the former is re-evaluated with updated model weights, and the latter is a newly added MLLM.
- [06-30] Thanks to Gen Luo, we have added the evaluation of his work LaVIN.
- [06-30] The results of other models have also been updated, now retrieving the answer from the beginning of each generated response instead of from the whole response (a minimal sketch of this rule follows below). An automated evaluation script for score calculation has been released!
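For illustration, here is a minimal sketch of that extraction rule, assuming MME's yes/no instruction format; the function name is hypothetical and this is not the released script:

```python
from typing import Optional

def extract_answer(response: str) -> Optional[str]:
    """Return 'yes' or 'no' only if the response begins with one, else None."""
    head = response.strip().lower()
    for answer in ("yes", "no"):
        # Accept the bare word, or the word followed by a space/comma/period.
        if head == answer or head[:len(answer) + 1] in (answer + " ", answer + ",", answer + "."):
            return answer
    return None  # an unparsable response simply counts as a wrong answer

assert extract_answer("Yes, there is a dog in the image.") == "yes"
assert extract_answer("The answer is yes.") is None  # only the beginning is checked
```

Checking only the beginning of the response avoids crediting a model that hedges first and only mentions the correct word later in its explanation.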
Results of Available Models [Unavailable Version]
Leaderboards of Available Models [Unavailable Version]
Sum of the scores of all perception subtasks: existence, count, position, color, poster, celebrity, scene, landmark, artwork, and OCR. The full score of each subtask is 200, so the full perception score is 2000.
Sum of the scores of all cognition subtasks: commonsense reasoning, numerical calculation, text translation, and code reasoning. The full score of each subtask is 200, so the full cognition score is 800.
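For concreteness, below is a minimal sketch of how a single subtask score out of 200 can be computed. It assumes the accuracy/accuracy+ scheme described in the MME paper, where each image carries two yes/no instructions: accuracy is computed per question, accuracy+ per image (both questions must be correct), and the subtask score is their sum in percentage points. All names here are illustrative, not from the released evaluation script.

```python
from typing import List, Tuple

def subtask_score(results: List[Tuple[bool, bool]]) -> float:
    """results holds one (q1_correct, q2_correct) pair per image.

    Returns accuracy (%) + accuracy+ (%), a score in [0, 200]:
    accuracy is over individual questions, accuracy+ over images
    whose two questions are both answered correctly.
    """
    n_images = len(results)
    n_correct = sum(q1 + q2 for q1, q2 in results)      # correct single questions
    n_both = sum(1 for q1, q2 in results if q1 and q2)  # fully correct images
    accuracy = 100.0 * n_correct / (2 * n_images)
    accuracy_plus = 100.0 * n_both / n_images
    return accuracy + accuracy_plus

# Hypothetical results for one subtask covering three images:
demo = [(True, True), (True, False), (False, False)]
print(round(subtask_score(demo), 1))  # 50.0 accuracy + 33.3 accuracy+ = 83.3
```

Summing the ten perception subtask scores then gives the perception total (at most 2000), and summing the four cognition subtask scores gives the cognition total (at most 800).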