This repository contains the code and data for the paper "VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information".
VisOnlyQA is designed to evaluate the visual perception capability of large vision language models (LVLMs) on geometric information in scientific figures. The evaluation set includes 1,200 multiple-choice questions in 12 visual perception tasks on 4 categories of scientific figures. We also provide a training dataset consisting of 70k instances.
- Datasets:
- VisOnlyQA is available at VLMEvalKit
- VisOnlyQA in VLMEvalKit is different from the original one. Refer to this section for details.
- Hugging Face
- Code: https://github.com/psunlpgroup/VisOnlyQA
@misc{kamoi2024visonlyqa,
title={VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information},
author={Ryo Kamoi and Yusen Zhang and Sarkar Snigdha Sarathi Das and Ranran Haoran Zhang and Rui Zhang},
year={2024},
journal={arXiv preprint arXiv:2412.00947}
}
VisOnlyQA is provided in two formats: VLMEvalKit and Hugging Face Dataset. You can use either of them to evaluate your models and report the results in your papers. However, when you report the results, please explicitly mention which version of the dataset you used because the two versions are different.
VLMEvalKit provides one-command evaluation. However, VLMEvalKit is not designed to reproduce the results in the paper. You are welcome to use it to report results on VisOnlyQA in your papers, but please explicitly mention that you used VLMEvalKit.
The major differences are:
- VisOnlyQA on VLMEvalKit does not include the chemistry__shape_multi split.
- VLMEvalKit uses different prompts and postprocessing.
Refer to this document for the installation and setup of VLMEvalKit. After setting up the environment, you can evaluate any supported model on VisOnlyQA with the following command (this example is for InternVL2-26B).
python run.py --data VisOnlyQA-VLMEvalKit --model InternVL2-26B
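To evaluate several models in a row, the command above can be generated per model. The following is a minimal sketch, not part of VLMEvalKit: build_command is a hypothetical helper, and every entry you add to the models list is assumed to be a model name that VLMEvalKit supports. The sketch only prints the commands; run them from the VLMEvalKit directory.

```python
import shlex


def build_command(model: str, data: str = "VisOnlyQA-VLMEvalKit") -> list:
    """Build the VLMEvalKit command shown above for one model (hypothetical helper)."""
    return ["python", "run.py", "--data", data, "--model", model]


# Extend this list with other model names supported by VLMEvalKit.
models = ["InternVL2-26B"]

for model in models:
    # shlex.join quotes arguments safely so the line can be pasted into a shell.
    print(shlex.join(build_command(model)))
```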
The original VisOnlyQA dataset is provided as Hugging Face datasets. If you want to reproduce the results in our paper, please use this version and the code in the GitHub repository.
- Eval-Real: https://huggingface.co/datasets/ryokamoi/VisOnlyQA_Eval_Real
- 500 instances for questions on figures in existing datasets (e.g., MathVista, MMMU, and CharXiv)
- Eval-Synthetic: https://huggingface.co/datasets/ryokamoi/VisOnlyQA_Eval_Synthetic
- 700 instances for questions on synthetic figures
- Train: https://huggingface.co/datasets/ryokamoi/VisOnlyQA_Train
- 70,000 instances for training (synthetic figures)
The dataset folder of the GitHub repository includes identical datasets, except for the training data.
from datasets import load_dataset
real_eval = load_dataset("ryokamoi/VisOnlyQA_Eval_Real")
real_synthetic = load_dataset("ryokamoi/VisOnlyQA_Eval_Synthetic")
# Splits
print(real_eval.keys())
# dict_keys(['geometry__triangle', 'geometry__quadrilateral', 'geometry__length', 'geometry__angle', 'geometry__area', 'geometry__diameter_radius', 'chemistry__shape_single', 'chemistry__shape_multi', 'charts__extraction', 'charts__intersection'])
print(real_synthetic.keys())
# dict_keys(['syntheticgeometry__triangle', 'syntheticgeometry__quadrilateral', 'syntheticgeometry__length', 'syntheticgeometry__angle', 'syntheticgeometry__area', '3d__size', '3d__angle'])
# Prompt
print(real_eval['geometry__triangle'][0]['prompt_no_reasoning'])
# There is no triangle ADP in the figure. True or False?
# A triangle is a polygon with three edges and three vertices, which are explicitly connected in the figure.
# Your response should only include the final answer (True, False). Do not include any reasoning or explanation in your response.
# Image
print(real_eval['geometry__triangle'][0]['decoded_image'])
# <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=103x165 at 0x7FB4F83236A0>
# Answer
print(real_eval['geometry__triangle'][0]['answer'])
# False
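The fields shown above are enough to score a model on one split. The following is a minimal sketch under stated assumptions, not the evaluation code used in the paper: predict is a hypothetical callable that takes an image and a prompt and returns the model's text response, and normalize applies a simple exact-match normalization that is an assumption of this example. The toy instances only mimic the dataset's field names so that the example runs without downloading anything.

```python
from typing import Callable, Iterable, Mapping


def normalize(text: str) -> str:
    """Lowercase and strip whitespace and a trailing period (assumed normalization)."""
    return text.strip().rstrip(".").strip().lower()


def evaluate_split(
    instances: Iterable[Mapping],
    predict: Callable[[object, str], str],
    prompt_key: str = "prompt_no_reasoning",
) -> float:
    """Return the exact-match accuracy of `predict` over `instances`."""
    correct = 0
    total = 0
    for instance in instances:
        response = predict(instance["decoded_image"], instance[prompt_key])
        correct += int(normalize(response) == normalize(instance["answer"]))
        total += 1
    return correct / total if total else 0.0


# Toy instances with the same field names as the dataset (images omitted).
toy_split = [
    {"decoded_image": None, "prompt_no_reasoning": "Q1", "answer": "False"},
    {"decoded_image": None, "prompt_no_reasoning": "Q2", "answer": "True"},
]


def always_false(image, prompt):
    """Hypothetical stand-in for a real LVLM call."""
    return "False."


print(evaluate_split(toy_split, always_false))  # 0.5
```

On real data, pass a split such as real_eval['geometry__triangle'] as instances and replace always_false with a function that calls your model. To reproduce the numbers in the paper, use the evaluation scripts in the GitHub repository instead, since their postprocessing may differ from this sketch.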
Each instance of the VisOnlyQA dataset has the following attributes:
- decoded_image: [PIL.Image] Input image
- question: [string] Question (without instruction)
- prompt_reasoning: [string] Prompt with instruction to use chain-of-thought
- prompt_no_reasoning: [string] Prompt with instruction not to use chain-of-thought
- answer: [string] Correct answer (e.g., True, a)
- image_path: [string] Path to the image file
- image_category: [string] Category of the image (e.g., geometry, chemistry)
- question_type: [string] single_answer or multiple answers
- task_category: [string] Category of the task (e.g., triangle)
- response_options: [List[string]] Multiple choice options (e.g., ['True', 'False'], ['a', 'b', 'c', 'd', 'e'])
- source: [string] Source dataset
- id: [string] Unique ID
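The metadata attributes make it easy to break results down by figure or task category. The sketch below assumes you have already stored each model output next to the instance attributes under a hypothetical prediction key; that key, the helper accuracy_by_group, and the toy values are illustrative and not part of the dataset.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def accuracy_by_group(records: Iterable[Mapping], key: str) -> dict:
    """Group scored records by `record[key]` and return the accuracy per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for record in records:
        group = record[key]
        counts[group][0] += int(record["prediction"] == record["answer"])
        counts[group][1] += 1
    return {group: correct / total for group, (correct, total) in counts.items()}


# Toy records: dataset attributes plus a hypothetical "prediction" field.
records = [
    {"image_category": "geometry", "task_category": "triangle", "answer": "True", "prediction": "True"},
    {"image_category": "geometry", "task_category": "angle", "answer": "a", "prediction": "b"},
    {"image_category": "chemistry", "task_category": "shape_single", "answer": "c", "prediction": "c"},
]

print(accuracy_by_group(records, "image_category"))
# {'geometry': 0.5, 'chemistry': 1.0}
```

Passing "task_category" or "source" as the key gives the corresponding breakdown.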
Core directories and files of this repository:
.
├── dataset  # VisOnlyQA dataset (Eval-Real, Eval-Synthetic)
├── results
│   ├── model_resposnes  # Responses from LVLMs on VisOnlyQA
│   ├── evaluation_metrics  # Accuracy
│   ├── tables  # Tables in the paper
│   ├── figures  # Figures in the paper
│   └── analysis  # Analysis of the results
├── setup
│   └── setup.sh  # Run this script to set up the environment
├── shell  # Shell scripts for reproducing our experiments
├── src  # Source code
├── config  # Main config file is in src/config.py
└── finetuning_results  # Log files of the fine-tuning experiments
bash setup.sh
We ran our experiments in the following environment. You might need to modify the configurations if you run our code in a different environment.
- Eight NVIDIA A100 SXM4 80GB GPUs
- Driver Version: 550.54.15
- CUDA Version: 12.4
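To compare your machine with the environment listed above, you can query the GPUs and driver version. This is a minimal sketch that is not part of the repository: it assumes the nvidia-smi command-line tool is on your PATH and returns an empty list otherwise. The helper names parse_gpu_query and query_gpus are hypothetical.

```python
import shutil
import subprocess


def parse_gpu_query(output: str) -> list:
    """Parse `name, driver_version` CSV rows printed by nvidia-smi."""
    gpus = []
    for line in output.strip().splitlines():
        if "," not in line:
            continue  # skip blank or unexpected lines
        name, driver = [field.strip() for field in line.split(",", 1)]
        gpus.append({"name": name, "driver_version": driver})
    return gpus


def query_gpus() -> list:
    """Return GPU info from nvidia-smi, or an empty list if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_query(result.stdout) if result.returncode == 0 else []


gpus = query_gpus()
print(f"Found {len(gpus)} GPU(s)")
for gpu in gpus:
    print(gpu["name"], gpu["driver_version"])
```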
Please refer to the shell scripts in the shell/4_evaluation folder.
# for small open LVLMs
bash shell/4_evaluation/evaluation_open_small.sh
We fine-tuned the following LVLMs on VisOnlyQA-Train.
Our fine-tuning code is based on the code provided by the authors of the models. Please refer to the shell scripts in the shell/3_training folder for details.
bash shell/3_training/train_internvl2_4B.sh
Datasets are provided in the dataset folder and at Hugging Face Datasets. You do not need to run the dataset creation code to use the datasets.
If you are interested in reproducing the dataset creation process, follow the instructions below.
If you are interested in reproducing the annotation interface: We use Google Spreadsheet for annotation. You need to set up Google API Credentials.
- Follow the instructions at https://pythonhosted.org/PyDrive/quickstart.html#authentication.
- Follow the instructions at https://docs.gspread.org/en/latest/oauth2.html.
- Download the credential file and save it as credentials/google_spreadsheet_credential.json.
- Put your Google Account (email address) in credentials/google_sccount_email.txt.
conda activate visonlyqa
export HF_ACCOUNT="your_hugging_face_account" # dataset will be created in your HF account as private datasets
export CONDA_SH="$HOME/anaconda3/etc/profile.d/conda.sh" # set your anaconda path ($HOME is used because ~ is not expanded inside double quotes)
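Before launching the dataset creation scripts, it can help to confirm that both variables are set. The following is a minimal sketch, not part of the repository: check_environment is a hypothetical helper that reports problems instead of raising. It deliberately does not expand a leading ~ in CONDA_SH, because a shell does not expand a tilde that is stored inside a quoted variable either.

```python
import os
from pathlib import Path
from typing import Mapping


def check_environment(env: Mapping[str, str]) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not env.get("HF_ACCOUNT"):
        problems.append("HF_ACCOUNT is not set")
    conda_sh = env.get("CONDA_SH", "")
    if not conda_sh:
        problems.append("CONDA_SH is not set")
    elif not Path(conda_sh).is_file():
        problems.append(f"CONDA_SH does not point to a file: {conda_sh}")
    return problems


for problem in check_environment(os.environ):
    print("WARNING:", problem)
```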
Refer to the shell files in shell/1_train_dataset_creation and shell/2_evaluation_dataset_creation.
Please refer to LICENSE.md.
If you have any questions, feel free to open an issue or reach out directly to Ryo Kamoi.