Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models

This repository contains the official PyTorch implementation of the paper "Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models".

Updates

2024.05.29: Build project page
2024.05.29: Paper online
2024.05.28: Code release

Observation

Attention bias in LVLMs. Even when the image (V) does not contain information relevant to the query (Q), LVLMs exhibit a tendency for attention to be biased towards a few image tokens (i.e., blind tokens). This phenomenon is observed by averaging the attention weights across all layers when generating the first response token.
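
The averaging step described above can be sketched as follows. This is a minimal illustration in plain Python, not the repository's code: the function name, the list-based input layout, and the image-token offsets are all assumptions made for the example, and the per-layer weights are assumed to be already averaged over attention heads.

```python
from statistics import fmean


def layer_averaged_image_attention(attn_per_layer, image_start, image_end):
    """Average attention over layers, restricted to image-token positions.

    attn_per_layer: one list per layer, holding the attention weights from
        the first generated token to every input position (assumed to be
        head-averaged already).
    image_start, image_end: half-open range of image-token positions
        (hypothetical; real offsets depend on the prompt template).
    """
    num_positions = len(attn_per_layer[0])
    if any(len(layer) != num_positions for layer in attn_per_layer):
        raise ValueError("all layers must cover the same positions")
    # For each image position, take the mean of its weight across layers.
    return [
        fmean(layer[pos] for layer in attn_per_layer)
        for pos in range(image_start, image_end)
    ]


# Toy example: 2 layers, 5 input positions, image tokens at positions 1..3.
toy_attn = [
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.20, 0.40, 0.10, 0.20, 0.10],
]
image_attn = layer_averaged_image_attention(toy_attn, 1, 4)
```

In this toy input the first image token receives far more averaged attention than the other two, which is the kind of concentration the observation refers to.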

Motivation

Impact of blind/non-blind tokens on prediction logits. (Left) Zeroing out image tokens with attention weights higher than the mean + standard deviation, i.e., blind tokens, does not significantly affect the original prediction logits, suggesting that LVLMs may assign high attention weights to tokens that do not carry significant object-discriminative information. Conversely, zeroing out non-blind tokens drastically disrupts the logits, often leading to near 50:50 probabilities, indicating a loss of object-discriminative information. (Right) Similarly, examples demonstrate that zeroing out non-blind tokens results in a loss of discriminative power for previously well-classified instances or produces entirely incorrect predictions, causing a significant drop in performance.
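
The mean + standard deviation rule and the zeroing-out probe can be sketched as below. This is a hedged illustration rather than the authors' implementation: the helper names and the `num_std` parameter are hypothetical, the population standard deviation is an assumption (the text does not say which estimator is used), and token features are represented as plain lists instead of tensors.

```python
from statistics import fmean, pstdev


def split_blind_tokens(image_attn, num_std=1.0):
    """Return (blind, non_blind) index lists.

    A token counts as blind when its attention weight exceeds
    mean + num_std * std. Population std is an assumption here.
    """
    threshold = fmean(image_attn) + num_std * pstdev(image_attn)
    blind = [i for i, w in enumerate(image_attn) if w > threshold]
    non_blind = [i for i, w in enumerate(image_attn) if w <= threshold]
    return blind, non_blind


def zero_out(token_features, indices):
    """Return a copy of token_features with the selected rows set to zeros."""
    drop = set(indices)
    return [
        [0.0] * len(row) if i in drop else list(row)
        for i, row in enumerate(token_features)
    ]


# Toy example: one image token dominates the attention distribution.
image_attn = [0.02, 0.03, 0.70, 0.02, 0.03]
blind, non_blind = split_blind_tokens(image_attn)

# Hypothetical 2-dimensional features for the five image tokens.
features = [[1.0, 2.0] for _ in range(5)]
without_blind = zero_out(features, blind)
without_non_blind = zero_out(features, non_blind)
```

Feeding `without_blind` and `without_non_blind` to the model in place of the original image tokens, and comparing the resulting logits, corresponds to the two conditions contrasted in the paragraph above.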

Method: AvisC

Setup

conda create -n AvisC python=3.10
conda activate AvisC
git clone https://github.com/sangminwoo/AvisC.git
cd AvisC
pip install -r requirements.txt

Models

Preparing model checkpoints

LLaVA-1.5: Download LLaVA-1.5 merged 7B
InstructBLIP: Download InstructBLIP

Evaluation

POPE: bash eval_bench/scripts/pope_eval.sh
- Requires "model" and "model_path" to be specified
MME: bash experiments/cd_scripts/mme_eval.sh
- Requires "model" and "model_path" to be specified
AMBER: bash experiments/cd_scripts/amber_eval.sh
- Requires "model" and "model_path" to be specified

Preparing datasets

Please download and extract the MSCOCO 2014 dataset from this link to your data path for evaluation.
For MME evaluation, see this link.
For AMBER evaluation, see this link.

Results

POPE

MME

MME-Fullset

MME-Hallucination

AMBER

LLaVA-Bench Examples

Acknowledgments

This codebase borrows most notably from VCD, OPERA, and LLaVA. Many thanks to the authors for generously sharing their code!

Citation

If you find this repository helpful for your project, please consider citing our work :)

@article{woo2024dont,
  title={Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models}, 
  author={Woo, Sangmin and Kim, Donguk and Jang, Jaehyuk and Choi, Yubin and Kim, Changick},
  journal={arXiv preprint arXiv:2405.17820},
  year={2024},
}
