Welcome to Griffon

This is the official repository for the Griffon series (v1, v2, and G). Griffon is the first high-resolution (over 1K) LVLM capable of performing fine-grained visual perception tasks, such as object detection and counting. In its latest version, Griffon integrates vision-language and vision-centric tasks within a unified end-to-end framework. You can interact with Griffon and request it to complete various tasks. The model is continuously evolving towards greater intelligence to handle more complex scenarios. Feel free to follow Griffon and reach out to us by raising an issue.


Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models (Latest)

📕Paper 🌀Usage 🤗Model(NEW)

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring

📕Paper 🌀Intro

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

📕Paper 🌀Usage 🤗Model

Release

  • 2025.01.15 🔥Release of the evaluation scripts supporting distributed inference.
  • 2024.11.26 🔥We are glad to release the inference code and model of Griffon-G in 🤗Griffon-G. Training code will be released later.
  • 2024.07.01 🔥Griffon has been accepted to ECCV 2024. Data is released in 🤗HuggingFace.
  • 2024.03.11 🔥We are excited to announce the arrival of Griffon v2. Griffon v2 brings fine-grained perception to new heights with high-resolution, expert-level detection and counting, and supports visual-language co-referring. Take a look at our demo first. The paper is available on 📕arXiv.
  • 2023.12.06 🔥Release of the Griffon v1 inference code and model in 🤗HuggingFace.
  • 2023.11.29 🔥The Griffon v1 paper has been released on 📕arXiv.

What can Griffon do now?

Griffon-G demonstrates advanced performance across multimodal benchmarks, general VQAs, and text-rich VQAs, achieving new state-of-the-art results in REC and object detection. More quantitative evaluation results can be found in our paper.

Get Started

1. Clone & Install

git clone [email protected]:jefferyZhan/Griffon.git
cd Griffon
pip install -e .

Tip: If you encounter errors while installing the packages, you can always download and install the corresponding prebuilt wheel files (*.whl), which we have verified.
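For example, a package that fails to build can usually be installed directly from a downloaded wheel with pip; the filename below is only a placeholder, so use the wheel matching your Python version and platform:

# placeholder filename: replace with the wheel of the package that failed to install
pip install ./package_name-1.0.0-cp310-cp310-linux_x86_64.whl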


2. Download the Griffon and CLIP models to the checkpoints folder.

Model | Link
Griffon-G-9B | 🤗HuggingFace
Griffon-G-27B | 🤗HuggingFace
clip-vit-large-patch14 | 🤗HuggingFace
clip-vit-large-patch14-336_to_1022 | 🤗HuggingFace
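For example, the checkpoints can be fetched with huggingface-cli. The Griffon-G repository ID below is an assumption; verify the exact names against the 🤗HuggingFace links above.

mkdir -p checkpoints
# Griffon-G repo ID is an assumption; check the link in the table above
huggingface-cli download JefferyZhan/Griffon-G-9B --local-dir checkpoints/Griffon-G-9B
# base CLIP vision encoder
huggingface-cli download openai/clip-vit-large-patch14 --local-dir checkpoints/clip-vit-large-patch14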

3. Inference

# 3.1 Modify the instruction in run_inference.sh.

# 3.2.1 Without a visual prompt
bash run_inference.sh [CUDA_ID] [CHECKPOINTS_PATH] [IMAGE_PATH]

# 3.2.2 With a visual prompt for counting: pass the query image and the prompt image separated by a comma, and include the <region> placeholder in the instruction
bash run_inference.sh [CUDA_ID] [CHECKPOINTS_PATH] [IMAGE_PATH,PROMPT_PATH]

Notice: Pay attention to singular vs. plural forms of the object names in the instruction.
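A minimal concrete invocation might look like the following; the GPU ID, checkpoint directory, and image paths are placeholders:

# without a visual prompt
bash run_inference.sh 0 checkpoints/Griffon-G-9B demo/example.jpg
# with a visual prompt for counting (query image, prompt image)
bash run_inference.sh 0 checkpoints/Griffon-G-9B demo/example.jpg,demo/region.jpg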


4. Evaluation

4.1 Multimodal Benchmark Evaluation

Please refer to LLaVA Evaluation or use VLMEvalKit.

4.2 COCO Detection Evaluation

# Single Node
torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr 127.0.0.1 --master_port 12457 -m griffon.eval.eval_detection --model-path PATH/TO/MODEL --image-folder PATH/TO/coco2017/val2017 --dataset PATH/TO/instances_val2017.json

# Multiple Nodes
## NODE 0
torchrun --nproc_per_node 8 --nnodes N --node_rank 0 --master_addr MASTER_ADDR --master_port MASTER_PORT -m griffon.eval.eval_detection --model-path PATH/TO/MODEL --image-folder PATH/TO/coco2017/val2017 --dataset PATH/TO/instances_val2017.json --init tcp://MASTER_ADDR:MASTER_PORT
## NODE K (K = 1 to N-1)
torchrun --nproc_per_node 8 --nnodes N --node_rank K --master_addr MASTER_ADDR --master_port MASTER_PORT -m griffon.eval.eval_detection --model-path PATH/TO/MODEL --image-folder PATH/TO/coco2017/val2017 --dataset PATH/TO/instances_val2017.json --init tcp://MASTER_ADDR:MASTER_PORT
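For instance, with two nodes the template above can be filled in as follows; the address, port, and paths are only examples:

## NODE 0 (example: MASTER_ADDR=10.0.0.1, MASTER_PORT=12457)
torchrun --nproc_per_node 8 --nnodes 2 --node_rank 0 --master_addr 10.0.0.1 --master_port 12457 -m griffon.eval.eval_detection --model-path checkpoints/Griffon-G-9B --image-folder data/coco2017/val2017 --dataset data/coco2017/annotations/instances_val2017.json --init tcp://10.0.0.1:12457
## NODE 1
torchrun --nproc_per_node 8 --nnodes 2 --node_rank 1 --master_addr 10.0.0.1 --master_port 12457 -m griffon.eval.eval_detection --model-path checkpoints/Griffon-G-9B --image-folder data/coco2017/val2017 --dataset data/coco2017/annotations/instances_val2017.json --init tcp://10.0.0.1:12457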

4.3 REC Evaluation

The processed RefCOCO annotation set can be downloaded from this link.

# Single Node
torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr 127.0.0.1 --master_port 12457 -m griffon.eval.eval_rec --model-path PATH/TO/MODEL --image-folder PATH/TO/COCO/train2014 --dataset PATH/TO/REF_COCO_ANN

# Multiple Nodes
## NODE 0
torchrun --nproc_per_node 8 --nnodes N --node_rank 0 --master_addr MASTER_ADDR --master_port MASTER_PORT -m griffon.eval.eval_rec --model-path PATH/TO/MODEL --image-folder PATH/TO/COCO/train2014 --dataset PATH/TO/REF_COCO_ANN --init tcp://MASTER_ADDR:MASTER_PORT
## NODE K (K = 1 to N-1)
torchrun --nproc_per_node 8 --nnodes N --node_rank K --master_addr MASTER_ADDR --master_port MASTER_PORT -m griffon.eval.eval_rec --model-path PATH/TO/MODEL --image-folder PATH/TO/COCO/train2014 --dataset PATH/TO/REF_COCO_ANN --init tcp://MASTER_ADDR:MASTER_PORT

Acknowledgement

  • LLaVA provides the base code and pre-trained models.
  • Shikra provides insight into how to organize datasets and some base processed annotations.
  • Llama provides the large language model.
  • Gemma2 provides the large language model.
  • volgachen provides the basic environment configuration.

Citation

If you find Griffon useful for your research and applications, please cite using this BibTeX:

@inproceedings{zhan2025griffonv1,
  title={Griffon: Spelling out all object locations at any granularity with large language models},
  author={Zhan, Yufei and Zhu, Yousong and Chen, Zhiyang and Yang, Fan and Tang, Ming and Wang, Jinqiao},
  booktitle={European Conference on Computer Vision},
  pages={405--422},
  year={2025},
  organization={Springer}
}

@misc{zhan2024griffonv2,
      title={Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring}, 
      author={Yufei Zhan and Yousong Zhu and Hongyin Zhao and Fan Yang and Ming Tang and Jinqiao Wang},
      year={2024},
      eprint={2403.09333},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@article{zhan2024griffon-G,
  title={Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models},
  author={Zhan, Yufei and Zhao, Hongyin and Zhu, Yousong and Yang, Fan and Tang, Ming and Wang, Jinqiao},
  journal={arXiv preprint arXiv:2410.16163},
  year={2024}
}

License

Code License Data License

The data and checkpoints are licensed for research use only. They are also restricted to uses that follow the license agreements of LLaVA, LLaMA, Gemma2, and GPT-4. The dataset is licensed under CC BY-NC 4.0 (allowing only non-commercial use), and models trained on the dataset must not be used outside of research purposes.
