🤗 Hugging Face • ⭕️ WiseModel • 🌐 PKU-KCL • 🤖 Demo
Our work is based on the following papers:
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model (CVPR 2024) [Paper] [Code]
Chaoya Jiang, Haiyang Xu, Mengfan Dong, Jiaxing Chen, Wei Ye*, Ming Yan, Qinghao Ye, Ji Zhang, Fei Huang, Shikun Zhang. (*Corresponding Author)
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models. (Under review) [Paper] [Code]
Chaoya Jiang, Wei Ye*, Mengfan Dong, Hongrui Jia, Haiyang Xu, Ming Yan, Ji Zhang, Shikun Zhang. (*Corresponding Author)
Visual Instruction Tuning (NeurIPS 2023, Oral) [Paper]
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)
- [3/2] 🔥 We will release the pretraining and finetuning datasets.
- [2/27] 🔥 We have released the model weights of Shell-V.
- [2/27] 🔥 Our paper "Hallucination Augmented Contrastive Learning for Multimodal Large Language Model" is accepted by CVPR 2024.
- [2/27] 🔥 We have released the training and finetuning code of Shell-V.
Our model, Shell-V, is built on the architecture of LLaVA1.5 and the Shell large language model. It adds targeted enhancements in representation learning (Hallucination Augmented Contrastive Learning) and self-instruction finetuning of LVLMs to mitigate multimodal hallucination (see the papers above for details). Our experiments show that Shell-V effectively mitigates hallucination: it achieves state-of-the-art performance on multiple multimodal hallucination evaluation benchmarks (such as MMhal-Eval, Hal-Eval, and POPE).
- Clone this repository and navigate to the shell-v folder
git clone https://github.com/WisdomShell/shell-v.git
cd shell-v
- Install Package
conda create -n shell-v python=3.10 -y
conda activate shell-v
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training cases
pip install ninja
pip install flash-attn --no-build-isolation
- Upgrade to the latest code base
git pull
pip uninstall transformers
pip install -e .
Please check out our Model Zoo for all public checkpoints and instructions on how to use the weights.
To run our demo, you need to prepare shell-v checkpoints locally.
To launch a Gradio demo locally, please run the following commands one by one. If you plan to launch multiple model workers to compare between different checkpoints, you only need to launch the controller and the web server ONCE.
python -m shell_v.serve.controller --host 0.0.0.0 --port 10000
python -m shell_v.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
You just launched the Gradio web interface. Now, you can open the web interface with the URL printed on the screen. You may notice that there is no model in the model list. Do not worry, as we have not launched any model worker yet. It will be automatically updated when you launch a model worker.
This is the actual worker that performs the inference on the GPU. Each worker is responsible for a single model specified in --model-path.
python -m shell_v.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path shell_v-7b
Wait until the process finishes loading the model and you see "Uvicorn running on ...". Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.
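If you start the worker from a script rather than watching its log, a small helper can poll the worker port instead. The sketch below is only an illustration: wait_for_port is a hypothetical name, not part of shell_v, and it assumes the port starts accepting connections only after the model has finished loading, which the log order above suggests but which we have not verified.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 600.0,
                  interval_s: float = 2.0) -> bool:
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection succeeds within timeout_s seconds.
    This is a hypothetical helper, not part of shell_v.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry.
            time.sleep(interval_s)
    return False
```

For example, wait_for_port("localhost", 40000) would block until the worker launched by the command above is listening, or return False after ten minutes.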
You can launch as many workers as you want and compare different model checkpoints in the same Gradio interface. Please keep the --controller the same, and change the --port and --worker to a different port number for each worker.
python -m shell_v.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port <different from 40000, say 40001> --worker http://localhost:<change accordingly, i.e. 40001> --model-path <ckpt2>
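When comparing several checkpoints, the per-worker commands differ only in the port and the model path, so they can be generated. The following is a minimal sketch that uses only the flags shown above; worker_commands is a hypothetical helper name and /path/to/ckpt2 is a placeholder path.

```python
import shlex


def worker_commands(model_paths, controller="http://localhost:10000",
                    first_port=40000):
    """Build one model_worker command per checkpoint, each on its own port.

    Hypothetical helper: it only formats the flags shown in this README.
    """
    commands = []
    for offset, model_path in enumerate(model_paths):
        port = first_port + offset  # every worker needs a distinct port
        args = [
            "python", "-m", "shell_v.serve.model_worker",
            "--host", "0.0.0.0",
            "--controller", controller,  # same controller for all workers
            "--port", str(port),
            "--worker", f"http://localhost:{port}",
            "--model-path", model_path,
        ]
        commands.append(shlex.join(args))
    return commands


# Print the commands so they can be run in separate terminals.
for command in worker_commands(["shell_v-7b", "/path/to/ckpt2"]):
    print(command)
```

The commands are printed rather than executed so that each worker can still be started and monitored in its own terminal.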
If you are using an Apple device with an M1 or M2 chip, you can specify the mps device with the --device flag: --device mps.
If your GPU has 24GB of VRAM or less (e.g., RTX 3090, RTX 4090), you may try running with multiple GPUs. Our latest code base automatically tries to use multiple GPUs if more than one is available. You can specify which GPUs to use with CUDA_VISIBLE_DEVICES. Below is an example of running with the first two GPUs.
CUDA_VISIBLE_DEVICES=0,1 python -m shell_v.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path shell_v-v1.5-13b
You can launch the model worker with quantized weights (4-bit or 8-bit), which reduces the GPU memory footprint of inference and may allow you to run on a GPU with as little as 12GB of VRAM. Note that quantized inference may not be as accurate as the full-precision model. Simply append --load-4bit or --load-8bit to the model worker command that you are executing. Below is an example of running with 4-bit quantization.
python -m shell_v.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path shell_v-v1.5-13b --load-4bit
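To get a feel for why quantization helps, the memory taken by the language model weights alone can be estimated as parameters times bits per parameter. The sketch below is a back-of-the-envelope calculation, not a measurement: it ignores the vision tower, activations, the KV cache, and quantization overhead, so real usage will be higher. The function name weight_memory_gib is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB.

    Rough lower bound: excludes activations, KV cache, the vision tower,
    and any per-block overhead added by the quantization scheme.
    """
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# Compare 16-bit, 8-bit, and 4-bit storage for a 13B-parameter model.
for bits in (16, 8, 4):
    estimate = weight_memory_gib(13e9, bits)
    print(f"13B parameters at {bits}-bit: about {estimate:.1f} GiB of weights")
```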
You can launch the model worker with LoRA weights, without merging them with the base checkpoint, to save disk space. There will be additional loading time, while the inference speed is the same as the merged checkpoints. Unmerged LoRA checkpoints do not have lora-merge in the model name, and are usually much smaller (less than 1GB) than the merged checkpoints (13G for 7B, and 25G for 13B).
To load unmerged LoRA weights, you simply need to pass an additional argument, --model-base, which is the base LLM that was used to train the LoRA weights. You can check the base LLM of each set of LoRA weights in the model zoo.
python -m shell_v.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path shell_v-v1-0719-336px-lora-vicuna-13b-v1.3 --model-base lmsys/vicuna-13b-v1.3
Chat about images using shell_v without the Gradio interface. The CLI also supports multiple GPUs and 4-bit and 8-bit quantized inference. With 4-bit quantization, our shell_v-1.5-7B uses less than 8GB of VRAM on a single GPU.
python -m shell_v.serve.cli \
--model-path shell_v-v1.5-7b \
--image-file "https://shell_v-vl.github.io/static/images/view.jpg" \
--load-4bit
shell_v is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the per_device_train_batch_size and increase the gradient_accumulation_steps accordingly. Always keep the global batch size the same: per_device_train_batch_size x gradient_accumulation_steps x num_gpus.
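The rule above can be turned into a quick calculation. The sketch below solves the global batch size formula for gradient_accumulation_steps; the function itself is a hypothetical helper for planning a run and is not part of the training code.

```python
def gradient_accumulation_steps(global_batch_size: int,
                                per_device_train_batch_size: int,
                                num_gpus: int) -> int:
    """Solve global = per_device x accumulation x num_gpus for accumulation."""
    samples_per_step = per_device_train_batch_size * num_gpus
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not a multiple of "
            f"{per_device_train_batch_size} x {num_gpus}"
        )
    return global_batch_size // samples_per_step


# A global batch size of 128 with 16 samples per device:
print(gradient_accumulation_steps(128, 16, num_gpus=8))  # 8 GPUs
print(gradient_accumulation_steps(128, 16, num_gpus=1))  # single GPU
```

If the division is not exact, the helper raises an error instead of rounding, because rounding would silently change the global batch size.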
We use a similar set of hyperparameters as Vicuna in finetuning. The hyperparameters used in both pretraining and finetuning are provided below.
- Pretraining
Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
---|---|---|---|---|---|
shell_v-7B | 256 | 1e-3 | 1 | 2048 | 0 |
- Finetuning
Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
---|---|---|---|---|---|
shell_v-7B | 128 | 2e-5 | 1 | 2048 | 0 |
Before you start, prepare our base model Shell-chat, which is an instruction-tuned chatbot. Please download its weights here.
Please download the subset of the CC3M dataset we use in the paper here.
Pretraining takes around 4 hours for shell_v-13B on 8x A100 (80G). It takes around 2 hours for 7B checkpoints.
We recommend training with DeepSpeed as it can save a lot of GPU RAM. We provide a training script with DeepSpeed here.
You may run this with a single A100 GPU with the following code. Please note that per_device_train_batch_size * gradient_accumulation_steps should be equal to 128 to keep the global batch size the same.
Pretrain: shell_v-13B, 1x A100 (80G). Time: ~33 hours.
python shell_v/train/train_mem.py \
--model_name_or_path ./checkpoints/vicuna-13b \
--version [v0 or v1] \
--data_path /path/to/cc3m_595k.json \
--image_folder /path/to/cc3m_595k_images \
--vision_tower openai/clip-vit-large-patch14 \
--tune_mm_mlp_adapter True \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ./checkpoints/shell_v-13b-pretrain \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2400 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to wandb
- Prepare data
Please download the annotation of our instruction tuning data shell_v_instruct_890k.json, and download the images.
- Start training!
You may download our pretrained projectors in Model Zoo. It is not recommended to use legacy projectors, as they may be trained with a different version of the codebase, and if any option is off, the model will not function/train as we expected.
When we initially released our paper, we used a full 3-epoch schedule on the shell_v-Instruct-158K dataset. The scripts are provided here.
In our later exploration, we introduced shell_v-Lightning, as we found that a much faster 1-epoch schedule on shell_v-Instruct-80K can achieve fast convergence and good performance. With shell_v Lightning, we were able to train, validate, and release a preview of the shell_v-LLaMA-2 checkpoints on the same day as the LLaMA-2 release. If you are interested in learning more about shell_v Lightning, please continue to the following section.
Our GPT-assisted evaluation pipeline for multimodal modeling is provided for a comprehensive understanding of the capabilities of vision-language models. Please see our paper for more details.
- Generate shell_v responses
python model_vqa.py \
--model-path ./checkpoints/shell_v-13B-v0 \
--question-file \
playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
--image-folder \
/path/to/coco2014_val \
--answers-file \
/path/to/answer-file-our.jsonl
- Evaluate the generated responses. In our case, answer-file-ref.jsonl is the response generated by text-only GPT-4 (0314), with the context captions/boxes provided.
OPENAI_API_KEY="sk-***********************************" python shell_v/eval/eval_gpt_review_visual.py \
--question playground/data/coco2014_val_qa_eval/qa90_questions.jsonl \
--context shell_v/eval/table/caps_boxes_coco2014_val_80.jsonl \
--answer-list \
/path/to/answer-file-ref.jsonl \
/path/to/answer-file-our.jsonl \
--rule shell_v/eval/table/rule.json \
--output /path/to/review.json
- Summarize the evaluation results
python summarize_gpt_review.py
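Before spending API calls on the review step, it may be worth checking that the question file and the answer files are well formed and line up. The sketch below assumes that the review script pairs questions and answers by position, which we have not verified here; load_jsonl and check_same_length are hypothetical helpers, and no particular field names are assumed.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Parse a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as error:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({error})"
                ) from error
    return records


def check_same_length(questions_path, *answer_paths):
    """Raise ValueError unless each answer file has one record per question.

    Returns the number of questions when every file matches.
    """
    expected = len(load_jsonl(questions_path))
    for answer_path in answer_paths:
        found = len(load_jsonl(answer_path))
        if found != expected:
            raise ValueError(
                f"{answer_path}: {found} answers for {expected} questions"
            )
    return expected
```

For example, calling check_same_length with the question file and both answer files from the commands above would raise an error if one of the answer files were truncated.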
Please check out the documentation here.
If you find shell_v useful for your research and applications, please cite using this BibTeX:
@misc{jiang2024hallucination,
title={Hallucination Augmented Contrastive Learning for Multimodal Large Language Model},
author={Chaoya Jiang and Haiyang Xu and Mengfan Dong and Jiaxing Chen and Wei Ye and Ming Yan and Qinghao Ye and Ji Zhang and Fei Huang and Shikun Zhang},
year={2024},
eprint={2312.06968},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{jiang2024haleval,
title={Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models},
author={Chaoya Jiang and Wei Ye and Mengfan Dong and Hongrui Jia and Haiyang Xu and Ming Yan and Ji Zhang and Shikun Zhang},
year={2024},
eprint={2402.15721},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
@misc{liu2023llava,
title={Visual Instruction Tuning},
author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
publisher={arXiv:2304.08485},
year={2023},
}
For future project ideas, please check out:
- SEEM: Segment Everything Everywhere All at Once
- Grounded-Segment-Anything to detect, segment, and generate anything by marrying Grounding DINO and Segment-Anything.
The community's use of the Shell-V model must adhere to the "Shell-V Model License Agreement" and the Apache 2.0 license. Shell-V is permitted for commercial use. However, if you plan to use the Shell-V model or its derivative products for commercial purposes, you must confirm that the entity meets the following conditions:
- The daily average active user count (DAU) of the affiliated party's service or product cannot exceed 1 million.
- The affiliated party must not be a software service provider or cloud service provider.
- There is no possibility for the affiliated party to re-license the granted commercial license to another third party without proper authorization.
Under the aforementioned conditions, you need to submit the application materials required by the "Shell-V Model License Agreement" by sending an email to [email protected]. After approval, you will be granted a global, non-exclusive, non-transferable, non-sublicensable commercial copyright license.