Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion.
[Paper] [Project Page] [Demo 8B] [Checkpoint 8B]
- Install packages for training:
```bash
conda create -n florence-vl python=3.11 -y
conda activate florence-vl
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
- Install packages for evaluation (we use lmms-eval for evaluation):
```bash
cd lmms-eval
pip install -e .
```
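As an optional sanity check, the imports below should succeed after installation; the `llava` package name is an assumption carried over from the LLaVA codebase this repo starts from.

```bash
# Optional sanity check; package names are assumptions based on the LLaVA-derived codebase.
python -c "import llava; print('llava OK')"
python -c "import flash_attn; print('flash-attn', flash_attn.__version__)"
python -c "import lmms_eval; print('lmms-eval OK')"
```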
- Pretrain Data: detailed captions from PixelProse and ShareGPT4V (a sketch of the expected JSON layout follows after this list).
- Instruction Data: TODO.
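Since Florence-VL starts from the LLaVA codebase (see the acknowledgements below), the data JSON files most likely follow the LLaVA conversation format. The record below is an illustrative sketch under that assumption, not the released schema:

```bash
# Illustrative only: a minimal LLaVA-style record (field names assumed from the LLaVA data format).
cat > sample_pretrain.json <<'EOF'
[
  {
    "id": "000000000001",
    "image": "images/000000000001.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nDescribe the image in detail."},
      {"from": "gpt", "value": "A detailed caption of the image goes here."}
    ]
  }
]
EOF
```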
Set up your basic Slurm information in scripts/florence-vl/llama/llama3.sh. Then you can run the pretraining and finetuning jobs.

In scripts/florence-vl/llama/pretrain_llama.sh, you need to manually export the following variables:
```bash
export NNODES=<number of nodes>
export DATA_PATH=/your/path/for/pretrain/data/json/file
export IMG=/your/image/folder
export OUTPUT=/checkpoint/save/path
```
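As a concrete illustration, a filled-in pretraining configuration could look like the following; all paths are hypothetical, and whether the script is run with bash or submitted via sbatch depends on your Slurm setup:

```bash
# Hypothetical example values; replace with your own paths.
export NNODES=4
export DATA_PATH=/data/florence-vl/pretrain/detailed_captions.json
export IMG=/data/florence-vl/images
export OUTPUT=/checkpoints/florence-vl-8b-pretrain

bash scripts/florence-vl/llama/pretrain_llama.sh   # or submit via sbatch on a Slurm cluster
```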
In scripts/florence-vl/llama/finetune_llama.sh, you need to manually export the following variables:
```bash
export NNODES=<number of nodes>
export DATA_PATH=/your/path/for/instruction/data/json/file
export IMG=/your/image/folder
export CKPT_PATH=/pretrain/checkpoint
export VIT_PATH=/pretrain/checkpoint/vision_tower
export OUTPUT=/checkpoint/save/path
```
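Continuing the hypothetical paths above, the finetuning stage points CKPT_PATH at the pretraining output and VIT_PATH at the vision tower saved inside it:

```bash
# Hypothetical values, continuing the pretraining example above.
export CKPT_PATH=/checkpoints/florence-vl-8b-pretrain
export VIT_PATH=/checkpoints/florence-vl-8b-pretrain/vision_tower
export OUTPUT=/checkpoints/florence-vl-8b-finetune

bash scripts/florence-vl/llama/finetune_llama.sh   # or submit via sbatch on a Slurm cluster
```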
We use lmms-eval for evaluation.
```bash
export OPENAI_API_KEY=<your key>
python -m accelerate.commands.launch \
    --num_processes=4 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained="/your/model/path/,conv_template=<llama3|phi>" \
    --tasks textvqa_val,gqa,realworldqa,vizwiz_vqa_val,pope,scienceqa_img,mmvet,mme,seedbench,hallusion_bench_image,llava_in_the_wild,mathvista_testmini,docvqa_val,ocrbench,chartqa,ai2d,mmmu_val,mmbench_en_dev,infovqa_val,mmbench_cn_dev,mmstar \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix florence-vl \
    --output_path ./logs/
```
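Before launching the full benchmark suite, it can help to run a single small task first. The command below is a minimal variant of the call above with placeholder paths; set conv_template to llama3 or phi to match your checkpoint, and OPENAI_API_KEY is only needed for GPT-judged tasks such as mmvet or llava_in_the_wild.

```bash
# Minimal smoke test on one benchmark (POPE) with a single process.
python -m accelerate.commands.launch \
    --num_processes=1 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained="/your/model/path/,conv_template=llama3" \
    --tasks pope \
    --batch_size 1 \
    --output_path ./logs/
```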
- Florence-VL 8B: Pretrained Checkpoint and Instructed Checkpoint.
- Florence-VL 3B: Pretrained Checkpoint and Instructed Checkpoint.
LLaVA: We build on the codebase of the amazing LLaVA project.
lmms-eval: Thanks for the amazing multimodal evaluation codebase.