Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion.
[Paper] [Project Page] [Demo 8B] [Checkpoint 8B]
- Install packages for training:
```bash
conda create -n florence-vl python=3.11 -y
conda activate florence-vl
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
- Install packages for evaluation (we use lmms-eval for evaluation):
```bash
cd lmms-eval
pip install -e .
```
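As an optional sanity check, the imports below should succeed after installation; the `llava` package name is an assumption carried over from the LLaVA codebase this repo starts from.

```bash
# Optional sanity check; package names are assumptions based on the LLaVA-derived codebase.
python -c "import llava; print('llava OK')"
python -c "import flash_attn; print('flash-attn', flash_attn.__version__)"
python -c "import lmms_eval; print('lmms-eval OK')"
```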
- Pretrain Data: detailed captions from PixelProse and ShareGPT4V (a sketch of the expected JSON layout follows after this list).
- Instruction Data: TODO.
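Since Florence-VL starts from the LLaVA codebase (see the acknowledgements below), the data JSON files most likely follow the LLaVA conversation format. The record below is an illustrative sketch under that assumption, not the released schema:

```bash
# Illustrative only: a minimal LLaVA-style record (field names assumed from the LLaVA data format).
cat > sample_pretrain.json <<'EOF'
[
  {
    "id": "000000000001",
    "image": "images/000000000001.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nDescribe the image in detail."},
      {"from": "gpt", "value": "A detailed caption of the image goes here."}
    ]
  }
]
EOF
```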
Set up your basic Slurm information in scripts/florence-vl/llama/llama3.sh. Then you can run the pretraining and finetuning jobs.

In scripts/florence-vl/llama/pretrain_llama.sh, you need to manually export the following variables:
```bash
export NNODES=<number of nodes>
export DATA_PATH=/your/path/for/pretrain/data/json/file
export IMG=/your/image/folder
export OUTPUT=/checkpoint/save/path
```
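As a concrete illustration, a filled-in pretraining configuration could look like the following; all paths are hypothetical, and whether the script is run with bash or submitted via sbatch depends on your Slurm setup:

```bash
# Hypothetical example values; replace with your own paths.
export NNODES=4
export DATA_PATH=/data/florence-vl/pretrain/detailed_captions.json
export IMG=/data/florence-vl/images
export OUTPUT=/checkpoints/florence-vl-8b-pretrain

bash scripts/florence-vl/llama/pretrain_llama.sh   # or submit via sbatch on a Slurm cluster
```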
In scripts/florence-vl/llama/finetune_llama.sh, you need to manually export the following variables:
```bash
export NNODES=<number of nodes>
export DATA_PATH=/your/path/for/instruction/data/json/file
export IMG=/your/image/folder
export CKPT_PATH=/pretrain/checkpoint
export VIT_PATH=/pretrain/checkpoint/vision_tower
export OUTPUT=/checkpoint/save/path
```
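Continuing the hypothetical paths above, the finetuning stage points CKPT_PATH at the pretraining output and VIT_PATH at the vision tower saved inside it:

```bash
# Hypothetical values, continuing the pretraining example above.
export CKPT_PATH=/checkpoints/florence-vl-8b-pretrain
export VIT_PATH=/checkpoints/florence-vl-8b-pretrain/vision_tower
export OUTPUT=/checkpoints/florence-vl-8b-finetune

bash scripts/florence-vl/llama/finetune_llama.sh   # or submit via sbatch on a Slurm cluster
```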
We use lmms-eval for evaluation.
```bash
export OPENAI_API_KEY=<your key>
python -m accelerate.commands.launch \
    --num_processes=4 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained="/your/model/path/,conv_template=<llama3|phi>" \
    --tasks textvqa_val,gqa,realworldqa,vizwiz_vqa_val,pope,scienceqa_img,mmvet,mme,seedbench,hallusion_bench_image,llava_in_the_wild,mathvista_testmini,docvqa_val,ocrbench,chartqa,ai2d,mmmu_val,mmbench_en_dev,infovqa_val,mmbench_cn_dev,mmstar \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix florence-vl \
    --output_path ./logs/
```
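Before launching the full benchmark suite, it can help to run a single small task first. The command below is a minimal variant of the call above with placeholder paths; set conv_template to llama3 or phi to match your checkpoint, and OPENAI_API_KEY is only needed for GPT-judged tasks such as mmvet or llava_in_the_wild.

```bash
# Minimal smoke test on one benchmark (POPE) with a single process.
python -m accelerate.commands.launch \
    --num_processes=1 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained="/your/model/path/,conv_template=llama3" \
    --tasks pope \
    --batch_size 1 \
    --output_path ./logs/
```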
- Florence-VL 8B: Pretrained Checkpoint and Instructed Checkpoint.
- Florence-VL 3B: Pretrained Checkpoint and Instructed Checkpoint.
LLaVA: We build on the codebase of the amazing LLaVA project.
lmms-eval: Thanks for the amazing multimodal evaluation codebase.