This repository contains the code and checkpoints for the vision-and-language pre-training model in our paper "How Much Can CLIP Benefit Vision-and-Language Tasks?" (Link).
CLIP-ViL with pre-training sets new single-model state-of-the-art results on benchmarks such as VQA v2.0 (76.70 on test-std).
The code is adapted from both the CLIP repo and the LXMERT repo. Many thanks to the authors of these repos~
We will use COCO images and Visual Genome images for pre-training. We will also use COCO images for VQA fine-tuning.
- Download COCO images:
mkdir -p data/mscoco
wget http://images.cocodataset.org/zips/train2014.zip -P data/mscoco
wget http://images.cocodataset.org/zips/val2014.zip -P data/mscoco
wget http://images.cocodataset.org/zips/test2015.zip -P data/mscoco
unzip data/mscoco/train2014.zip -d data/mscoco/ && rm data/mscoco/train2014.zip
unzip data/mscoco/val2014.zip -d data/mscoco/ && rm data/mscoco/val2014.zip
unzip data/mscoco/test2015.zip -d data/mscoco/ && rm data/mscoco/test2015.zip
- Download Visual Genome images:
cd clip_vl
mkdir -p data/vg_raw_images/VG_100K/
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip -P data/vg_raw_images
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip -P data/vg_raw_images
unzip data/vg_raw_images/images.zip -d data/vg_raw_images/VG_100K/
unzip data/vg_raw_images/images2.zip -d data/vg_raw_images/VG_100K/
- Download the images' width and height data and save it as clip_vl/data/mscoco/width_heigths.json (a sketch for generating this file locally, if the download is unavailable, is given after this list).
- Download the pre-training caption files from LXMERT:
cd clip_vl
mkdir -p data/lxmert
wget nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_train.json -P data/lxmert/
wget nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_nominival.json -P data/lxmert/
wget nlp.cs.unc.edu/data/lxmert_data/lxmert/vgnococo.json -P data/lxmert/
wget nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_minival.json -P data/lxmert/
- Download the VQA annotation files from LXMERT:
cd clip_vl
mkdir -p data/vqa
wget nlp.cs.unc.edu/data/lxmert_data/vqa/train.json -P data/vqa/
wget nlp.cs.unc.edu/data/lxmert_data/vqa/nominival.json -P data/vqa/
wget nlp.cs.unc.edu/data/lxmert_data/vqa/minival.json -P data/vqa/
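Once everything above is downloaded, an optional sanity check (the image counts are the standard split sizes, so treat them as approximate):
# The three COCO splits should contain roughly 83K, 41K, and 81K images, respectively.
ls data/mscoco/train2014 | wc -l
ls data/mscoco/val2014 | wc -l
ls data/mscoco/test2015 | wc -l
# Visual Genome should contain roughly 108K images in total across the two zips.
find data/vg_raw_images -name '*.jpg' | wc -l
# Each annotation file should parse as JSON (guards against truncated downloads or HTML error pages).
for f in data/lxmert/*.json data/vqa/*.json; do
  python3 -c "import json,sys; json.load(open(sys.argv[1])); print('ok', sys.argv[1])" "$f"
done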
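If the width/height download is unavailable, the file can in principle be rebuilt from the extracted COCO images. The snippet below is only a sketch: it assumes width_heigths.json maps each image file name to its [width, height] pair, which we have not verified against the repo's data loader, so check the loading code before relying on it.
python3 - <<'EOF'
# Hypothetical rebuild of data/mscoco/width_heigths.json from the extracted COCO images.
# Assumed format (not verified): {"<image file name>": [width, height]}.
import json, os
from PIL import Image

sizes = {}
for split in ["train2014", "val2014", "test2015"]:
    root = os.path.join("data/mscoco", split)
    for name in os.listdir(root):
        with Image.open(os.path.join(root, name)) as im:
            sizes[name] = [im.width, im.height]

with open("data/mscoco/width_heigths.json", "w") as f:
    json.dump(sizes, f)
EOF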
We recommend using Docker to run the experiments. Use the image pytorch/pytorch:1.6.0-cuda10.1-cudnn7-devel as a starting point.
pip install yacs easydict pycocotools matplotlib pillow commentjson attrdict boto3 h5py requests scikit-learn ftfy regex tqdm ml_collections transformers==3.3.1 msgpack lz4 msgpack_numpy lmdb
conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=10.1
conda install --yes -c eumetsat expect
apt-get update
apt-get install wget vim git
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
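Before launching multi-GPU jobs, a quick optional check that the GPUs are visible and apex imports cleanly:
# Print the PyTorch version, CUDA availability, and GPU count; then try importing apex's mixed-precision module.
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
python3 -c "from apex import amp; print('apex amp import OK')"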
We follow LXMERT and use two-stage pre-training: at stage one, we train without the QA loss; at stage two, we train with the QA loss.
The general command to run experiments is:
bash BASH_FILE GPUS EXPERIMENT_NAME PORT NUM_NODES AnyOtherParameters
All experiments listed below are run on 8 Nvidia A100 GPUs, each with 40GB memory.
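If you have fewer GPUs, one possible adaptation (untested; it assumes the effective batch size is the number of GPUs times --batchSize times --gradient_accumulation_steps) is to shrink the GPU list and the matching count while raising --gradient_accumulation_steps, keeping every other flag unchanged. For example, for the first-stage RN50 command below on 4 GPUs:
# Sketch only: 4 GPUs with doubled gradient accumulation to keep the effective batch size.
bash scripts/pretrain.bash 0,1,2,3 clip_rn50_stage_one 9590 4 --fp16 --gradient_accumulation_steps 4 --batchSize 32  # ...plus the remaining flags from the full command below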
Caveats:
To reduce CPU memory cost, we use shared memory to share annotation files across data readers. Be sure to delete any file with the prefix sharearray_ under /dev/shm/ after you finish training.
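The cleanup is a single command:
# Remove stale shared-memory annotation caches left behind by the data readers.
rm -f /dev/shm/sharearray_*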
- Command to run first-stage pre-training for the RN50 model:
bash scripts/pretrain.bash 0,1,2,3,4,5,6,7 clip_rn50_stage_one 9590 8 --fp16 --gradient_accumulation_steps 2 --batchSize 32 --lr 1e-4 --aspect_ratio_group_factor 5 --add_zero_padding --compress_data --warmup_ratio 0.025 --report_step 200 --numWorkers 20 --train mscoco_train,mscoco_nominival,vgnococo --epochs 20 --sub_sampling --sub_feat_num 100 --schedule 12,17 --use_separate_optimizer_for_visual --sgd_lr 0.003 --sgd_momentum 0.0 --use_positional_embedding
Stop the training once Epoch 9 has finished. There should be a file named snap/pretrain/clip_rn50_stage_one/Epoch09_LXRT.pth.
- Command to run second-stage pre-training for the RN50 model:
bash scripts/pretrain.bash 0,1,2,3,4,5,6,7 clip_rn50_stage_two 9590 8 --fp16 --gradient_accumulation_steps 2 --batchSize 32 --lr 5e-5 --aspect_ratio_group_factor 5 --add_zero_padding --compress_data --warmup_ratio 0.025 --report_step 200 --numWorkers 20 --train mscoco_train,mscoco_nominival,vgnococo --epochs 11 --sub_sampling --sub_feat_num 100 --schedule 4,8 --use_separate_optimizer_for_visual --sgd_lr 0.003 --sgd_momentum 0.0 --use_positional_embedding --load snap/pretrain/clip_rn50_stage_one/Epoch09 --not_load_scheduler --taskQA --not_load_adam_optimizer
When the model finishes training, there should be a file named snap/pretrain/clip_rn50_stage_two/Epoch11_LXRT.pth. Checkpoint on Google Drive.
- Command to run first-stage pre-training for the RN50x4 model:
bash scripts/pretrain.bash 0,1,2,3,4,5,6,7 clip_rn50x4_stage_one 9590 8 --fp16 --gradient_accumulation_steps 2 --batchSize 30 --lr 5e-5 --aspect_ratio_group_factor 5 --add_zero_padding --compress_data --warmup_ratio 0.025 --report_step 200 --numWorkers 20 --train mscoco_train,mscoco_nominival,vgnococo --epochs 20 --sub_sampling --sub_feat_num 100 --schedule 12,17 --use_separate_optimizer_for_visual --sgd_lr 0.003 --sgd_momentum 0.0 --use_positional_embedding --clip_model_name RN50x4
Stop the training once Epoch 9 has finished. There should be a file named snap/pretrain/clip_rn50x4_stage_one/Epoch09_LXRT.pth.
- Command to run second-stage pre-training for the RN50x4 model:
bash scripts/pretrain.bash 0,1,2,3,4,5,6,7 clip_rn50x4_stage_two 9590 8 --fp16 --gradient_accumulation_steps 2 --batchSize 30 --lr 2.75e-5 --aspect_ratio_group_factor 5 --add_zero_padding --compress_data --warmup_ratio 0.025 --report_step 200 --numWorkers 20 --train mscoco_train,mscoco_nominival,vgnococo --epochs 11 --sub_sampling --sub_feat_num 100 --schedule 4,9 --use_separate_optimizer_for_visual --sgd_lr 0.003 --sgd_momentum 0.0 --use_positional_embedding --load snap/pretrain/clip_rn50x4_stage_one/Epoch09 --not_load_scheduler --taskQA --not_load_adam_optimizer
When the model finishes training, there should be a file named snap/pretrain/clip_rn50x4_stage_two/Epoch11_LXRT.pth. Checkpoint on Google Drive.
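Before fine-tuning, it can be worth confirming that a pre-trained checkpoint loads at all; the check below makes no assumption about the checkpoint's internal layout:
# Load the checkpoint on CPU and print its top-level structure.
python3 -c "import torch; ckpt = torch.load('snap/pretrain/clip_rn50x4_stage_two/Epoch11_LXRT.pth', map_location='cpu'); print(type(ckpt), len(ckpt) if hasattr(ckpt, '__len__') else '')"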
Currently, we provide scripts to fine-tune on VQA. Experiments can be run on 4 Nvidia 2080 Ti GPUs.
Caveats:
To reduce CPU memory cost, we use shared memory to share annotation files across data readers. Be sure to delete any file with the prefix sharearray_ under /dev/shm/ after you finish training.
- Training (RN50x4): first download the pre-trained checkpoint to snap/pretrain/clip_rn50x4_stage_two/Epoch11_LXRT.pth, then run:
./scripts/file.sh 0,1,2,3 model_name 9590 4 2
When the model finishes training, there should be a file named snap/vqa/vqa_clip_rn50x4/BEST_LXRT.pth. Checkpoint on Google Drive.
- Testing:
./scripts/file.sh 0 model_name 9590 1 --load snap/vqa/model_name/BEST --test minival  # or --test test
This should give a score of around 73.91 on minival (minival scores are usually 2~3 points lower than those on test-dev). Change --test minival to --test test to generate a json file snap/vqa/test/test_predict.json, which can be submitted to the leaderboard. Using the provided checkpoint should give a score close to the one reported in the paper.
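Before submitting, it may help to sanity-check the prediction file. The official VQA leaderboard expects a JSON list of {"question_id": ..., "answer": ...} entries; we assume test_predict.json follows that layout, but please verify against the repo's output code:
# Parse the prediction file and print its size and first entry.
python3 -c "import json; p = json.load(open('snap/vqa/test/test_predict.json')); print(type(p), len(p)); print(p[0] if isinstance(p, list) else next(iter(p.items())))"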