Official PyTorch implementation of the paper "CoVR: Learning Composed Video Retrieval from Web Video Captions".

lucas-ventura/CoVR

CoVR: Composed Video Retrieval

Learning Composed Video Retrieval from Web Video Captions

Lucas Ventura · Antoine Yang · Cordelia Schmid · Gül Varol

AAAI 2024 TPAMI 2024

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available.
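
The mining step described above can be illustrated with a small sketch. It assumes that two captions count as "similar" when they differ in exactly one word, which the abstract does not state. The function name `mine_caption_pairs` and the sample captions are hypothetical, and this is not the repository's implementation.

```python
from collections import defaultdict
from itertools import combinations


def mine_caption_pairs(captions):
    """Return sorted index pairs (i, j), i < j, of captions differing in exactly one word."""
    buckets = defaultdict(list)
    for idx, caption in enumerate(captions):
        words = caption.lower().split()
        # Mask each position in turn; captions sharing a masked key
        # agree on every word except (possibly) the masked one.
        for pos in range(len(words)):
            key = tuple(words[:pos] + ["<mask>"] + words[pos + 1:])
            buckets[key].append(idx)

    pairs = set()
    for indices in buckets.values():
        for i, j in combinations(indices, 2):
            # Skip exact duplicates: they share every key but differ in no word.
            if captions[i].lower().split() != captions[j].lower().split():
                pairs.add((i, j))
    return sorted(pairs)


captions = [
    "a dog runs on the beach",
    "a cat runs on the beach",
    "a dog runs on the grass",
    "two birds fly",
]
print(mine_caption_pairs(captions))  # [(0, 1), (0, 2)]
```

In the pipeline described above, each mined pair would then be given to a large language model that writes the modification text; that step is omitted here.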

Description

This repository contains the code for the paper "CoVR: Learning Composed Video Retrieval from Web Video Captions".

Please visit our webpage for more details.

This repository contains:

📦 covr
 ┣ 📂 configs                 # hydra config files
 ┣ 📂 src                     # PyTorch datamodules
 ┣ 📂 tools                   # scripts and notebooks
 ┣ 📜 .gitignore
 ┣ 📜 LICENSE
 ┣ 📜 README.md
 ┣ 📜 test.py
 ┗ 📜 train.py

Installation 👷

Create environment  
conda create --name covr
conda activate covr

To install the necessary packages, you can use the provided requirements.txt file:

python -m pip install -r requirements.txt

The code was tested on Python 3.10 and PyTorch 2.4.

Download the datasets

WebVid-CoVR

To use the WebVid-CoVR dataset, you will have to download the WebVid videos and the WebVid-CoVR annotations.

To download the annotations, run:

bash tools/scripts/download_annotation.sh covr

To download the videos, install mpi4py (conda install -c conda-forge mpi4py) and run:

ln -s /path/to/your/datasets/folder datasets
python tools/scripts/download_covr.py --split=<train, val or test>

CC-CoIR

To use the CC-CoIR dataset, you will have to download the Conceptual Caption images and the CC-CoIR annotations.

To download the annotations, run:

bash tools/scripts/download_annotation.sh coir

CIRR

To use the CIRR dataset, you will have to download the CIRR images and the CIRR annotations.

To download the annotations, run:

bash tools/scripts/download_annotation.sh cirr

To download the images, follow the instructions in the CIRR repository. The default folder structure is the following:

📦 CoVR
 ┣ 📂 datasets  
 ┃ ┣ 📂 CIRR
 ┃ ┃ ┣ 📂 images
 ┃ ┃ ┃ ┣ 📂 train
 ┃ ┃ ┃ ┣ 📂 dev
 ┃ ┃ ┃ ┗ 📂 test1
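
Before computing embeddings, it can help to confirm that the images ended up in the layout shown above. The following is a minimal sketch, not part of the repository: the helper `missing_cirr_dirs` is hypothetical, and it assumes the `datasets` folder (or symlink) sits in the current working directory.

```python
from pathlib import Path

# Split folders taken from the default folder structure shown above.
EXPECTED_CIRR_SPLITS = ("train", "dev", "test1")


def missing_cirr_dirs(datasets_root="datasets"):
    """Return the expected CIRR image folders that do not exist under `datasets_root`."""
    images_dir = Path(datasets_root) / "CIRR" / "images"
    return [
        str(images_dir / split)
        for split in EXPECTED_CIRR_SPLITS
        if not (images_dir / split).is_dir()
    ]


# An empty list means all three split folders were found.
print(missing_cirr_dirs("datasets"))
```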

FashionIQ

To use the FashionIQ dataset, you will have to download the FashionIQ images and the FashionIQ annotations.

To download the annotations, run:

bash tools/scripts/download_annotation.sh fiq

To download the images, use the URLs listed in the FashionIQ repository. You can use this script to download the images. Some missing images can also be found here. All the images should be placed in the same folder (datasets/fashion-iq/images).

CIRCO

To use the CIRCO dataset, download both the CIRCO images and the CIRCO annotations. Follow the structure provided in the CIRCO repository and place the files in the datasets/ directory.

(Optional) Download pre-trained models

To download the checkpoints, run:

bash tools/scripts/download_pretrained_models.sh

Usage 💻

Computing BLIP embeddings  

Before training, you will need to compute the BLIP embeddings for the videos/images. To do so, run:

# This will compute the BLIP embeddings for the WebVid-CoVR videos. 
# Note that you can use multiple GPUs with --num_shards and --shard_id
python tools/embs/save_blip_embs_vids.py --video_dir datasets/WebVid/2M/train --todo_ids annotation/webvid-covr/webvid2m-covr_train.csv 

# This will compute the BLIP embeddings for the WebVid-CoVR-Test videos.
python tools/embs/save_blip_embs_vids.py --video_dir datasets/WebVid/8M/train --todo_ids annotation/webvid-covr/webvid8m-covr_test.csv 

# This will compute the BLIP embeddings for the CIRR images.
python tools/embs/save_blip_embs_imgs.py --image_dir datasets/CIRR/images/

# This will compute the BLIP embeddings for FashionIQ images.
python tools/embs/save_blip_embs_imgs.py --image_dir datasets/fashion-iq/images/

# This will compute the BLIP embeddings for the WebVid-CoVR modification texts. Only needed if using the caption retrieval loss (model/loss_terms=si_ti+si_tc).
python tools/embs/save_blip_embs_txts.py annotation/webvid-covr/webvid2m-covr_train.csv datasets/WebVid/2M/blip-vid-embs-large-all
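
The comments above mention splitting the work across GPUs with --num_shards and --shard_id. How the script partitions the videos is not described here. The sketch below shows one common scheme, a strided split, purely as an assumption; `shard_ids` and the sample IDs are hypothetical.

```python
def shard_ids(ids, num_shards, shard_id):
    """Return the part of `ids` that worker `shard_id` would process."""
    if num_shards < 1 or not 0 <= shard_id < num_shards:
        raise ValueError("shard_id must satisfy 0 <= shard_id < num_shards")
    # Strided split: worker k takes items k, k + num_shards, k + 2 * num_shards, ...
    return ids[shard_id::num_shards]


video_ids = ["v0", "v1", "v2", "v3", "v4"]
for shard_id in range(2):
    print(shard_id, shard_ids(video_ids, num_shards=2, shard_id=shard_id))
# 0 ['v0', 'v2', 'v4']
# 1 ['v1', 'v3']
```

Under this assumption, each GPU would presumably run the same command with the same --num_shards and a different --shard_id, so that the shards are disjoint and together cover every video.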

 

Computing BLIP-2 embeddings  

Before training, you will need to compute the BLIP-2 embeddings for the videos/images. To do so, run:

# This will compute the BLIP-2 embeddings for the WebVid-CoVR videos. 
# Note that you can use multiple GPUs with --num_shards and --shard_id
python tools/embs/save_blip2_embs_vids.py --video_dir datasets/WebVid/2M/train --todo_ids annotation/webvid-covr/webvid2m-covr_train.csv 

# This will compute the BLIP-2 embeddings for the WebVid-CoVR-Test videos.
python tools/embs/save_blip2_embs_vids.py --video_dir datasets/WebVid/8M/train --todo_ids annotation/webvid-covr/webvid8m-covr_test.csv 

# This will compute the BLIP-2 embeddings for the CIRR images.
python tools/embs/save_blip2_embs_imgs.py --image_dir datasets/CIRR/images/test1 --save_dir datasets/CIRR/blip2-embs-large/test1
python tools/embs/save_blip2_embs_imgs.py --image_dir datasets/CIRR/images/dev --save_dir datasets/CIRR/blip2-embs-large/dev
python tools/embs/save_blip2_embs_imgs.py --image_dir datasets/CIRR/images/train --save_dir datasets/CIRR/blip2-embs-large/train

# This will compute the BLIP-2 embeddings for FashionIQ images.
python tools/embs/save_blip2_embs_imgs.py --image_dir datasets/fashion-iq/images/

# This will compute the BLIP-2 embeddings for the WebVid-CoVR modification texts. Only needed if using the caption retrieval loss (model/loss_terms=si_ti+si_tc).
python tools/embs/save_blip2_embs_txts.py annotation/webvid-covr/webvid2m-covr_train.csv datasets/WebVid/2M/blip2-vid-embs-large-all

 

Training  

The command to launch a training experiment is the following:

python train.py [OPTIONS]

Command-line parsing is handled by the Hydra library. You can override anything in the configuration by passing arguments like foo=value or foo.bar=value. See the Options parameters section at the end of this README for more details.
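
To make the foo=value and foo.bar=value syntax concrete, here is a small pure-Python illustration of how dotted overrides map onto a nested configuration. It is not Hydra's implementation: the helper `apply_overrides` and the keys `trainer.devices`, `trainer.num_nodes` and `seed` are hypothetical, and values are kept as strings, whereas Hydra also parses types.

```python
import copy


def apply_overrides(config, overrides):
    """Apply `key=value` and `a.b=value` overrides to a nested dict."""
    result = copy.deepcopy(config)  # leave the caller's config untouched
    for item in overrides:
        dotted_key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})  # create missing levels
        node[leaf] = value  # kept as a string in this sketch
    return result


base = {"trainer": {"devices": 1, "num_nodes": 1}}
print(apply_overrides(base, ["trainer.devices=4", "seed=0"]))
# {'trainer': {'devices': '4', 'num_nodes': 1}, 'seed': '0'}
```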

 

Evaluating  

The command to evaluate is the following:

python test.py test=<test> [OPTIONS]

 

Options parameters

Datasets:

  • data=webvid-covr: WebVid-CoVR dataset.
  • data=cirr: CIRR dataset.
  • data=fashioniq: FashionIQ dataset.
  • data=cc-coir: CC-CoIR dataset.
  • data=cc-coir+webvid-covr: WebVid-CoVR and CC-CoIR datasets.

Models:

  • model=blip-large: BLIP model.
  • model=blip2-coco: BLIP-2 model. Needs to be used in conjunction with model/ckpt=blip2-l-coco or another BLIP-2 checkpoint.

Tests:

  • test=all: Test on WebVid-CoVR, CIRR and all three Fashion-IQ test sets.
  • test=webvid-covr: Test on WebVid-CoVR.
  • test=cirr: Test on CIRR.
  • test=fashioniq: Test on all three Fashion-IQ test sets (dress, shirt and toptee).
  • test=circo: Test on CIRCO.

Checkpoints:

  • model/ckpt=blip-l-coco: Default checkpoint for BLIP-L finetuned on COCO.
  • model/ckpt=webvid-covr: Default checkpoint for CoVR finetuned on WebVid-CoVR.
  • model/ckpt=fashioniq-all-ft_covr: Default checkpoint pretrained on WebVid-CoVR and finetuned on FashionIQ.
  • model/ckpt=cirr_ft-covr+gt: Default checkpoint pretrained on WebVid-CoVR and finetuned on CIRR.
  • model/ckpt=blip2-l-coco: Default checkpoint for BLIP-2 L finetuned on COCO.
  • model/ckpt=blip2-l-coco_coir: Default checkpoint for BLIP-2 L pretrained on COCO and finetuned on CC-CoIR.
  • model/ckpt=blip2-l-coco_coir+covr: Default checkpoint for BLIP-2 L pretrained on COCO, finetuned on CC-CoIR and WebVid-CoVR.

Training

  • trainer=gpu: training with CUDA. Change devices to the number of GPUs you want to use.
  • trainer=ddp: training with Distributed Data Parallel (DDP). Change devices and num_nodes to the number of GPUs and the number of nodes you want to use.
  • trainer=cpu: training on the CPU (not recommended).

Logging

  • trainer/logger=csv: log the results in a csv file. Very basic functionality.
  • trainer/logger=wandb: log the results in wandb. This requires installing wandb and setting up a wandb account. This is what we used to log our experiments.
  • trainer/logger=<other>: Other loggers (not tested).

Machine

  • machine=server: You can change the default path to the dataset folder and the batch size. You can create your own machine configuration by adding a new file in configs/machine.

Experiment

There are many pre-defined experiments from the paper in configs/experiment and configs/experiment2. Simply add experiment=<experiment> or experiment2=<experiment> to the command line to use them.
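
The option groups listed above are combined on a single command line. As a minimal sketch, the hypothetical helper `build_command` below assembles such a line from values that appear in the lists above. Whether a particular combination is supported by the repository's configs is not verified here.

```python
import shlex


def build_command(script, overrides):
    """Build a `python <script> key=value ...` command line from a dict of overrides."""
    parts = ["python", script] + [f"{key}={value}" for key, value in overrides.items()]
    return shlex.join(parts)  # quotes arguments only when the shell needs it


print(build_command("train.py", {
    "data": "webvid-covr",
    "model": "blip-large",
    "trainer": "gpu",
    "trainer/logger": "csv",
}))
# python train.py data=webvid-covr model=blip-large trainer=gpu trainer/logger=csv

print(build_command("test.py", {"test": "cirr", "model/ckpt": "cirr_ft-covr+gt"}))
# python test.py test=cirr model/ckpt=cirr_ft-covr+gt
```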

 

Citation

If you use this dataset and/or this code in your work, please cite our paper:

@article{ventura24covr,
  title = {{CoVR}: Learning Composed Video Retrieval from Web Video Captions},
  author = {Lucas Ventura and Antoine Yang and Cordelia Schmid and G{\"u}l Varol},
  journal = {AAAI},
  year = {2024}
}

@article{ventura24covr2,
  title = {{CoVR-2}: Automatic Data Construction for Composed Video Retrieval},
  author = {Lucas Ventura and Antoine Yang and Cordelia Schmid and G{\"u}l Varol},
  journal = {IEEE TPAMI},
  year = {2024}
}

Acknowledgements

Based on BLIP and lightning-hydra-template.
