This repository is the official PyTorch implementation of our paper which will be published at NeurIPS 2020.
- Reproduce the evaluation results on Video-Text Retrieval either with the provided models or by training them from scratch. Configurations and weights for the COOT models described in tables 2 and 3 of the paper are provided.
- Upload COOT feature output. See this issue for an explanation on how to extract them yourself with this version.
- Reproduce the results on Video Captioning described in tables 4 and 5.
- Improve code to make it easier to input a custom dataset.
Requires Python>=3.6, PyTorch>=1.4. Tested on Ubuntu. At least 8GB of free RAM are needed to load the text features into memory. GPU training is recommended for speed (requires 2x11GB GPU memory).
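The version requirements above can be checked before installing anything else. The following is a minimal sketch, not part of the repository: the helper names `parse_version`, `meets_minimum` and `check_environment` are hypothetical, and only the minimum versions (Python 3.6, PyTorch 1.4) come from this README.

```python
import re
import sys


def parse_version(version):
    """Return the leading numeric part of a version string as a tuple of ints."""
    match = re.match(r"(\d+(?:\.\d+)*)", str(version))
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version, minimum):
    """True if `version` (a string) is at least `minimum` (a tuple of ints)."""
    return parse_version(version) >= minimum


def check_environment():
    """Collect readable problems; an empty list means the minimums are met."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python >= 3.6 is required")
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(torch.__version__, (1, 4)):
            problems.append(f"PyTorch >= 1.4 is required, found {torch.__version__}")
    return problems
```

Calling `check_environment()` returns a list of messages that can be printed as warnings. Local build suffixes such as `+cu101` are ignored by `parse_version`.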
- Install Python and PyTorch
- Clone repository:
git clone https://github.com/gingsi/coot-videotext
- Set working directory to be inside the repository:
cd coot-videotext
- Install other requirements
pip install -r requirements.txt
- All future commands in this Readme assume that the current working directory is the root of this repository
Download: Please download the file via P2P using this torrent and kindly keep seeding after you are done. See Troubleshoot / Downloading torrents below for help.
Alternative google drive download: Download Link or Mirror Link
# 1) download ~52GB zipped features to data/activitynet/
# 2) unzip
# after extraction, the folder structure should look like this:
# data/activitynet/features/ICEP_V3_global_pool_skip_8_direct_resize/v_XXXXXXXXXXX.npz
tar -C data/activitynet/features -xvzf data/activitynet/ICEP_V3_global_pool_skip_8_direct_resize.tar.gz
# 3) preprocess dataset and compute bert features
python prepare_activitynet.py
python run_bert.py activitynet --cuda
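Before running the preprocessing step, it may help to confirm that the archive was extracted into the folder layout shown in the comments above. This is a small sketch under that assumption; the function name `count_feature_files` is hypothetical, and the expected number of files is not stated in this README, so the sketch only counts them.

```python
from pathlib import Path


def count_feature_files(feature_dir, pattern="v_*.npz"):
    """Count files matching `pattern` directly inside `feature_dir`."""
    feature_dir = Path(feature_dir)
    if not feature_dir.is_dir():
        raise FileNotFoundError(f"feature folder not found: {feature_dir}")
    return sum(1 for path in feature_dir.glob(pattern) if path.is_file())
```

For example, `count_feature_files("data/activitynet/features/ICEP_V3_global_pool_skip_8_direct_resize")` should return a non-zero count if the extraction worked.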
Download: Please download the file via P2P using this torrent and kindly keep seeding after you are done. See Troubleshoot / Downloading torrents below for help.
Alternative google drive download: Download Link
# 1) download ~13GB zipped features to data/youcook2/
# 2) unzip
tar -C data/youcook2 -xzvf data/youcook2/video_feat_2d3d.tar.gz
# after extraction, the folder structure should look like this:
# data/youcook2/video_feat_2d3d.h5
# 3) preprocess dataset and compute bert features
python prepare_youcook2.py
python run_bert.py youcook2 --cuda --metadata_name 2d3d
Download: Please download the file via P2P using this torrent and kindly keep seeding after you are done. See Troubleshoot / Downloading torrents below for help.
Alternative google drive download: Download Link
# 1) download ~623MB zipped features to data/youcook2/
# 2) unzip
tar -C data/youcook2 -xzvf data/youcook2/video_feat_100m.tar.gz
# after extraction, the folder structure should look like this:
# data/youcook2/video_feat_100m.h5
# 3) preprocess dataset and compute bert features
python prepare_youcook2.py --howto100m
python run_bert.py youcook2 --cuda --metadata_name 100m
Google drive download: Download Link
# 1) download ~100MB zipped models
# 2) unzip
tar -xzvf provided_models.tar.gz
# after extraction, the folder structure should look like this:
# provided_models/MODEL_NAME.pth
--preload_vid # preload video features to RAM (~110GB RAM needed for activitynet, 60GB for youcook2 resnet/resnext, 20GB for youcook2 howto100m)
--workers N # change number of parallel dataloader workers, default: min(10, N_CPU - 1)
--cuda # run on GPU
--single_gpu # run on only one GPU
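The documented default for `--workers`, min(10, N_CPU - 1), can be reproduced as follows. This is an illustrative sketch rather than the repository's code: the function name `default_num_workers` is hypothetical, and clamping the result to zero on single-core machines is an added assumption.

```python
import os


def default_num_workers(max_workers=10):
    """Compute min(max_workers, N_CPU - 1), never going below 0."""
    n_cpu = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(0, min(max_workers, n_cpu - 1))
```

In a PyTorch `DataLoader`, `num_workers=0` means the data is loaded in the main process, so a result of 0 is still a valid setting.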
- We use early stopping. Models are evaluated automatically and results are output during training. To evaluate a model again after training it, check the end of the script output or the logfile in path runs/MODEL_NAME/log_DATE_TIME.log to find the best epoch. Then run: python eval.py config/MODEL_NAME.yaml runs/MODEL_NAME/ckpt_ep##.pth
- When training from scratch, actual results may vary due to randomness (no fixed seeds).
- Described train time assumes data preloading with --preload_vid and varies due to early stopping.
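Re-evaluating a trained model means picking one checkpoint from the run folder. The sketch below lists the saved checkpoints and builds the matching eval command. It assumes that checkpoints are named ckpt_ep followed by the epoch number and that the run folder carries the same name as the config file, as in the commands in this README. The helper names are hypothetical, and the best epoch still has to be read from the log by hand, since the log format is not described here.

```python
import re
from pathlib import Path

CKPT_PATTERN = re.compile(r"ckpt_ep(\d+)\.pth$")


def list_checkpoints(run_dir):
    """Return {epoch: path} for all files named ckpt_ep<number>.pth in run_dir."""
    checkpoints = {}
    for path in Path(run_dir).glob("ckpt_ep*.pth"):
        match = CKPT_PATTERN.search(path.name)
        if match:
            checkpoints[int(match.group(1))] = path
    return checkpoints


def eval_command(model_name, epoch, run_root="runs", config_dir="config"):
    """Build the eval.py argument list for one checkpoint of a finished run."""
    checkpoints = list_checkpoints(Path(run_root) / model_name)
    if epoch not in checkpoints:
        raise FileNotFoundError(f"no checkpoint for epoch {epoch} in {run_root}/{model_name}")
    config = Path(config_dir) / f"{model_name}.yaml"
    return ["python", "eval.py", str(config), str(checkpoints[epoch])]
```

The returned list can be passed to `subprocess.run`, with flags such as `--cuda` appended as needed.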
# train from scratch
python train.py config/anet_coot.yaml --cuda --log_dir runs/anet_coot
# evaluate provided model
python eval.py config/anet_coot.yaml provided_models/anet_coot.pth --cuda --workers 10
Model | Paragraph->Video R@1 | R@5 | R@50 | Video->Paragraph R@1 | R@5 | R@50 | Train time |
---|---|---|---|---|---|---|---|
COOT | 61.3 | 86.7 | 98.7 | 60.6 | 87.9 | 98.7 | ~70min |
# train from scratch (row 1, model with ResNet/ResNext features)
python train.py config/yc2_2d3d_coot.yaml --cuda --log_dir runs/yc2_2d3d_coot
# evaluate provided model (row 1)
python eval.py config/yc2_2d3d_coot.yaml provided_models/yc2_2d3d_coot.pth --cuda
# train from scratch (row 2, model with HowTo100m features)
python train.py config/yc2_100m_coot.yaml --cuda --log_dir runs/yc2_100m_coot
# evaluate provided model (row 2)
python eval.py config/yc2_100m_coot.yaml provided_models/yc2_100m_coot.pth --cuda
Model | Paragraph->Video R@1 | R@5 | R@10 | MR | Sentence->Clip R@1 | R@5 | R@50 | MR | Train time |
---|---|---|---|---|---|---|---|---|---|
COOT with ResNet/ResNeXt features | 51.2 | 79.9 | 88.2 | 1 | 6.6 | 17.3 | 25.1 | 48 | ~180min |
COOT with HowTo100m features | 78.3 | 96.2 | 97.8 | 1 | 16.9 | 40.5 | 52.5 | 9 | ~16min |
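For readers unfamiliar with the metrics in the tables above: R@K is the percentage of queries whose correct match is ranked within the top K results, and MR is the median rank of the correct match. The following is a minimal pure-Python sketch of these definitions, not the evaluation code of this repository. It assumes a square similarity matrix in which the correct candidate for query i is candidate i, and all function names are hypothetical.

```python
from statistics import median


def retrieval_ranks(similarity):
    """Rank (1 = best) of the correct candidate for every query.

    similarity[i][j] is the score between query i and candidate j; the
    correct candidate for query i is assumed to be candidate i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        better = sum(1 for score in row if score > row[i])
        ranks.append(better + 1)
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct candidate is within the top k."""
    return 100.0 * sum(1 for rank in ranks if rank <= k) / len(ranks)


def median_rank(ranks):
    """Median rank of the correct candidate over all queries."""
    return median(ranks)


# Toy example with 3 queries and 3 candidates.
similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.2, 0.1, 0.7],
]
ranks = retrieval_ranks(similarity)  # [1, 3, 1]
```

In the toy example two of three queries rank their match first, so `recall_at_k(ranks, 1)` is about 66.7 and `median_rank(ranks)` is 1.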
The default datasets folder is data/. To use a different folder, supply all Python scripts with the flag --dataroot new_path and change the commands for dataset preprocessing accordingly.
- Activitynet
- Switch start and stop timestamps when start > stop. Affects 2 videos.
- Convert start/stop timestamps to start/stop frames by multiplying with the FPS of the features, using floor for the start frame and ceil for the stop frame.
- Captions: Replace all newlines/tabs/multiple spaces with a single space.
- Truncate captions that are too long (>512 BERT tokens in the paragraph) by retaining at least 4 tokens and the [SEP] token for each sentence. Affects 1 video.
- Expand clips to be at least 10 frames long
- train/val_1: 2823 changed / 54926 total
- train/val_2: 2750 changed / 54452 total
- Activitynet and Youcook2
- Add [CLS] at the start of each paragraph and [SEP] at the end of each sentence before encoding with the BERT model.
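The preprocessing steps listed above can be sketched roughly as follows. This is an illustration of the described rules, not the code in the prepare scripts. Several details are assumptions: frame ranges are treated as half-open [start, stop), short clips are grown symmetrically and then shifted back into the video, leading and trailing whitespace is stripped, and all function names are hypothetical.

```python
import math
import re


def clip_to_frames(start_sec, stop_sec, fps, num_frames, min_frames=10):
    """Convert a clip given in seconds to a [start, stop) frame range."""
    # Swap timestamps that are in the wrong order.
    if start_sec > stop_sec:
        start_sec, stop_sec = stop_sec, start_sec
    # Floor for the start frame, ceil for the stop frame.
    start = max(0, math.floor(start_sec * fps))
    stop = min(num_frames, math.ceil(stop_sec * fps))
    # Expand short clips to min_frames (assumed: grow on both sides,
    # then shift the range back inside the video if it overshoots).
    missing = min_frames - (stop - start)
    if missing > 0:
        start -= missing // 2
        stop += missing - missing // 2
        if start < 0:
            stop -= start
            start = 0
        if stop > num_frames:
            start = max(0, start - (stop - num_frames))
            stop = num_frames
    return start, stop


def normalize_caption(text):
    """Replace newlines, tabs and runs of spaces with a single space.

    Stripping the ends is an extra assumption, not stated in the list above.
    """
    return re.sub(r"\s+", " ", text).strip()


def add_special_tokens(sentences):
    """Put [CLS] before the paragraph and [SEP] after each tokenized sentence."""
    tokens = ["[CLS]"]
    for sentence in sentences:
        tokens.extend(sentence)
        tokens.append("[SEP]")
    return tokens
```

For instance, a clip from 1.2s to 3.7s at 2 FPS covers frames 2 to 8, which is shorter than 10 frames and is therefore expanded by this sketch.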
If you have problems downloading our torrents, try following this tutorial:
- Download and install the torrent client qBittorrent.
- Download the torrent files from the links and open them with qBittorrent.
- Options -> Advanced, check the fields "Always announce to all trackers in a tier" and "Always announce to all tiers".
- Options -> BitTorrent, disable "Torrent Queueing"
- Options -> Connection, disable "Use UPnp..." and everything under "Connection Limits" and set Proxy Server to "(None)"
- Options -> Speed, make sure speed is unlimited.
- Right click your torrent and "Force reannounce"
- Right click your torrent and "Force resume"
- Let it run for at least 24 hours.
- If it still doesn't download after waiting for an hour, feel free to open an issue.
- Once you are done, please keep seeding.
For the full references see our paper. We especially thank the creators of the following github repositories for providing helpful code:
- Zhang et al. for retrieval code and activitynet-densecaptions features: CMHSE
- Wang et al. for their captioning model and code: MART
- Miech et al. for their Video feature extractor and their HowTo100M Model
We also thank the authors of all packages in the requirements.txt file.
Credit of the bird image to Laurie Boyle - Australia.
Code is licensed under Apache2 (Copyright 2020 S. Ging). Dataset features are licensed under Apache2 (Copyright to the respective owners).
If you find our work or code useful, please consider citing our paper:
@inproceedings{ging2020coot,
title={COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning},
author={Simon Ging and Mohammadreza Zolfaghari and Hamed Pirsiavash and Thomas Brox},
booktitle={Conference on Neural Information Processing Systems},
year={2020}
}