This code is released as supplementary material to "Top-down Visual Saliency Guided by Captions" (CVPR 2017).
Clone this repo (including coco-caption as a submodule):
$ git clone --recursive git@github.com:VisionLearningGroup/caption-guided-saliency.git
Install dependencies
The model is implemented in TensorFlow and Python 2.7. To install TensorFlow, refer to the official Installing TensorFlow guide or simply run:
$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.1.0-cp27-none-linux_x86_64.whl
Warning! The standard build of TensorFlow prints warnings like:
The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
These warnings are harmless. To get rid of them, build TensorFlow from source with --config=opt.
Other required Python modules:
$ pip install tqdm numpy six pillow matplotlib scipy
The code also uses ffmpeg for data preprocessing.
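Since preprocessing depends on ffmpeg, it can help to confirm that the tool is reachable before starting. The following is a minimal sketch, not part of the repository: the helper name `tool_available` is hypothetical, and it assumes ffmpeg is expected on your PATH. It uses only the standard library and the `ffmpeg -version` command.

```python
import os
import subprocess


def tool_available(executable="ffmpeg"):
    """Return True if `<executable> -version` starts and exits with status 0."""
    try:
        with open(os.devnull, "w") as devnull:
            status = subprocess.call([executable, "-version"],
                                     stdout=devnull, stderr=devnull)
        return status == 0
    except OSError:
        # Raised when the executable cannot be found or started.
        return False
```

Calling `tool_available()` before running preprocessing.py gives an early, readable failure instead of an error partway through the data conversion.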
Obtain the dataset you need:
- MSR-VTT: train_val_videos.zip, train_val_annotation.zip, test_videos.zip, test_videodatainfo.json
- Flickr30k: flickr30k.tar.gz, flickr30k-images.tar
and unpack the files into their respective directories under ./DATA/.
Expected layout so far is:
./DATA/
└───MSR_VTT/
│   │   test_videodatainfo.json
│   │   train_val_videodatainfo.json
│   │
│   └───TestVideo/
│   │       ...
│   │
│   └───TrainValVideo/
│           ...
└───Flickr30k
    │   results_20130124.token
    │
    └───flickr30k-images/
            ...
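The layout above can be checked with a short script before preprocessing. This is a sketch under stated assumptions rather than code from the repository: the names `EXPECTED` and `missing_paths` are hypothetical, and the path list simply mirrors the tree shown above.

```python
import os

# Paths copied from the expected layout, relative to the data root.
EXPECTED = {
    "MSR-VTT": [
        os.path.join("MSR_VTT", "test_videodatainfo.json"),
        os.path.join("MSR_VTT", "train_val_videodatainfo.json"),
        os.path.join("MSR_VTT", "TestVideo"),
        os.path.join("MSR_VTT", "TrainValVideo"),
    ],
    "Flickr30k": [
        os.path.join("Flickr30k", "results_20130124.token"),
        os.path.join("Flickr30k", "flickr30k-images"),
    ],
}


def missing_paths(dataset, data_root="./DATA"):
    """Return the expected paths for `dataset` that are absent under data_root."""
    return [p for p in EXPECTED[dataset]
            if not os.path.exists(os.path.join(data_root, p))]
```

An empty list from `missing_paths("MSR-VTT")` or `missing_paths("Flickr30k")` means every file and directory named in the tree is present. The check does not inspect the contents of the video or image folders.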
Run data preprocessing
$ python preprocessing.py --dataset {MSR-VTT|Flickr30k}
This step takes about 30 minutes for Flickr30k and about 2 hours for MSR-VTT.
Run training
$ python run_s2vt.py --dataset {MSR-VTT|Flickr30k} --train
We do not fine-tune the CNN part of the model, so training on a GPU takes only a few hours. The training/validation/test splits for Flickr30k are taken from NeuralTalk. After training, you can evaluate the model:
$ python run_s2vt.py --dataset {MSR-VTT|Flickr30k} --test --checkpoint {number}
Once you have a model trained to produce captions for the MSR-VTT dataset, you can generate a video with a saliency visualization similar to those at the beginning of the README:
$ python visualization.py --dataset MSR-VTT \
--media_id video9461 \
--checkpoint {number} \
--sentence "A man is driving a car"
where media_id must belong to the test split of MSR-VTT and sentence sets the query phrase.
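To produce visualizations for several clips or query phrases, the command can be assembled programmatically. The sketch below is not part of the repository: `visualization_cmd` is a hypothetical helper, and it assumes visualization.py is run from the repository root with exactly the flags shown above.

```python
import subprocess


def visualization_cmd(media_id, checkpoint, sentence, dataset="MSR-VTT"):
    """Build the argument list for one visualization.py run."""
    return ["python", "visualization.py",
            "--dataset", dataset,
            "--media_id", media_id,
            "--checkpoint", str(checkpoint),
            "--sentence", sentence]
```

With `checkpoint` set to your own checkpoint number, `subprocess.check_call(visualization_cmd("video9461", checkpoint, "A man is driving a car"))` runs the same command as above. Passing the arguments as a list avoids shell quoting problems with multi-word sentences.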
You can change the model's parameters (layer dimensionality, learning rate, etc.) directly in cfg.py. Every run of run_s2vt.py with the --train switch overwrites the files in the experiments directory.
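Because training overwrites the experiments directory, you may want to copy it aside before starting a new run. The following is a minimal sketch and not part of the repository: `backup_experiments` and the `experiments_backup` destination are hypothetical names, and it assumes the experiments directory sits in the current working directory.

```python
import os
import shutil
import time


def backup_experiments(src="experiments", dest_root="experiments_backup"):
    """Copy `src` into a timestamped folder under `dest_root`.

    Returns the new folder's path, or None if `src` does not exist.
    """
    if not os.path.isdir(src):
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(dest_root, stamp)
    # copytree creates missing parent directories; it fails if dest already exists.
    shutil.copytree(src, dest)
    return dest
```

Calling `backup_experiments()` before `python run_s2vt.py ... --train` keeps the earlier checkpoints available for later evaluation or visualization.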
If you find this useful in your work, please consider citing:
@inproceedings{Ramanishka2017cvpr,
title = {Top-down Visual Saliency Guided by Captions},
author = {Vasili Ramanishka and Abir Das and Jianming Zhang and Kate Saenko},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2017}
}