Setup Instructions

We provide instructions for setting up the COCO and Visual Genome datasets, including precomputed large-vocabulary object detections. The procedure is modular: if you only want to evaluate our pretrained models, you do not need to set up the training sets.

Dependencies

To run the code, you need the following dependencies:

  • Python 3.5+
  • PyTorch 1.0+
  • Pycocotools (pip install pycocotools)
  • Dominate (pip install dominate)
  • Dill (pip install dill)
  • Optional: Hugging Face transformers library (pip install transformers; only if you want to train/evaluate the models with captions)
  • Optional: seg_every_thing (only if you want to use your own detections instead of the ones we provide)
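
For reference, the pip-installable dependencies can be set up as follows; this is just a sketch, and PyTorch itself is best installed by following the instructions on pytorch.org for your platform and CUDA version:

pip install pycocotools dominate dill
# optional, only needed for caption-based training/evaluation
pip install transformers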

COCO dataset setup

For COCO, we adopt the 2017 train/val split (the same one used by COCO-Stuff). Download the val2017 images, the train2017 images (only if you want to train a new model), and the annotations. These links can also be found on the COCO website. Extract the archives and arrange your directory tree so that it looks like this:

datasets/coco/val2017/
datasets/coco/train2017/
datasets/coco/annotations/instances_val2017.json
datasets/coco/annotations/instances_train2017.json
datasets/coco/annotations/captions_val2017.json
datasets/coco/annotations/captions_train2017.json
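
For convenience, the following sketch downloads and extracts the archives into the expected layout; the URLs are the standard COCO 2017 download links at the time of writing, so double-check them against the COCO website:

mkdir -p datasets/coco
cd datasets/coco
# val2017 images (always needed) and train2017 images (only needed for training)
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/zips/train2017.zip
# instances and captions annotations for the 2017 split
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip val2017.zip && unzip train2017.zip && unzip annotations_trainval2017.zip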

Of course, you are also free to create symbolic links to these files/directories if you have already installed COCO somewhere else.
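
For example, assuming an existing installation under /path/to/existing/coco (a placeholder path), the links would look like this:

mkdir -p datasets/coco
ln -s /path/to/existing/coco/val2017 datasets/coco/val2017
ln -s /path/to/existing/coco/train2017 datasets/coco/train2017
ln -s /path/to/existing/coco/annotations datasets/coco/annotations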

Next, scroll down for the instructions on how to set up the detections.

Note: since our approach is weakly supervised, the only annotations we require are the captions, and only for the models trained with style control. The instances files are used solely to retrieve the image metadata when cross-testing between COCO and VG.

Visual Genome dataset setup

Download images.zip, images2.zip, the image metadata, the attributes, and the attribute synsets. You can also find these links on the Visual Genome website. Extract the archives and set up the directory tree as follows:

datasets/vg/VG_100K/
datasets/vg/VG_100K_2/
datasets/vg/image_data.json
datasets/vg/attributes.json
datasets/vg/attribute_synsets.json
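
The extraction step is roughly as sketched below. The archive names are assumptions based on the official Visual Genome distribution and may differ depending on where you download from; images.zip and images2.zip are expected to extract to VG_100K/ and VG_100K_2/, while the metadata and attribute files come as zipped JSONs:

mkdir -p datasets/vg
cd datasets/vg
# image archives (should produce VG_100K/ and VG_100K_2/)
unzip images.zip
unzip images2.zip
# metadata and attribute annotations (zipped JSON files; names may differ)
unzip image_data.json.zip
unzip attributes.json.zip
unzip attribute_synsets.json.zip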

Next, scroll down for the instructions on how to set up the detections.

Visual Genome augmented (VG+) setup

This dataset is an extension of Visual Genome to which we add images from the COCO unlabeled set. For this set you need to download the COCO unlabeled2017 images, extract them to datasets/coco/unlabeled2017/, and set up the detections. You can then train a model in this setting by passing the --augmented_training_set flag to train.py.
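
A sketch of the download step, using what is, at the time of writing, the standard COCO link for the unlabeled set:

cd datasets/coco
wget http://images.cocodataset.org/zips/unlabeled2017.zip
unzip unlabeled2017.zip   # creates unlabeled2017/
# then pass --augmented_training_set to train.py alongside your usual training options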

Detections setup

The fastest way to set up object detections is to download our precomputed detections from the Releases page of this repo. These have to be extracted to the following locations:

datasets/coco/detections_train2017/
datasets/coco/detections_val2017/
datasets/vg/detections_VG_100K/
datasets/vg/detections_VG_100K_2/

For instance, detections_coco_val2017.tar.xz must be downloaded to the coco/ directory and extracted with tar -xvJf detections_coco_val2017.tar.xz.

We also provide a script to infer these detections from scratch. You first have to install seg_every_thing. Since this is based on the Detectron 1 codebase, you need to set up a Python 2.7 environment. Next, set up their pretrained models following their instructions and copy our script tools/infer_detections.py to their tools directory.
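
For example, with conda the Python 2.7 environment and the script copy could look like the sketch below; /path/to/seg_every_thing is a placeholder, and the Caffe2/Detectron dependencies still have to be installed following the seg_every_thing instructions:

conda create -n detectron1 python=2.7
conda activate detectron1
# install the Caffe2/Detectron dependencies as described in the seg_every_thing README
cp tools/infer_detections.py /path/to/seg_every_thing/tools/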

Detections can finally be generated by running:

python2 tools/infer_detections.py \
	--cfg configs/bbox2mask_vg/eval_sw_R101/runtest_clsbox_2_layer_mlp_nograd_R101.yaml \
	--output-dir /path/to/datasets/coco/detections_val2017 \
	--image-ext jpg \
	--thresh 0.2 \
	--filter-classes \
	--use-vg3k \
	--wts lib/datasets/data/trained_models/33219850_model_final_coco2vg3k_seg.pkl \
	/path/to/datasets/coco/val2017

In this example we infer detections for COCO val2017, but the command has to be run once for every directory. Note that detections are pre-filtered by class and thresholded to reduce the size of the outputs. To get the raw output (i.e. all 3000 classes without filtering), you can remove --filter-classes and set --thresh 0.0. This has no impact on the final result, as classes are filtered later anyway, but it can be useful if you want to train a new model with a different set of classes.
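
For reference, the unfiltered variant of the command above only drops --filter-classes and lowers the threshold:

python2 tools/infer_detections.py \
	--cfg configs/bbox2mask_vg/eval_sw_R101/runtest_clsbox_2_layer_mlp_nograd_R101.yaml \
	--output-dir /path/to/datasets/coco/detections_val2017 \
	--image-ext jpg \
	--thresh 0.0 \
	--use-vg3k \
	--wts lib/datasets/data/trained_models/33219850_model_final_coco2vg3k_seg.pkl \
	/path/to/datasets/coco/val2017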

Captions setup

To train a model from scratch, we highly recommend precomputing the BERT sentence representations and saving them to a cache directory. Since the GAN architecture already has a significant memory footprint, you might not be able to reach a sufficient batch size if BERT is also loaded into GPU memory. If the sentence representations are precomputed, BERT does not need to be loaded, and training is greatly sped up.

You can instruct our training/testing/evaluation scripts to use precomputed sentence representations through the --use_precomputed_captions flag, but these have to be generated in advance using precompute_captions.py. It suffices to run this script without arguments:

python precompute_captions.py

The token representations will be extracted from the second-to-last layer of BERT and saved to datasets/coco/precomputed_captions_level2_val2017/ and datasets/coco/precomputed_captions_level2_train2017/.
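
Once the cache has been generated, the flag from above can be added to the training/evaluation commands. A purely hypothetical sketch, where the remaining options are placeholders that depend on your experiment:

python train.py --use_precomputed_captions [other training options]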