Implementation of "Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations" by Fenglin Liu, Yuanxin Liu, Xuancheng Ren, Xiaodong He, and Xu Sun. The paper can be found at [arxiv], [pdf].
Semantic-Grounded Image Representations (Based on the Bottom-up features)
This code is written in Python 2.7 and requires PyTorch >= 0.4.1.
You may take a look at https://github.com/s-gupta/visual-concepts to see how to extract the textual concepts of an image yourself.
- Download
Download the MSCOCO images from the official website. We need the 2014 training and 2014 validation images. Put the train2014/ and val2014/ folders in the ./data/images/ directory.
Note: we also provide a bash script to download the MSCOCO images:
cd data/images/original && bash download_mscoco_images.sh
- Preprocess
Now we need to run resize_images.py to resize all the images (in both the train and val folders) to 256 x 256. You may specify different locations inside resize_images.py:
python resize_images.py
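For reference, here is a minimal sketch of what the resize step does, assuming Pillow is installed and the images live under ./data/images/ (the actual paths and options are set inside resize_images.py):

```python
# Minimal sketch of the resize step. Assumes Pillow and the ./data/images/
# layout described above; resize_images.py is the authoritative version.
import os
from PIL import Image

def resize_folder(src_dir, dst_dir, size=(256, 256)):
    """Resize every image in src_dir to `size` and write it to dst_dir."""
    if not os.path.exists(dst_dir):
        os.makedirs(dst_dir)
    for name in os.listdir(src_dir):
        try:
            with Image.open(os.path.join(src_dir, name)) as img:
                img = img.convert('RGB').resize(size, Image.LANCZOS)
                img.save(os.path.join(dst_dir, name))
        except (IOError, OSError):
            print('Skipping unreadable file: %s' % name)

if __name__ == '__main__':
    for split in ('train2014', 'val2014'):
        # Hypothetical source/target folders; adjust to match resize_images.py.
        resize_folder(os.path.join('./data/images', split),
                      os.path.join('./data/images/resized', split))
```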
- Download
You may download the MSCOCO captions from the official website or use the bash script we provide:
cd data && bash download_mscoco_captions.sh
- Preprocess
Afterwards, we create the Karpathy splits for training, validation, and test:
python KarpathySplit.py
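KarpathySplit.py produces the split used for the experiments. As an illustration only, grouping MSCOCO captions by Karpathy's standard dataset_coco.json split file (an assumption about the input, not necessarily what the script reads) looks roughly like this:

```python
# Rough sketch of the Karpathy train/val/test split. Assumes the standard
# dataset_coco.json released with Karpathy's splits; KarpathySplit.py may
# read different inputs and write different outputs.
import json
from collections import defaultdict

def load_karpathy_split(path='./data/dataset_coco.json'):
    """Group (filename, caption) pairs by the 'split' field."""
    with open(path) as f:
        data = json.load(f)
    splits = defaultdict(list)
    for img in data['images']:
        # The 'restval' images are conventionally merged into training.
        split = 'train' if img['split'] == 'restval' else img['split']
        for sent in img['sentences']:
            splits[split].append((img['filename'], sent['raw']))
    return splits

if __name__ == '__main__':
    splits = load_karpathy_split()
    for name in ('train', 'val', 'test'):
        print('%s: %d captions' % (name, len(splits[name])))
```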
Then we can build the vocabulary by running the following (note: you need the nltk_data to build the vocabulary):
unzip nltk_data.zip && python build_vocab.py
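The tokenization and counting happen inside build_vocab.py; the sketch below shows the general idea with NLTK's word_tokenize (the frequency cutoff and special tokens are illustrative and may differ from the script):

```python
# Illustrative vocabulary construction with NLTK. The word_tokenize call
# needs the punkt models shipped in nltk_data; the cutoff and special
# tokens below are common defaults, not necessarily those of build_vocab.py.
from collections import Counter
import nltk

def build_vocab(captions, min_count=5):
    """Assign an integer id to every word appearing at least min_count times."""
    counter = Counter()
    for caption in captions:
        counter.update(nltk.tokenize.word_tokenize(caption.lower()))
    vocab = {'<pad>': 0, '<start>': 1, '<end>': 2, '<unk>': 3}
    for word, count in counter.items():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

# Example: vocab = build_vocab(caption for _, caption in splits['train'])
```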
Download the Textual Concepts (Google Drive) and put the file in the ./data/ directory:
mv image_concepts.json ./data
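The exact layout of image_concepts.json is not documented here; the snippet below simply assumes it maps an image identifier to a list of concept words (a guess, check the file itself) and shows how to peek at one entry:

```python
# Hedged sketch only: assumes image_concepts.json maps an image identifier
# to a list of textual concept words. Inspect the real file to confirm.
import json

with open('./data/image_concepts.json') as f:
    image_concepts = json.load(f)

example_key = next(iter(image_concepts))
print(example_key, image_concepts[example_key])
```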
Now we can train the baseline models and the baseline w/ MIA models with:
Visual Attention
- Baseline
CUDA_VISIBLE_DEVICES=0,1 python Train.py --basic_model=VisualAttention
- Baseline w/ MIA
CUDA_VISIBLE_DEVICES=0,1 python Train.py --basic_model=VisualAttention --use_MIA=True --iteration_times=2

Concept Attention
- Baseline
CUDA_VISIBLE_DEVICES=0,1 python Train.py --basic_model=ConceptAttention
- Baseline w/ MIA
CUDA_VISIBLE_DEVICES=0,1 python Train.py --basic_model=ConceptAttention --use_MIA=True --iteration_times=2

Visual Condition
- Baseline
CUDA_VISIBLE_DEVICES=0,1 python Train.py --basic_model=VisualCondition
- Baseline w/ MIA
CUDA_VISIBLE_DEVICES=0,1 python Train.py --basic_model=VisualCondition --use_MIA=True --iteration_times=2

Concept Condition
- Baseline
CUDA_VISIBLE_DEVICES=0,1 python Train.py --basic_model=ConceptCondition
- Baseline w/ MIA
CUDA_VISIBLE_DEVICES=0,1 python Train.py --basic_model=ConceptCondition --use_MIA=True --iteration_times=2

Visual Regional Attention
- Baseline
CUDA_VISIBLE_DEVICES=0,1 python Train.py --basic_model=VisualRegionalAttention
- Baseline w/ MIA
CUDA_VISIBLE_DEVICES=0,1 python Train.py --basic_model=VisualRegionalAttention --use_MIA=True --iteration_times=2
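The commands above differ only in the --basic_model, --use_MIA, and --iteration_times flags; a sketch of how such flags could be declared with argparse (not necessarily how Train.py defines them) is:

```python
# Illustrative argparse setup for the training flags used above; Train.py
# may use different defaults or boolean handling.
import argparse

def str2bool(value):
    """Interpret strings such as '--use_MIA=True' / '--use_MIA=False'."""
    return str(value).lower() in ('true', '1', 'yes')

parser = argparse.ArgumentParser()
parser.add_argument('--basic_model', type=str, default='VisualAttention',
                    choices=['VisualAttention', 'ConceptAttention',
                             'VisualCondition', 'ConceptCondition',
                             'VisualRegionalAttention'])
parser.add_argument('--use_MIA', type=str2bool, default=False,
                    help='Refine the image representations with the MIA module.')
parser.add_argument('--iteration_times', type=int, default=2,
                    help='Number of refinement iterations when MIA is enabled.')
args = parser.parse_args()
print(args)
```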
We can test the trained models with:
- Baseline
CUDA_VISIBLE_DEVICES=0 python Test.py --basic_model=basic_model_name
Note: basic_model_name is one of VisualAttention, ConceptAttention, VisualCondition, ConceptCondition, VisualRegionalAttention.
- Baseline w/ MIA
CUDA_VISIBLE_DEVICES=0 python Test.py --basic_model=basic_model_name --use_MIA=True --iteration_times=2
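To evaluate every trained baseline in one pass, one convenience option (not part of the repository) is to loop over the model names with subprocess:

```python
# Convenience sketch (not part of the repository): run Test.py on GPU 0 for
# every baseline listed above, with and without MIA.
import os
import subprocess

MODELS = ['VisualAttention', 'ConceptAttention', 'VisualCondition',
          'ConceptCondition', 'VisualRegionalAttention']

env = dict(os.environ, CUDA_VISIBLE_DEVICES='0')
for model in MODELS:
    subprocess.check_call(['python', 'Test.py', '--basic_model=%s' % model], env=env)
    subprocess.check_call(['python', 'Test.py', '--basic_model=%s' % model,
                           '--use_MIA=True', '--iteration_times=2'], env=env)
```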
If you use this code or our extracted image concepts as part of any published research, please acknowledge the following paper:
@inproceedings{Liu2019MIA,
author = {Fenglin Liu and
Yuanxin Liu and
Xuancheng Ren and
Xiaodong He and
Xu Sun},
title = {Aligning Visual Regions and Textual Concepts for Semantic-Grounded
Image Representations},
booktitle = {NeurIPS},
pages = {6847--6857},
year = {2019}
}
Thanks to the PyTorch team for providing PyTorch 0.4, the COCO team for providing the dataset, Tsung-Yi Lin for providing the evaluation code for MS COCO caption generation, and Yufeng Ma for providing open-source repositories and the Torchvision ResNet implementation.
If you have any questions about the code or our paper, please send an email to [email protected]