[feat] add M4C model for TextVQA #213

Closed
wants to merge 4 commits from the project/m4c branch

Conversation

ronghanghu
Contributor

@ronghanghu ronghanghu commented Jan 6, 2020

Merge the M4C model (https://arxiv.org/pdf/1911.06258.pdf) for TextVQA into Pythia.

Summary of changes:

  • Adding README.md under projects/M4C
  • Adding new models: M4C under pythia/models/m4c.py
  • Adding new dataset classes: m4c_textvqa, m4c_stvqa, and m4c_ocrvqa under pythia/datasets/vqa/
  • Adding new config files under configs/vqa
  • Adding new processors, metrics and losses for M4C training and evaluation.
  • Adding other utilities (such as PHOC feature extraction).

Introducing new dependencies (added to requirements.txt):

  • pytorch-transformers
  • editdistance

M4C for the TextVQA Task

  • R. Hu, A. Singh, T. Darrell, M. Rohrbach, Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA. arXiv preprint arXiv:1911.06258, 2019 ([PDF](https://arxiv.org/pdf/1911.06258.pdf))
@article{hu2019iterative,
  title={Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA},
  author={Hu, Ronghang and Singh, Amanpreet and Darrell, Trevor and Rohrbach, Marcus},
  journal={arXiv preprint arXiv:1911.06258},
  year={2019}
}

Vocabs, ImDBs and Features:

| Datasets | M4C Vocabs | M4C ImDBs | Object Faster R-CNN Features | OCR Faster R-CNN Features |
|----------|------------|-----------|------------------------------|---------------------------|
| TextVQA | [All Vocabs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_vocabs.tar.gz) | [TextVQA ImDB](https://dl.fbaipublicfiles.com/pythia/m4c/data/imdb/m4c_textvqa.tar.gz) | [OpenImages](https://dl.fbaipublicfiles.com/pythia/features/open_images.tar.gz) | [TextVQA Rosetta-en OCRs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_textvqa_ocr_en_frcn_features.tar.gz), [TextVQA Rosetta-ml OCRs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_textvqa_ocr_ml_frcn_features.tar.gz) |
| ST-VQA | [All Vocabs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_vocabs.tar.gz) | [ST-VQA ImDB](https://dl.fbaipublicfiles.com/pythia/m4c/data/imdb/m4c_stvqa.tar.gz) | [ST-VQA Objects](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_stvqa_obj_frcn_features.tar.gz) | [ST-VQA Rosetta-en OCRs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_stvqa_ocr_en_frcn_features.tar.gz) |
| OCR-VQA | [All Vocabs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_vocabs.tar.gz) | [OCR-VQA ImDB](https://dl.fbaipublicfiles.com/pythia/m4c/data/imdb/m4c_ocrvqa.tar.gz) | [OCR-VQA Objects](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_ocrvqa_obj_frcn_features.tar.gz) | [OCR-VQA Rosetta-en OCRs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_ocrvqa_ocr_en_frcn_features.tar.gz) |

Pretrained models:

| Datasets | Configs (under `configs/vqa/`) | Pretrained Models | Metrics | Notes |
|----------|--------------------------------|-------------------|---------|-------|
| TextVQA (`m4c_textvqa`) | `m4c_textvqa/m4c_with_stvqa.yml` | [`download`](https://dl.fbaipublicfiles.com/pythia/m4c/m4c_release_models/m4c_textvqa/m4c_textvqa_m4c_with_stvqa.ckpt) | val accuracy - 40.55%; test accuracy - 40.46% | Rosetta-en OCRs; ST-VQA as additional data |
| TextVQA (`m4c_textvqa`) | `m4c_textvqa/m4c.yml` | [`download`](https://dl.fbaipublicfiles.com/pythia/m4c/m4c_release_models/m4c_textvqa/m4c_textvqa_m4c.ckpt) | val accuracy - 39.40%; test accuracy - 39.01% | Rosetta-en OCRs |
| TextVQA (`m4c_textvqa`) | `m4c_textvqa/m4c_ocr_ml.yml` | [`download`](https://dl.fbaipublicfiles.com/pythia/m4c/m4c_release_models/m4c_textvqa/m4c_textvqa_m4c_ocr_ml.ckpt) | val accuracy - 37.06% | Rosetta-ml OCRs |
| ST-VQA (`m4c_stvqa`) | `m4c_stvqa/m4c.yml` | [`download`](https://dl.fbaipublicfiles.com/pythia/m4c/m4c_release_models/m4c_stvqa/m4c_stvqa_m4c.ckpt) | val ANLS - 0.472 (accuracy - 38.05%); test ANLS - 0.462 | Rosetta-en OCRs |
| OCR-VQA (`m4c_ocrvqa`) | `m4c_ocrvqa/m4c.yml` | [`download`](https://dl.fbaipublicfiles.com/pythia/m4c/m4c_release_models/m4c_ocrvqa/m4c_ocrvqa_m4c.ckpt) | val accuracy - 63.52%; test accuracy - 63.87% | Rosetta-en OCRs |
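
For reference, ANLS (Average Normalized Levenshtein Similarity) is the ST-VQA metric reported above. A minimal sketch of its usual definition, built on the editdistance dependency this PR adds (the function below is illustrative, not the metric implementation in this PR):

```python
import editdistance  # dependency added to requirements.txt by this PR


def anls_score(prediction, gt_answers, threshold=0.5):
    """Per-question ANLS: best normalized Levenshtein similarity over the
    ground-truth answers, zeroed out when it falls below the threshold."""
    pred = prediction.strip().lower()
    best = 0.0
    for gt in gt_answers:
        gt = gt.strip().lower()
        dist = editdistance.eval(pred, gt)
        best = max(best, 1.0 - dist / max(len(pred), len(gt), 1))
    return best if best >= threshold else 0.0


# Dataset-level ANLS is the mean of the per-question scores, e.g.:
# anls = sum(anls_score(p, g) for p, g in zip(preds, gt_lists)) / len(preds)
```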

@facebook-github-bot facebook-github-bot added the CLA Signed label Jan 6, 2020
@ronghanghu ronghanghu force-pushed the project/m4c branch 6 times, most recently from 8423d60 to e5b58fd on January 6, 2020 at 17:14
@apsdehal
Contributor

Any plans to fix the build?

@ronghanghu
Contributor Author

ronghanghu commented Jan 12, 2020

> Any plans to fix the build?

Yes! After finishing integration of captioning experiments, I'm addressing the issues you mentioned offline last week (including the CI errors above), and will update this PR.

Contributor

@vedanuj vedanuj left a comment

Please make sure all new files have the required copyright headers.

Comment on lines +5 to +217
"wouldnt": "wouldn't",
"wouldnt've": "wouldn't've",
"wouldn'tve": "wouldn't've",
"yall": "y'all",
"yall'll": "y'all'll",
"y'allll": "y'all'll",
"yall'd've": "y'all'd've",
"y'alld've": "y'all'd've",
"y'all'dve": "y'all'd've",
"youd": "you'd",
"youd've": "you'd've",
"you'dve": "you'd've",
"youll": "you'll",
"youre": "you're",
"youve": "you've",
}

NUMBER_MAP = {
"none": "0",
"zero": "0",
"one": "1",
"two": "2",
"three": "3",
"four": "4",
"five": "5",
"six": "6",
"seven": "7",
"eight": "8",
"nine": "9",
"ten": "10",
}
ARTICLES = ["a", "an", "the"]
PERIOD_STRIP = re.compile("(?!<=\d)(\.)(?!\d)")
COMMA_STRIP = re.compile("(?<=\d)(\,)+(?=\d)")
PUNCTUATIONS = [
";",
r"/",
"[",
"]",
'"',
"{",
"}",
"(",
")",
"=",
"+",
"\\",
"_",
"-",
">",
"<",
"@",
"`",
",",
"?",
"!",
]

def __init__(self, *args, **kwargs):
pass

def word_tokenize(self, word):
word = word.lower()
word = word.replace(",", "").replace("?", "").replace("'s", " 's")
return word.strip()

def process_punctuation(self, in_text):
out_text = in_text
for p in self.PUNCTUATIONS:
if (p + " " in in_text or " " + p in in_text) or (
re.search(self.COMMA_STRIP, in_text) is not None
):
out_text = out_text.replace(p, "")
else:
out_text = out_text.replace(p, " ")
out_text = self.PERIOD_STRIP.sub("", out_text, re.UNICODE)
return out_text

def process_digit_article(self, in_text):
out_text = []
temp_text = in_text.lower().split()
for word in temp_text:
word = self.NUMBER_MAP.setdefault(word, word)
if word not in self.ARTICLES:
out_text.append(word)
else:
pass
for word_id, word in enumerate(out_text):
if word in self.CONTRACTIONS:
out_text[word_id] = self.CONTRACTIONS[word]
out_text = " ".join(out_text)
return out_text

def __call__(self, item):
item = self.word_tokenize(item)
item = item.replace("\n", " ").replace("\t", " ").strip()
item = self.process_punctuation(item)
item = self.process_digit_article(item)
return item
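
For reference, the normalization above (lower-casing, punctuation handling, number-word and article mapping, contraction fixing) behaves roughly as follows; the instance below is hypothetical since the enclosing class definition falls outside this hunk:

```python
# `processor` stands for an instance of the answer processor shown above
# (hypothetical usage -- the enclosing class definition is outside this hunk).
# processor("The   traffic light!")  ->  "traffic light"   # article and punctuation dropped
# processor("Two dogs,")             ->  "2 dogs"           # number word mapped to a digit
# processor("youre right")           ->  "you're right"     # contraction restored
```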

Contributor

Please reuse the existing code and remove this duplication.

Contributor Author

@ronghanghu ronghanghu Jan 14, 2020

@vedanuj Thanks for the comments! I'll fix it later this week.

@ronghanghu ronghanghu force-pushed the project/m4c branch 3 times, most recently from ec1b1dc to e0e99c9 on January 17, 2020 at 01:54
@ronghanghu
Contributor Author

ronghanghu commented Jan 17, 2020

@vedanuj thanks for your review!

> Any plans to fix the build?

It's fixed now by pinning pytest==5.2.0 in requirements.txt.

> Please make sure all new files have the required copyright headers.

The header "Copyright (c) Facebook, Inc. and its affiliates." has been added to the new files.

> Please reuse the existing code and remove this duplication.

I found that this is non-trivial to fix. This PR is made against v0.4, but the EvalAIAnswerProcessor only exists in the current master branch. I tried rebasing against master; however, v0.4 has a major commit ahead of master (926d3b0) that addresses multitasking (#173), which causes a lot of rebase conflicts. (Note that this PR for the M4C model is built to be compatible with #173, which I believe will eventually appear in master.)

Do you have suggestions on how to proceed here?

Contributor

@apsdehal apsdehal left a comment

I am landing this internally, but it would be great if you could work on this in a separate PR.

requests==2.21.0
fasttext==0.9.1
fastText
Contributor

Any particular reason that we are not using a specific version here?

nltk==3.4.1
pytorch-transformers==1.2.0
editdistance
Contributor

Same here?

@@ -0,0 +1,146 @@
// C implementation of the PHOC representation. Converts a string into a PHOC feature vector
Contributor

We should move M4C-specific utils to a folder utils/m4c_utils.
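
For context, PHOC (Pyramidal Histogram of Characters) marks which characters occur in which horizontal segment of a word at several pyramid levels. A simplified pure-Python sketch of the general idea (alphabet and pyramid levels chosen for illustration; the C implementation added in this PR may differ in details):

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # 36 symbols, a common PHOC choice
LEVELS = (1, 2, 3, 4, 5)                            # illustrative pyramid levels


def phoc_sketch(word):
    """Simplified PHOC: for each level, split the word into `level` regions by
    character position and mark which alphabet symbols fall into each region."""
    word = word.lower()
    n = max(len(word), 1)
    vec = np.zeros(sum(LEVELS) * len(ALPHABET), dtype=np.float32)
    offset = 0
    for level in LEVELS:
        for i, ch in enumerate(word):
            if ch not in ALPHABET:
                continue
            centre = (i + 0.5) / n                    # normalized character position
            region = min(int(centre * level), level - 1)
            vec[offset + region * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
        offset += level * len(ALPHABET)
    return vec
```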

@@ -0,0 +1,50 @@

Contributor

This should go inside distributed utils.

@@ -0,0 +1,22 @@
dataset_attributes:
Contributor

Configs currently have a lot of replication and are not fully utilizing the power of inheritance in our configuration system.

Contributor

@vedanuj vedanuj left a comment

In general this looks good. I think a lot of code/config duplication can be avoided.

Comment on lines +9 to +11
# install `vqa-maskrcnn-benchmark` from
# https://github.com/ronghanghu/vqa-maskrcnn-benchmark-m4c
import sys; sys.path.append('/private/home/ronghanghu/workspace/vqa-maskrcnn-benchmark') # NoQA
Contributor

@vedanuj vedanuj Mar 19, 2020

Is there some specific change done for M4C? If not, can the https://gitlab.com/meetshah1995/vqa-maskrcnn-benchmark repo be used here? It is already being used in the feature extraction script Pythia provides at pythia/scripts/features/extract_features_vmb.py.

Contributor

Yeah, this needs a separate look.

Contributor Author

Yes, there are specific changes for OCR feature extraction. The major change is to allow RoI pooling from externally specified bounding boxes (OCR boxes in our use case) instead of from the Faster R-CNN's own RPN proposals. The new branch (https://github.com/ronghanghu/vqa-maskrcnn-benchmark-m4c) is compatible with pythia/scripts/features/extract_features_vmb.py, so it can also be landed into https://gitlab.com/meetshah1995/vqa-maskrcnn-benchmark.
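
To make this concrete: the change amounts to pooling region features from boxes supplied by the caller rather than from RPN proposals. A minimal sketch of that idea using torchvision's roi_align (tensors and scale are hypothetical; this is not the actual vqa-maskrcnn-benchmark code):

```python
import torch
from torchvision.ops import roi_align

# Hypothetical inputs, for illustration only.
feature_map = torch.randn(1, 256, 50, 50)             # backbone features for one image
ocr_boxes = torch.tensor([[ 30.,  40., 120.,  80.],   # externally supplied OCR boxes,
                          [200.,  10., 360.,  60.]])  # (x1, y1, x2, y2) in image coordinates
batch_idx = torch.zeros(len(ocr_boxes), 1)            # both boxes belong to image 0
rois = torch.cat([batch_idx, ocr_boxes], dim=1)       # roi_align expects [batch_idx, x1, y1, x2, y2]

# Pool one fixed-size feature per given box; spatial_scale maps image coordinates
# to feature-map coordinates (here a 50-cell map for an ~800-pixel image).
ocr_features = roi_align(feature_map, rois, output_size=(7, 7), spatial_scale=50.0 / 800.0)
print(ocr_features.shape)  # torch.Size([2, 256, 7, 7])
```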

@@ -24,3 +24,5 @@ eggs/
*.egg
.DS_Store
.vscode/*
*.so
*-checkpoint.ipynb
Contributor

Are .ipynb files generated by any of the scripts added here? If not, please remove this change.

Contributor Author

Thanks, I'll remove this change.

The *-checkpoint.ipynb files are generated automatically by Jupyter Notebook servers (similar to how vim generates .swp files, except that they are not deleted after the Jupyter Notebook server is shut down). These notebooks were used primarily during our internal analyses and are not added in this PR.

Comment on lines +10 to +40
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
- open_images/detectron_fix_100/fc6/train,m4c_textvqa_ocr_en_frcn_features/train_images
Contributor

This behaviour should be configurable from code in the future. This looks very messy now.

Comment on lines +33 to +60
def _image_transform(image_path):
    img = Image.open(image_path)
    im = np.array(img).astype(np.float32)
    # handle a few corner cases
    if im.ndim == 2:  # gray => RGB
        im = np.tile(im[:, :, None], (1, 1, 3))
    if im.shape[2] > 3:  # RGBA => RGB
        im = im[:, :, :3]

    im = im[:, :, ::-1]  # RGB => BGR
    im -= np.array([102.9801, 115.9465, 122.7717])
    im_shape = im.shape
    im_size_min = np.min(im_shape[0:2])
    im_size_max = np.max(im_shape[0:2])
    im_scale = float(800) / float(im_size_min)
    # Prevent the biggest axis from being more than max_size
    if np.round(im_scale * im_size_max) > 1333:
        im_scale = float(1333) / float(im_size_max)
    im = cv2.resize(
        im,
        None,
        None,
        fx=im_scale,
        fy=im_scale,
        interpolation=cv2.INTER_LINEAR
    )
    img = torch.from_numpy(im).permute(2, 0, 1)
    return img, im_scale
Contributor

Please check if code can be reused between these and pythia/scripts/features/extract_features_vmb.py.



@registry.register_processor("bert_tokenizer")
class BertTokenizerProcessor(BaseProcessor):
Contributor

@apsdehal We should check whether this is the same as or different from the BertTokenizer we have internally when merging.

Contributor

Ours is inside processors/bert.
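
For reference, a minimal sketch of wrapping pytorch-transformers' BertTokenizer in a processor-style callable (class name, configuration handling, and padding scheme are illustrative, not the PR's or Pythia's actual implementation):

```python
from pytorch_transformers import BertTokenizer


class BertTokenizerSketch:
    """Illustrative only: turns a question string into fixed-length token ids."""

    def __init__(self, max_length=20):  # max_length is an assumed parameter
        self._tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self._max_length = max_length

    def __call__(self, text):
        tokens = self._tokenizer.tokenize(text.lower())[: self._max_length]
        token_ids = self._tokenizer.convert_tokens_to_ids(tokens)
        token_ids += [0] * (self._max_length - len(token_ids))  # 0 is [PAD] in the BERT vocab
        return token_ids


# e.g. BertTokenizerSketch()("what does the sign say?") -> list of 20 token ids
```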

@@ -160,7 +160,7 @@ def forward(self, sample_list, model_output, *args, **kwargs):
        if loss.dim() == 0:
            loss = loss.view(1)

        key = "{}/{}/{}".format(
Contributor

Was this a bug earlier?

Contributor

Yeah, I noticed this and fixed it in the rebase. It was already fixed in the dev branch.

@@ -42,7 +42,7 @@ def __init__(self, trainer):

        self.models_foldername = os.path.join(self.ckpt_foldername, "models")
        if not os.path.exists(self.models_foldername):
            os.makedirs(self.models_foldername)
            os.makedirs(self.models_foldername, exist_ok=True)
Contributor

Can we not change this? This might have dangerous consequences where we overwrite trained models by mistake.

Contributor

This is fine. This is an issue with distributed training that is already fixed in our dev branch. If you look at the if statement above you will understand: there is a race condition where one of the jobs has already created the folder, and then the whole thing fails. That's why exist_ok=True is needed. exist_ok doesn't overwrite anything; it just doesn't throw an error if the folder is already there.

Contributor Author

Hi, from Oleksii's experience, without this change the program frequently crashes (over 50% of the time) in distributed training due to a race condition, since multiple processes try to make the directories and there's no lock around these lines.
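
For completeness, a minimal sketch of the race being discussed (the path is hypothetical):

```python
import os

models_dir = os.path.join("./save", "models")  # hypothetical checkpoint folder

# Racy pattern: if another training process creates the directory between the
# exists() check and makedirs(), this process dies with FileExistsError.
if not os.path.exists(models_dir):
    os.makedirs(models_dir)

# Race-free pattern: idempotent, never deletes or overwrites anything, it simply
# tolerates the directory already existing.
os.makedirs(models_dir, exist_ok=True)
```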

Comment on lines -1 to +10
torch==1.2.0
torchvision==0.2.2
tensorboardX==1.2
torch>=1.2
torchvision>0.2
tensorboardX>=1.2
numpy>=1.14
tqdm==4.19.9
tqdm>=4.19
demjson>=2.2
torchtext>=0.2
GitPython>=2.1
PyYAML>=3.11
pytest==3.3.2
pytest==5.2.0
Contributor

Are these changes necessary for this PR? If not, please remove them.

Contributor

I removed the changes to >= pins as they can be breaking in nature. The rest I accepted and then updated with our changes as of now.

@apsdehal
Contributor

@ronghanghu This has landed internally on dev. Please make further PRs on that branch. This should be automatically closed once it lands in master.

@ronghanghu
Contributor Author

ronghanghu commented Mar 19, 2020

@apsdehal and @vedanuj Thanks a lot for your review and landing!

(I'll make PRs to the dev branch for future changes.)

@apsdehal
Contributor

Also, I would suggest moving to the automatic download API instead of making users download everything manually per the instructions in your README.

@apsdehal
Contributor

apsdehal commented May 7, 2020

Closing as landed internally.

@apsdehal apsdehal closed this May 7, 2020
apsdehal pushed a commit that referenced this pull request May 8, 2020
Closes #213

Merge the M4C model (https://arxiv.org/pdf/1911.06258.pdf) for TextVQA into Pythia.

Summary of changes:
* Adding `README.md` under `projects/M4C`
* Adding new models: M4C under `pythia/models/m4c.py`
* Adding new dataset classes: `m4c_textvqa`, `m4c_stvqa`, and `m4c_ocrvqa` under `pythia/datasets/vqa/`
* Adding new config files under `configs/vqa`
* Adding new processors, metrics and losses for M4C training and evaluation.
* Adding other utilities (such as PHOC feature extraction).

Introducing new dependencies (added to `requirements.txt`):
* `pytorch-transformers`
* `editdistance`

* R. Hu, A. Singh, T. Darrell, M. Rohrbach, *Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA*. arXiv preprint arXiv:1911.06258, 2019 ([PDF](https://arxiv.org/pdf/1911.06258.pdf))
```
@Article{hu2019iterative,
  title={Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA},
  author={Hu, Ronghang and Singh, Amanpreet and Darrell, Trevor and Rohrbach, Marcus},
  journal={arXiv preprint arXiv:1911.06258},
  year={2019}
}
```

| Datasets      | M4C Vocabs | M4C ImDBs | Object Faster R-CNN Features | OCR Faster R-CNN Features |
|--------------|-----|-----|-----------------------------------------------------------------------------------|---------------------------------------------------------------------------------|
| TextVQA      | [All Vocabs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_vocabs.tar.gz) | [TextVQA ImDB](https://dl.fbaipublicfiles.com/pythia/m4c/data/imdb/m4c_textvqa.tar.gz) | [OpenImages](https://dl.fbaipublicfiles.com/pythia/features/open_images.tar.gz) | [TextVQA Rosetta-en OCRs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_textvqa_ocr_en_frcn_features.tar.gz), [TextVQA Rosetta-ml OCRs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_textvqa_ocr_ml_frcn_features.tar.gz) |
| ST-VQA      | [All Vocabs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_vocabs.tar.gz) | [ST-VQA ImDB](https://dl.fbaipublicfiles.com/pythia/m4c/data/imdb/m4c_stvqa.tar.gz) | [ST-VQA Objects](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_stvqa_obj_frcn_features.tar.gz) | [ST-VQA Rosetta-en OCRs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_stvqa_ocr_en_frcn_features.tar.gz) |
| OCR-VQA      | [All Vocabs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_vocabs.tar.gz) | [OCR-VQA ImDB](https://dl.fbaipublicfiles.com/pythia/m4c/data/imdb/m4c_ocrvqa.tar.gz) | [OCR-VQA Objects](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_ocrvqa_obj_frcn_features.tar.gz) | [OCR-VQA Rosetta-en OCRs](https://dl.fbaipublicfiles.com/pythia/m4c/data/m4c_ocrvqa_ocr_en_frcn_features.tar.gz) |

| Datasets  | Configs (under `configs/vqa/`)         | Pretrained Models | Metrics                     | Notes                         |
|--------|------------------|----------------------------|-------------------------------|-------------------------------|
| TextVQA (`m4c_textvqa`) | `m4c_textvqa/m4c_with_stvqa.yml` | [`download`](https://dl.fbaipublicfiles.com/pythia/m4c/m4c_release_models/m4c_textvqa/m4c_textvqa_m4c_with_stvqa.ckpt) | val accuracy - 40.55%; test accuracy - 40.46% | Rosetta-en OCRs; ST-VQA as additional data |
| TextVQA (`m4c_textvqa`) | `m4c_textvqa/m4c.yml` | [`download`](https://dl.fbaipublicfiles.com/pythia/m4c/m4c_release_models/m4c_textvqa/m4c_textvqa_m4c.ckpt) | val accuracy - 39.40%; test accuracy - 39.01% | Rosetta-en OCRs |
| TextVQA (`m4c_textvqa`) | `m4c_textvqa/m4c_ocr_ml.yml` | [`download`](https://dl.fbaipublicfiles.com/pythia/m4c/m4c_release_models/m4c_textvqa/m4c_textvqa_m4c_ocr_ml.ckpt) | val accuracy - 37.06% | Rosetta-ml OCRs |
| ST-VQA (`m4c_stvqa`)  | `m4c_stvqa/m4c.yml` | [`download`](https://dl.fbaipublicfiles.com/pythia/m4c/m4c_release_models/m4c_stvqa/m4c_stvqa_m4c.ckpt) | val ANLS - 0.472 (accuracy - 38.05%); test ANLS - 0.462 | Rosetta-en OCRs |
| OCR-VQA (`m4c_ocrvqa`) | `m4c_ocrvqa/m4c.yml` | [`download`](https://dl.fbaipublicfiles.com/pythia/m4c/m4c_release_models/m4c_ocrvqa/m4c_ocrvqa_m4c.ckpt) | val accuracy - 63.52%; test accuracy - 63.87% | Rosetta-en OCRs |