Model Name | Arxiv Time | Paper | Code | Resources |
---|---|---|---|---|
ViLBERT | Aug 6 2019 | paper | official | |
VisualBERT | Aug 9 2019 | paper | official | huggingface |
LXMERT | Aug 20 2019 | paper | official | huggingface |
VL-BERT | Aug 22 2019 | paper | official | |
UNITER | Sep 25 2019 | paper | official | |
PixelBERT | Apr 2 2020 | paper | | |
Oscar | Apr 4 2020 | paper | official | |
VinVL | Jan 2 2021 | paper | official | |
ViLT | Feb 5 2021 | paper | official | huggingface |
CLIP-ViL | Jul 13 2021 | paper | official | |
METER | Nov 3 2021 | paper | official | |
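
Several of the models above (see the Resources column) have Hugging Face `transformers` ports. As a minimal sketch, the snippet below runs VQA with ViLT, assuming the public `dandelin/vilt-b32-finetuned-vqa` checkpoint and a local install of `transformers`, `torch`, and `Pillow`:

```python
# Minimal sketch: zero-shot VQA with the ViLT transformers port.
import requests
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# Example COCO image (the two-cats picture used throughout the transformers docs).
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
question = "How many cats are there?"

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
print("Predicted answer:", model.config.id2label[logits.argmax(-1).item()])
```
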
Model Name | Arxiv Time | Paper | Code | Comment |
---|---|---|---|---|
CLIP | Feb 26 2021 | paper | official | Powerful representations learned through large-scale image-text contrastive pretraining.
ALIGN | Feb 11 2021 | paper | | Impressive image-text retrieval ability.
FILIP | Nov 9 2021 | paper | | Finer-grained representations learned through patch-token contrastive learning.
LiT | Nov 15 2021 | paper | official | Locked-image tuning: freezing a pretrained image encoder while tuning the text encoder proves effective.
Florence | Nov 22 2021 | paper | | Large-scale contrastive pretraining, adapted to a wide range of downstream vision tasks.
FLIP | Dec 1 2022 | paper | official | Scaled up the effective batch size (and thus the number of in-batch negatives) by masking out a large portion of image patches.
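
The models in this table share the same core training signal: a symmetric image-text contrastive (InfoNCE) loss over in-batch negatives. The sketch below is a schematic PyTorch version of that loss, not the exact code of any of the papers; FLIP's patch masking mainly makes each image cheaper to encode, so larger batches (more in-batch negatives) fit in memory.

```python
# Schematic CLIP/ALIGN-style symmetric contrastive loss (illustrative, not official code).
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matching pairs sit on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)                   # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)               # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random features standing in for encoder outputs.
loss = clip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```
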
Model Name | Arxiv Time | Paper | Code | Comment |
---|---|---|---|---|
MDETR | Apr 26 2021 | paper | official | Impressive visual grounding achieved by combining DETR with RoBERTa.
ALBEF | Jul 16 2021 | paper | official | BLIP's predecessor. Contrastive learning for unimodal representations followed by a multimodal transformer-based fusion encoder.
UniT | Aug 3 2021 | paper | official | |
VLMo | Nov 3 2021 | paper | official | Mixture of unimodal experts before multimodal experts. |
UFO | Nov 19 2021 | paper | | |
FLAVA | Dec 8 2021 | paper | official | Multitask training for unimodal and multimodal representations. Can be finetuned for a variety of downstream tasks. |
BEiT-3 | Aug 8 2022 | paper | official | VLMo scaled up. |
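
A recurring pattern in this table (ALBEF, METER, VLMo) is separate unimodal encoders followed by a fusion module in which text tokens cross-attend to image features. The block below is a schematic PyTorch sketch of such a fusion layer; the dimensions, head counts, and layer structure are placeholders rather than any paper's configuration.

```python
# Schematic "dual encoder + fusion" layer, written from the papers' descriptions.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Text tokens attend to themselves, then cross-attend to image patch features."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        t = self.norm1(text)
        text = text + self.self_attn(t, t, t)[0]
        text = text + self.cross_attn(self.norm2(text), image, image)[0]
        return text + self.ffn(self.norm3(text))

# Toy usage: 32 text tokens fusing with 196 image patch features.
fused = FusionBlock()(torch.randn(2, 32, 256), torch.randn(2, 196, 256))
print(fused.shape)  # torch.Size([2, 32, 256])
```
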
Model Name | Arxiv Time | Paper | Code | Comment |
---|---|---|---|---|
VL-T5 | Feb 4 2021 | paper | official | Unified image-text tasks with text generation, also capable of grounding. |
SimVLM | Aug 24 2021 | paper | official | Pretrained with large-scale image-text pairs and image-text tasks with prefix LM. |
UniTab | Nov 23 2021 | paper | official | Unified text generation with bounding box outputs. |
BLIP | Jan 28 2022 | paper | official | CapFilt method for bootstrapping image-text training data (captioner + filter). Contrastive learning, image-text matching and LM as training objectives.
CoCa | May 4 2022 | paper | pytorch | Large-scale image-text contrastive learning combined with text generation (LM).
GIT | May 27 2022 | paper | official | GPT-like language model conditioned on visual features extracted by a pretrained ViT (SoTA on image captioning benchmarks at release).
DaVinci | Jun 15 2022 | paper | official | Output generation conditioned on prefix texts or prefix images. Supports text and image generation. |
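
These models attach a text decoder and train with a language-modeling objective, so captioning comes almost for free at inference time. A minimal sketch using BLIP's Hugging Face port, assuming the public `Salesforce/blip-image-captioning-base` checkpoint:

```python
# Minimal sketch: image captioning with BLIP via transformers.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```
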
Model Name | Arxiv Time | Paper | Code |
---|---|---|---
Frozen | Jun 25 2021 | paper | |
Flamingo | Apr 29 2022 | paper | OpenFlamingo |
MetaLM | Jun 13 2022 | paper | official |
PaLI | Sep 14 2022 | paper | |
BLIP-2 | Jan 30 2023 | paper | official |
KOSMOS | Feb 27 2023 | paper | official |
PaLM-E | Mar 6 2023 | paper | |
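
Models in this group keep a frozen (or mostly frozen) LLM and feed it visual features through a learned bridge (e.g. BLIP-2's Q-Former, Flamingo's cross-attention layers). A minimal sketch with BLIP-2's `transformers` port, assuming the public `Salesforce/blip2-opt-2.7b` checkpoint and a CUDA GPU:

```python
# Minimal sketch: prompted VQA with BLIP-2 (frozen ViT + Q-Former + frozen OPT LLM).
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
prompt = "Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```
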
Model Name | Arxiv Time | Paper | Code |
---|---|---|---
LLaVA | Apr 17 2023 | paper | official |
MiniGPT-4 | Apr 20 2023 | paper | official
Otter | May 5 2023 | paper | official |
InstructBLIP | May 11 2023 | paper | official |
VisionLLM | May 18 2023 | paper | official |
KOSMOS-2 | Jun 26 2023 | paper | official |
Emu | Jul 11 2023 | paper | official |
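
These models are visual instruction-tuned, so they can be prompted conversationally about an image. A minimal sketch with the community LLaVA-1.5 port available in recent `transformers` versions, assuming the `llava-hf/llava-1.5-7b-hf` checkpoint and a CUDA GPU:

```python
# Minimal sketch: chatting about an image with LLaVA-1.5 via transformers.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
prompt = "USER: <image>\nDescribe this image in one sentence. ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(out[0], skip_special_tokens=True))
```
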
Model Name | Arxiv Time | Paper | Code |
---|---|---|---
MAGMA | Dec 9 2021 | paper | official |
VL-Adapter | Dec 13 2021 | paper | official |
LiMBeR | Sep 30 2022 | paper | official |
LLaMA-Adapter | Mar 28 2023 | paper | official |
LLaMA-Adapter-v2 | Apr 28 2023 | paper | official |
UniAdapter | May 21 2023 | paper | official |
ImageBind-LLM | Sep 11 2023 | paper | official |
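
The common idea in this table is parameter efficiency: the vision encoder and the LLM stay frozen, and only a small bridging module (a linear map in LiMBeR, lightweight adapters/prompts in the others) is trained. The sketch below is a schematic LiMBeR-style projection; the class name, dimensions, and prefix-token count are illustrative assumptions, not any official implementation.

```python
# Schematic frozen-vision-to-frozen-LLM bridge with a single trainable projection.
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096, num_prefix_tokens: int = 4):
        super().__init__()
        # Maps pooled vision features into `num_prefix_tokens` soft-prompt vectors for the LLM.
        self.proj = nn.Linear(vision_dim, llm_dim * num_prefix_tokens)
        self.num_prefix_tokens = num_prefix_tokens
        self.llm_dim = llm_dim

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        b = vision_features.size(0)
        return self.proj(vision_features).view(b, self.num_prefix_tokens, self.llm_dim)

# Toy usage: CLIP-like pooled features -> prefix tokens to prepend to the LLM's input embeddings.
adapter = VisionToLLMAdapter()
prefix = adapter(torch.randn(2, 768))
print(prefix.shape)  # torch.Size([2, 4, 4096])
# During training, only adapter.parameters() would receive gradients;
# the vision encoder and the LLM stay frozen.
```
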
Model Name | Arxiv Time | Paper | Code |
---|---|---|---
Visual Programming | Nov 18 2022 | paper | official |
ViperGPT | Mar 14 2023 | paper | official |
MM-React | Mar 20 2023 | paper | official |
Chameleon | May 24 2023 | paper | official |
HuggingGPT | May 25 2023 | paper | official |
IdealGPT | May 24 2023 | paper | official |
NExT-GPT | Sep 13 2023 | paper | |
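
Rather than training a new model, these systems use an LLM as a controller that plans calls to external vision tools and composes the results. The sketch below is a schematic of that loop with mocked tools and a keyword "planner" standing in for the LLM; none of the names correspond to any real system's API.

```python
# Schematic "LLM as controller" tool-dispatch loop (everything here is a mock for illustration).
from typing import Callable, Dict

def caption_image(path: str) -> str:
    return f"a caption for {path}"          # stand-in for a captioning model

def detect_objects(path: str) -> str:
    return f"objects detected in {path}"    # stand-in for an object detector

TOOLS: Dict[str, Callable[[str], str]] = {
    "caption": caption_image,
    "detect": detect_objects,
}

def mock_planner(question: str) -> str:
    # A real system would prompt an LLM to choose the tool; here a keyword rule stands in.
    return "detect" if "how many" in question.lower() else "caption"

def answer(question: str, image_path: str) -> str:
    tool = mock_planner(question)
    observation = TOOLS[tool](image_path)
    # A real system would feed the observation back to the LLM to compose the final answer.
    return f"[{tool}] {observation}"

print(answer("How many chairs are in the photo?", "photo.jpg"))
```
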
Dataset | Time | Size | Format | Task | Link |
---|---|---|---|---|---|
SBU Captions | 2011 | 1M | image-text pairs | pretraining/image captioning | https://vislang.ai/sbu-explorer |
YFCC-100M | 2015 | 100M | image-text pairs | pretraining | https://multimediacommons.wordpress.com/yfcc100m-core-dataset/ |
CC3M | 2018 | 3M | image-text pairs | pretraining/image captioning | https://github.com/google-research-datasets/conceptual-captions |
LAIT | 2020 | 10M | image-text pairs | pretraining | |
Localized Narratives | 2020 | 849K | image-text pairs | pretraining | https://google.github.io/localized-narratives/ |
CC12M | 2021 | 12M | image-text pairs | pretraining | https://github.com/google-research-datasets/conceptual-12m |
LAION-400M | 2021 | 400M | image-text pairs | pretraining | https://laion.ai/laion-400-open-dataset/ |
RedCaps | 2021 | 12M | image-text pairs | pretraining | https://redcaps.xyz/ |
WIT | 2021 | 37.5M | image-text pairs | pretraining | https://github.com/google-research-datasets/wit |
LAION-5B | 2022 | 5B | image-text pairs | pretraining | https://laion.ai/blog/laion-5b/ |
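
All of these corpora boil down to (image, caption) pairs, which is also how pretraining code usually consumes them. The sketch below is a schematic PyTorch `Dataset` over a hypothetical local TSV of `image_path<TAB>caption` lines; the web-scale sets (LAION, CC12M, ...) are in practice downloaded and streamed with sharded tooling such as webdataset instead.

```python
# Schematic image-text pair Dataset over a hypothetical local TSV file.
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class ImageTextPairs(Dataset):
    def __init__(self, tsv_path: str, transform=None):
        lines = Path(tsv_path).read_text().splitlines()
        self.pairs = [line.split("\t", 1) for line in lines if line.strip()]
        self.transform = transform

    def __len__(self) -> int:
        return len(self.pairs)

    def __getitem__(self, idx: int):
        image_path, caption = self.pairs[idx]
        image = Image.open(image_path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, caption
```
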
Dataset | Time | Size | Format | Task | Link |
---|---|---|---|---|---|
Flickr30k | 2014 | 30K | image-text pairs | image captioning | https://arxiv.org/abs/1505.04870 |
COCO | 2014 | 567K | image-text pairs | image captioning | https://cocodataset.org/#home |
TextCaps | 2020 | 28K | image-text pairs | image captioning | https://textvqa.org/textcaps/ |
Dataset | Time | Size | Format | Task | Link |
---|---|---|---|---|---|
Visual Genome | 2017 | 108K | image-question-answer pairs, region descriptions | VQA/pretraining | https://homes.cs.washington.edu/~ranjay/visualgenome/index.html |
VQA v2 | 2017 | 1.1M | question-answer pairs | VQA | https://visualqa.org/ |
TextVQA | 2019 | 28K | image-question-answer pairs | VQA | https://textvqa.org/ |
OCR-VQA | 2019 | 1M | image-question-answer pairs | VQA | https://ocr-vqa.github.io/ |
ST-VQA | 2019 | 31K | image-question-answer pairs | VQA | https://arxiv.org/abs/1905.13648 |
OK-VQA | 2019 | 14K | image-question-answer pairs | VQA | https://okvqa.allenai.org/ |
VizWiz | 2020 | 20K | image-question-answer pairs | VQA | https://vizwiz.org/tasks-and-datasets/vqa/ |
IconQA | 2021 | 107K | image-question-answer pairs | VQA | https://iconqa.github.io/ |
ScienceQA | 2022 | 21K | image-question-answer pairs | VQA | https://github.com/lupantech/ScienceQA |
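
Most of the VQA benchmarks above report the standard VQA accuracy, which soft-matches a prediction against the 10 human answers collected per question. A sketch of the core rule (the official evaluator additionally normalizes answers and averages over annotator subsets):

```python
# Core VQA-v2-style accuracy: an answer counts as fully correct if at least 3 of the
# 10 annotators gave it; fewer matches earn partial credit.
from typing import List

def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    matches = sum(ans.strip().lower() == prediction.strip().lower() for ans in human_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("2", ["2", "two", "2", "2", "3", "2", "2", "two", "2", "2"]))  # 1.0
```
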
Dataset | Time | Size | Format | Task | Link |
---|---|---|---|---|---|
NLVR | 2017 | 92K | image-grounded statements | reasoning | https://lil.nlp.cornell.edu/nlvr/ |
GQA | 2019 | 1M | image-question-answer pairs | visual reasoning/question answering | https://cs.stanford.edu/people/dorarad/gqa/about.html |
Visual Commonsense Reasoning | 2019 | 110K | image-question-answer pairs | reasoning | https://visualcommonsense.com/ |
SNLI-VE | 2019 | 530K | image-hypothesis pairs | reasoning | https://github.com/necla-ml/SNLI-VE |
Winoground | 2022 | | image-text pairs | reasoning | https://huggingface.co/datasets/facebook/winoground |
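
Winoground is distributed through the Hugging Face Hub (link above) as a single gated test split of 400 examples, each pairing two images with two captions that use the same words in a different order. A minimal loading sketch, assuming you have accepted the dataset's terms and are logged in to the Hub (e.g. via `huggingface-cli login`):

```python
# Minimal sketch: loading Winoground from the Hugging Face Hub (gated dataset).
from datasets import load_dataset

winoground = load_dataset("facebook/winoground")["test"]
example = winoground[0]
print(example["caption_0"], "|", example["caption_1"])
```
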