A collection of vision foundation models unifying understanding and generation.


Awesome Vision Foundation Models Unifying Understanding and Generation

Solving all language tasks and exploring scaling laws with LLMs is exciting, so how about building vision foundation models to unlock vision AGI? Likewise, we are now good at handling vision tasks separately, so how about unifying vision understanding and generation to solve them all together?
Because the long-term roadmap for building a generalist foundation model that solves all vision tasks has not yet been fully determined, we conduct the first survey to provide a comprehensive summary and in-depth analysis of vision foundation models unifying both understanding and generation.
This repository provides a curated list of related papers and resources. Come take a look, and let's share insights on unresolved challenges together! Please stay tuned and give us a 🌟 if you are interested in our project; we will continuously update both our paper and this repo with the latest advancements.


🔥 News

[2024-12-24] We opened this repo and look forward to your feedback.

[2024-12-02] We built this repo; a complete and revised version of our survey is coming soon.

[2024-10-29] We released survey v1: Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective.


🍻 Contributions

Both suggestions about our project and submissions of your own work are welcome! Please feel free to open an issue, submit a pull request, or send us an email. We will update the list regularly.


📃 Table of Contents


Vision Foundation Models Unifying Understanding and Generation

Autoregression

| Title | Name | Venue | Date | Organization | Open Source |
| --- | --- | --- | --- | --- | --- |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | InternVL 2.5 | arXiv | 2024-12-06 | Shanghai AI Lab, SenseTime, THU, NJU, FDU, CUHK, SJTU | Github |
| TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | TokenFlow | arXiv | 2024-12-04 | ByteDance | Github |
| JetFormer: An Autoregressive Generative Model of Raw Images and Text | JetFormer | arXiv | 2024-11-29 | Google DeepMind | - |
| MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | MUSE-VL | arXiv | 2024-11-26 | ByteDance | - |
| UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing | UniPose | arXiv | 2024-11-25 | CAS | - |
| SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE | SAR3D | arXiv | 2024-11-25 | NTU, Shanghai AI Lab | Github |
| LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models | LLaMA-Mesh | arXiv | 2024-11-14 | THU, NVIDIA | Github |
| Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | Janus | arXiv | 2024-10-17 | DeepSeek, HKU, PKU | Github |
| UniMuMo: Unified Text, Music and Motion Generation | UniMuMo | arXiv | 2024-10-06 | CUHK, UW, UBC, UMass Amherst, MIT-IBM Watson AI Lab, Cisco | Github |
| Emu3: Next-Token Prediction is All You Need | Emu3 | arXiv | 2024-09-27 | BAAI | Github |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Qwen2-VL | arXiv | 2024-09-18 | Alibaba | Github |
| VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | VILA-U | arXiv | 2024-09-06 | THU, MIT, NVIDIA, UC Berkeley, UC San Diego | Github |
| xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | xGen-MM | arXiv | 2024-08-16 | Salesforce, UW | Github |
| ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation | ANOLE | arXiv | 2024-06-08 | SJTU, Shanghai AI Lab, FDU, GAIR | Github |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models | Chameleon | arXiv | 2024-05-16 | Meta FAIR | Github |
| GPT-4o System Card | GPT-4o | arXiv | 2024-05-13 | OpenAI | - |
| WorldGPT: Empowering LLM as Multimodal World Model | WorldGPT | MM | 2024-04-28 | ZJU, NUS | Github |
| GiT: Towards Generalist Vision Transformer through Universal Language Interface | GiT | ECCV | 2024-03-14 | PKU, Max Planck Institute for Informatics, CUHK, ETH Zurich | Github |
| UniCode: Learning a Unified Codebook for Multimodal Large Language Models | UniCode | ECCV | 2024-03-14 | BAAI, PKU | - |
| Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | Video-LaVIT | ICML | 2024-02-05 | PKU, Kuaishou | Github |
| Scalable Pre-training of Large Autoregressive Image Models | AIM | ICML | 2024-01-16 | Apple | Github |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action | Unified-IO 2 | CVPR | 2023-12-28 | Allen Institute for AI, UIUC, UW | Github |
| Gemini: A Family of Highly Capable Multimodal Models | Gemini | arXiv | 2023-12-06 | Google | - |
| Sequential Modeling Enables Scalable Learning for Large Vision Models | LVM | CVPR | 2023-12-01 | UC Berkeley, JHU | Github |
| ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model | ShapeGPT | arXiv | 2023-11-29 | FDU, Tencent, ShanghaiTech, Zhejiang Lab | Github |
| TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models | TEAL | arXiv | 2023-11-08 | Tencent | - |
| Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | LaVIT | ICLR | 2023-09-09 | PKU, Kuaishou | Github |
| Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks | Unified-IO | ICLR | 2022-06-17 | Allen Institute for AI, UW | Github |
| Write and Paint: Generative Vision-Language Models are Unified Modal Learners | DAVINCI | ICLR | 2022-06-15 | HKUST, ByteDance, SJTU | Github |
| DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training | DU-VLG | ACL | 2022-03-17 | Baidu | - |

Diffusion

| Title | Name | Venue | Date | Organization | Open Source |
| --- | --- | --- | --- | --- | --- |
| OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows | OmniFlow | arXiv | 2024-12-02 | UCLA, Panasonic AI Research, Salesforce | - |
| MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks | MoTe | arXiv | 2024-11-29 | HKU, NUS, ZJU | - |
| LaVin-DiT: Large Vision Diffusion Transformer | LaVin-DiT | arXiv | 2024-11-18 | USYD, NUS, UniMelb, AIsphere | Github |
| Unimotion: Unifying 3D Human Motion Synthesis and Understanding | Unimotion | arXiv | 2024-09-24 | Tuebingen U, Max Planck Institute for Informatics | Github |
| GenRec: Unifying Video Generation and Recognition with Diffusion Models | GenRec | NeurIPS | 2024-08-27 | FDU, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, UMD | Github |
| One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale | UniDiffuser | ICML | 2023-03-12 | THU, Shengshu, RUC, BAAI, Pazhou Lab | Github |

Autoregression and Diffusion

| Title | Name | Venue | Date | Organization | Open Source |
| --- | --- | --- | --- | --- | --- |
| MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | MetaMorph | arXiv | 2024-12-18 | Meta, NYU | - |
| ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | ILLUME | arXiv | 2024-12-09 | HUAWEI | - |
| Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | Orthus | arXiv | 2024-11-28 | Kuaishou | - |
| JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | JanusFlow | arXiv | 2024-11-12 | DeepSeek, PKU, HKU, THU | Github |
| PUMA: Empowering Unified MLLM with Multi-granular Visual Generation | PUMA | arXiv | 2024-10-17 | CUHK, HKU, SenseTime, Shanghai AI Lab, THU | Github |
| Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | Vitron | NeurIPS | 2024-09-26 | Skywork AI, NUS, NTU | Github |
| MIO: A Foundation Model on Multimodal Tokens | MIO | arXiv | 2024-09-26 | BHU, 01.AI, M-A-P, PolyU, UAlberta, UWaterloo, UoM, CAS, PKU, HKUST | Github |
| MonoFormer: One Transformer for Both Diffusion and Autoregression | MonoFormer | arXiv | 2024-09-24 | Baidu, UTS | Github |
| Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | Show-o | arXiv | 2024-08-22 | NUS, ByteDance | Github |
| Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | Transfusion | arXiv | 2024-08-20 | Meta, Waymo, USC | Github |
| UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model | UnifiedMLLM | arXiv | 2024-08-05 | ByteDance, FDU, USTC | Github |
| Harmonizing Visual Text Comprehension and Generation | TextHarmony | NeurIPS | 2024-06-23 | ECNU, ByteDance | Github |
| Generative Visual Instruction Tuning | GenLLaVA | arXiv | 2024-06-17 | Rice, Google DeepMind | Github |
| VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | VisionLLM v2 | arXiv | 2024-06-12 | Shanghai AI Lab, HKU, THU, BIT, HKUST, NJU, SenseTime | Github |
| SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | SEED-X | arXiv | 2024-04-22 | Tencent | Github |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | AnyGPT | arXiv | 2024-02-19 | FDU, Multimodal Art Projection Research Community, Shanghai AI Lab | Github |
| MM-Interleaved: Interleaved Image-Text Generation via Multi-modal Feature Synchronizer | MM-Interleaved | arXiv | 2024-01-18 | Shanghai AI Lab, CUHK, THU, SenseTime, UofT | Github |
| Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models | Uni3D-LLM | arXiv | 2024-01-09 | Shanghai AI Lab, DUT, SDU | - |
| Generative Multimodal Models are In-Context Learners | Emu2 | CVPR | 2023-12-20 | BAAI, THU, PKU | Github |
| VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | VL-GPT | arXiv | 2023-12-14 | XJTU, Tencent, HKU | Github |
| GPT4Point: A Unified Framework for Point-Language Understanding and Generation | GPT4Point | CVPR | 2023-12-05 | HKU, CUHK, FDU, SJTU, Shanghai AI Lab | Github |
| CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | CoDi-2 | CVPR | 2023-11-30 | UC Berkeley, Microsoft, Zoom, UNC Chapel Hill | Github |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | GPT4Video | MM | 2023-11-25 | USYD, Tencent | Github |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | DreamLLM | ICLR | 2023-09-20 | XJTU, IIISCT, MEGVII Technology, THU, HUST, Shanghai AI Lab, Shanghai Qi Zhi Institute | Github |
| NExT-GPT: Any-to-Any Multimodal LLM | NExT-GPT | ICML | 2023-09-11 | NUS | Github |
| Emu: Generative Pretraining in Multimodality | Emu | ICLR | 2023-07-11 | BAAI, THU, PKU | Github |
| Generating Images with Multimodal Language Models | GILL | NeurIPS | 2023-05-26 | CMU | Github |

πŸ‘ Related Survey

Here, we also recommend some wonderful related surveys.

  1. Diffusion Models in Vision: A Survey | Project Repo
  2. Autoregressive Models in Vision: A Survey | Project Repo
  3. A Survey on Multimodal Large Language Models | Project Repo
  4. Foundational Models Defining a New Era in Vision: A Survey and Outlook | Project Repo
  5. Multi-modal Generative AI: Multi-modal LLM, Diffusion and Beyond

🫢 Citation

If you find our work useful, please cite the paper in the following format. We would appreciate it a lot.

Survey v1: Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

@article{xie2024arvfm,
  title={Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective},
  author={Xie, Shenghao and Zu, Wenqiang and Zhao, Mingyang and Su, Duo and Liu, Shilong and Shi, Ruohua and Li, Guoqi and Zhang, Shanghang and Ma, Lei},
  journal={arXiv preprint arXiv:2410.22217},
  year={2024}
}

The full version of our survey is coming soon!

