A collection of vision foundation models unifying understanding and generation.


Awesome Vision Foundation Models Unifying Understanding and Generation

Solving all language tasks and exploring scaling laws with LLMs is exciting, so how about building vision foundation models to unlock vision AGI? Likewise, we are now good at handling vision tasks separately, so how about unifying vision understanding and generation to solve them all together?
Because the long-term roadmap for building a generalist foundation model that solves all vision tasks has not yet been fully determined, we conduct the first survey to provide a comprehensive summary and in-depth analysis of vision foundation models unifying both understanding and generation.
This repository provides a curated list of related papers and resources. Come take a look, and let's share insights on unresolved challenges together! Please stay tuned and give us a 🌟 if you are interested in our project; we will continuously update both our paper and this repo with the latest advancements.


🔥 News

[2024-12-24] We opened this repo and look forward to your feedback.

[2024-12-02] We built this repo; a complete and revised version of our survey is coming soon.

[2024-10-29] We released survey v1: Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective.


🍻 Contributions

Both suggestions about our project and submissions of your own work are welcome! Please feel free to open an issue, submit a pull request, or send us an email. We will update the list regularly.


📃 Table of Contents


Vision Foundation Models Unifying Understanding and Generation

Autoregression

| Title | Name | Venue | Date | Organization | Open Source |
| --- | --- | --- | --- | --- | --- |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | InternVL 2.5 | arXiv | 2024-12-06 | Shanghai AI Lab, SenseTime, THU, NJU, FDU, CUHK, SJTU | Github |
| TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation | TokenFlow | arXiv | 2024-12-04 | ByteDance | Github |
| JetFormer: An Autoregressive Generative Model of Raw Images and Text | JetFormer | arXiv | 2024-11-29 | Google DeepMind | - |
| MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding | MUSE-VL | arXiv | 2024-11-26 | ByteDance | - |
| UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing | UniPose | arXiv | 2024-11-25 | CAS | - |
| SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE | SAR3D | arXiv | 2024-11-25 | NTU, Shanghai AI Lab | Github |
| LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models | LLaMA-Mesh | arXiv | 2024-11-14 | THU, NVIDIA | Github |
| Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation | Janus | arXiv | 2024-10-17 | DeepSeek, HKU, PKU | Github |
| UniMuMo: Unified Text, Music and Motion Generation | UniMuMo | arXiv | 2024-10-06 | CUHK, UW, UBC, UMass Amherst, MIT-IBM Watson AI Lab, Cisco | Github |
| Emu3: Next-Token Prediction is All You Need | Emu3 | arXiv | 2024-09-27 | BAAI | Github |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | Qwen2-VL | arXiv | 2024-09-18 | Alibaba | Github |
| VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation | VILA-U | arXiv | 2024-09-06 | THU, MIT, NVIDIA, UC Berkeley, UC San Diego | Github |
| xGen-MM (BLIP-3): A Family of Open Large Multimodal Models | xGen-MM | arXiv | 2024-08-16 | Salesforce, UW | Github |
| ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation | ANOLE | arXiv | 2024-06-08 | SJTU, Shanghai AI Lab, FDU, GAIR | Github |
| Chameleon: Mixed-Modal Early-Fusion Foundation Models | Chameleon | arXiv | 2024-05-16 | Meta FAIR | Github |
| GPT-4o System Card | GPT-4o | arXiv | 2024-05-13 | OpenAI | - |
| WorldGPT: Empowering LLM as Multimodal World Model | WorldGPT | MM | 2024-04-28 | ZJU, NUS | Github |
| GiT: Towards Generalist Vision Transformer through Universal Language Interface | GiT | ECCV | 2024-03-14 | PKU, Max Planck Institute for Informatics, CUHK, ETH Zurich | Github |
| UniCode: Learning a Unified Codebook for Multimodal Large Language Models | UniCode | ECCV | 2024-03-14 | BAAI, PKU | - |
| Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | Video-LaVIT | ICML | 2024-02-05 | PKU, Kuaishou | Github |
| Scalable Pre-training of Large Autoregressive Image Models | AIM | ICML | 2024-01-16 | Apple | Github |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action | Unified-IO 2 | CVPR | 2023-12-28 | Allen Institute for AI, UIUC, UW | Github |
| Gemini: A Family of Highly Capable Multimodal Models | Gemini | arXiv | 2023-12-06 | Google | - |
| Sequential Modeling Enables Scalable Learning for Large Vision Models | LVM | CVPR | 2023-12-01 | UC Berkeley, JHU | Github |
| ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model | ShapeGPT | arXiv | 2023-11-29 | FDU, Tencent, ShanghaiTech, Zhejiang Lab | Github |
| TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models | TEAL | arXiv | 2023-11-08 | Tencent | - |
| Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | LaVIT | ICLR | 2023-09-09 | PKU, Kuaishou | Github |
| Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks | Unified-IO | ICLR | 2022-06-17 | Allen Institute for AI, UW | Github |
| Write and Paint: Generative Vision-Language Models are Unified Modal Learners | DAVINCI | ICLR | 2022-06-15 | HKUST, ByteDance, SJTU | Github |
| DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training | DU-VLG | ACL | 2022-03-17 | Baidu | - |

Diffusion

| Title | Name | Venue | Date | Organization | Open Source |
| --- | --- | --- | --- | --- | --- |
| OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows | OmniFlow | arXiv | 2024-12-02 | UCLA, Panasonic AI Research, Salesforce | - |
| MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks | MoTe | arXiv | 2024-11-29 | HKU, NUS, ZJU | - |
| LaVin-DiT: Large Vision Diffusion Transformer | LaVin-DiT | arXiv | 2024-11-18 | USYD, NUS, UniMelb, AIsphere | Github |
| Unimotion: Unifying 3D Human Motion Synthesis and Understanding | Unimotion | arXiv | 2024-09-24 | Tuebingen U, Max Planck Institute for Informatics | Github |
| GenRec: Unifying Video Generation and Recognition with Diffusion Models | GenRec | NeurIPS | 2024-08-27 | FDU, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, UMD | Github |
| One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale | UniDiffuser | ICML | 2023-03-12 | THU, Shengshu, RUC, BAAI, Pazhou Lab | Github |

Autoregression and Diffusion

| Title | Name | Venue | Date | Organization | Open Source |
| --- | --- | --- | --- | --- | --- |
| MetaMorph: Multimodal Understanding and Generation via Instruction Tuning | MetaMorph | arXiv | 2024-12-18 | Meta, NYU | - |
| ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance | ILLUME | arXiv | 2024-12-09 | HUAWEI | - |
| Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads | Orthus | arXiv | 2024-11-28 | Kuaishou | - |
| JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | JanusFlow | arXiv | 2024-11-12 | DeepSeek, PKU, HKU, THU | Github |
| PUMA: Empowering Unified MLLM with Multi-granular Visual Generation | PUMA | arXiv | 2024-10-17 | CUHK, HKU, SenseTime, Shanghai AI Lab, THU | Github |
| Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | Vitron | NeurIPS | 2024-09-26 | Skywork AI, NUS, NTU | Github |
| MIO: A Foundation Model on Multimodal Tokens | MIO | arXiv | 2024-09-26 | BHU, 01.AI, M-A-P, PolyU, UAlberta, UWaterloo, UoM, CAS, PKU, HKUST | Github |
| MonoFormer: One Transformer for Both Diffusion and Autoregression | MonoFormer | arXiv | 2024-09-24 | Baidu, UTS | Github |
| Show-o: One Single Transformer to Unify Multimodal Understanding and Generation | Show-o | arXiv | 2024-08-22 | NUS, ByteDance | Github |
| Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model | Transfusion | arXiv | 2024-08-20 | Meta, Waymo, USC | Github |
| UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model | UnifiedMLLM | arXiv | 2024-08-05 | ByteDance, FDU, USTC | Github |
| Harmonizing Visual Text Comprehension and Generation | TextHarmony | NeurIPS | 2024-06-23 | ECNU, ByteDance | Github |
| Generative Visual Instruction Tuning | GenLLaVA | arXiv | 2024-06-17 | Rice, Google DeepMind | Github |
| VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | VisionLLM v2 | arXiv | 2024-06-12 | Shanghai AI Lab, HKU, THU, BIT, HKUST, NJU, SenseTime | Github |
| SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | SEED-X | arXiv | 2024-04-22 | Tencent | Github |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | AnyGPT | arXiv | 2024-02-19 | FDU, Multimodal Art Projection Research Community, Shanghai AI Lab | Github |
| MM-Interleaved: Interleaved Image-Text Generation via Multi-modal Feature Synchronizer | MM-Interleaved | arXiv | 2024-01-18 | Shanghai AI Lab, CUHK, THU, SenseTime, UofT | Github |
| Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models | Uni3D-LLM | arXiv | 2024-01-09 | Shanghai AI Lab, DUT, SDU | - |
| Generative Multimodal Models are In-Context Learners | Emu2 | CVPR | 2023-12-20 | BAAI, THU, PKU | Github |
| VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | VL-GPT | arXiv | 2023-12-14 | XJTU, Tencent, HKU | Github |
| GPT4Point: A Unified Framework for Point-Language Understanding and Generation | GPT4Point | CVPR | 2023-12-05 | HKU, CUHK, FDU, SJTU, Shanghai AI Lab | Github |
| CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | CoDi-2 | CVPR | 2023-11-30 | UC Berkeley, Microsoft, Zoom, UNC Chapel Hill | Github |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | GPT4Video | MM | 2023-11-25 | USYD, Tencent | Github |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | DreamLLM | ICLR | 2023-09-20 | XJTU, IIISCT, MEGVII Technology, THU, HUST, Shanghai AI Lab, Shanghai Qi Zhi Institute | Github |
| NExT-GPT: Any-to-Any Multimodal LLM | NExT-GPT | ICML | 2023-09-11 | NUS | Github |
| Emu: Generative Pretraining in Multimodality | Emu | ICLR | 2023-07-11 | BAAI, THU, PKU | Github |
| Generating Images with Multimodal Language Models | GILL | NeurIPS | 2023-05-26 | CMU | Github |

πŸ‘ Related Survey

Here, we also recommend some wonderful related surveys.

  1. Diffusion Models in Vision: A Survey | Project Repo
  2. Autoregressive Models in Vision: A Survey | Project Repo
  3. A Survey on Multimodal Large Language Models | Project Repo
  4. Foundational Models Defining a New Era in Vision: A Survey and Outlook | Project Repo
  5. Multi-modal Generative AI: Multi-modal LLM, Diffusion and Beyond

🫢 Citation

If you find our work useful, please cite the paper in the following format. We would appreciate it a lot.

Survey v1: Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

@article{xie2024arvfm,
  title={Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective},
  author={Xie, Shenghao and Zu, Wenqiang and Zhao, Mingyang and Su, Duo and Liu, Shilong and Shi, Ruohua and Li, Guoqi and Zhang, Shanghang and Ma, Lei},
  journal={arXiv preprint arXiv:2410.22217},
  year={2024}
}

The full version of our survey is coming soon!

