Solving all language tasks and exploring scaling laws with LLMs is exciting. How about building vision foundation models to unlock vision AGI? Also, we are now good at handling vision tasks separately. How about unifying vision understanding and generation to solve all vision tasks together?
As the long-term roadmap for building a generalist foundation model to solve all vision tasks has not yet been fully determined, we conduct the first survey to provide comprehensive summary and in-depth analysis on vision foundation models unifying both understanding and generation.
This repository provides a curated list of related papers and resources. Come take a look, and let's share insights on unresolved challenges together! Please stay tuned and give us a ⭐ if you are interested in our project; we will continuously update the latest advancements in both our paper and this repo.
[2024-12-24]
We open this repo, looking forward to your feedback.
[2024-12-02]
We build this repo; a complete and revised version of our survey is coming soon.
[2024-10-29]
We have released the survey v1: Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective.
Both suggestions about our project and sharing of your own work are welcome! Please feel free to open an issue, submit a pull request, or send us an email. We will update the list regularly.
Title | Name | Venue | Date | Organization | Open Source |
---|---|---|---|---|---|
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows | OmniFlow | arXiv | 2024-12-02 | UCLA, Panasonic AI Research, Salesforce | - |
MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks | MoTe | arXiv | 2024-11-29 | HKU, NUS, ZJU | - |
LaVin-DiT: Large Vision Diffusion Transformer | LaVin-DiT | arXiv | 2024-11-18 | USYD, NUS, UniMelb, AIsphere | Github |
Unimotion: Unifying 3D Human Motion Synthesis and Understanding | Unimotion | arXiv | 2024-09-24 | Tuebingen U, Max Planck Institute for Informatics | Github |
GenRec: Unifying Video Generation and Recognition with Diffusion Models | GenRec | NeurIPS | 2024-08-27 | FDU, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, UMD | Github |
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale | UniDiffuser | ICML | 2023-03-12 | THU, Shengshu, RUC, BAAI, Pazhou Lab | Github |
Here, we also recommend some wonderful related surveys.
- Diffusion Models in Vision: A Survey | Project Repo
- Autoregressive Models in Vision: A Survey | Project Repo
- A Survey on Multimodal Large Language Models | Project Repo
- Foundational Models Defining a New Era in Vision: A Survey and Outlook | Project Repo
- Multi-Modal Generative AI: Multi-Modal LLM, Diffusion and Beyond
If you find our work useful, please cite the paper in the following format. We would appreciate it a lot.
Survey v1: Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective
@article{xie2024arvfm,
title={Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective},
author={Xie, Shenghao and Zu, Wenqiang and Zhao, Mingyang and Su, Duo and Liu, Shilong and Shi, Ruohua and Li, Guoqi and Zhang, Shanghang and Ma, Lei},
journal={arXiv preprint arXiv:2410.22217},
year={2024}
}