Commit

update
DerrickWang005 authored Nov 23, 2024
1 parent 57fc6bc commit 7476363
Showing 3 changed files with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions index.html
@@ -158,7 +158,7 @@ <h1 class="title is-1 mathvista_other">
<div class="column is-four-fifths">
<div class="content has-text-justified">
<p>
- Large language models (LLMs) like GPT and LLaMA have rapidly gained widespread attention and transformed the field, demonstrating the strong capability to handle a wide range of language tasks within a unified framework. This breakthrough of integrating diverse language tasks into a single large model has sparked momentum to develop similar large models for computer vision. The potential to create large vision models~(LVMs) capable of generalizing across multiple vision tasks represents a promising step toward a more versatile, scalable, and efficient approach to vision-based AI.
+ Large language models (LLMs) like GPT and LLaMA have rapidly gained widespread attention and transformed the field, demonstrating the strong capability to handle a wide range of language tasks within a unified framework. This breakthrough of integrating diverse language tasks into a single large model has sparked momentum to develop similar large models for computer vision. The potential to create large vision models (LVMs) capable of generalizing across multiple vision tasks represents a promising step toward a more versatile, scalable, and efficient approach to vision-based AI.
</p>
<p>
However, constructing LVMs presents greater complexity than LLMs due to the inherently diverse and high-dimensional nature of vision data, as well as the need to handle variations in scale, perspective, and lighting across tasks. To handle the problem, recent work has developed a sequential modeling method that learns from purely vision data by representing images, videos, and annotations in a unified "visual sentence" format. This method enables the model to predict sequential vision tokens from a vast dataset, entirely independent of language-based inputs. Although this method has shown promising results in diverse vision tasks, it faces two primary challenges. Specifically, the first issue concerns the efficiency limitations inherent in autoregressive sequence modeling, as it demands token-by-token prediction, which is computationally intensive for high-dimensional vision data. The second issue is the disruption of spatial coherence when converting vision data into a sequential format, which compromises the preservation of spatial dependencies crucial for performance in vision tasks.
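The token-by-token cost described in the paragraph above can be illustrated with a short back-of-the-envelope sketch (the resolution and patch size below are assumptions for illustration, not values from the paper):

```python
# Illustrative sketch: why autoregressive decoding is expensive for
# high-dimensional vision data. Resolution and patch size are assumed
# for illustration only.

def autoregressive_steps(height: int, width: int, patch: int) -> int:
    """Sequential forward passes needed to decode one image token-by-token."""
    return (height // patch) * (width // patch)

def attention_pairs(n_tokens: int) -> int:
    """Total causal-attention token pairs summed over all decode steps."""
    return sum(t + 1 for t in range(n_tokens))  # step t attends to t + 1 tokens

if __name__ == "__main__":
    n = autoregressive_steps(256, 256, 16)  # 256 sequential steps per image
    print(n, attention_pairs(n))            # total attention work grows quadratically with n
```

Each generated token requires a full forward pass, so decoding is inherently sequential, and total attention work grows quadratically with token count; a diffusion transformer instead denoises all tokens in parallel at each step, which is one motivation for diffusion-based designs such as LaVin-DiT.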
@@ -462,7 +462,7 @@ <h1 class="title is-1 mathvista_other" id="citation">Citation</h1>
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<pre><code>
- @article{run2024mmevol,
+ @article{wang2024lavin,
title={LaVin-DiT: Large Vision Diffusion Transformer},
author={Zhaoqing Wang, Xiaobo Xia, Runnan Chen, Dongdong Yu, Changhu Wang, Mingming Gong, Tongliang Liu},
journal={arXiv preprint arXiv:2411.11505},
Binary file modified static/figs/exp5.png
Binary file modified static/figs/logo.png

0 comments on commit 7476363