Commit

update
DerrickWang005 authored Nov 23, 2024
1 parent 57fc6bc commit 7476363
Showing 3 changed files with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions index.html
@@ -158,7 +158,7 @@ <h1 class="title is-1 mathvista_other">
<div class="column is-four-fifths">
<div class="content has-text-justified">
<p>
- Large language models (LLMs) like GPT and LLaMA have rapidly gained widespread attention and transformed the field, demonstrating the strong capability to handle a wide range of language tasks within a unified framework. This breakthrough of integrating diverse language tasks into a single large model has sparked momentum to develop similar large models for computer vision. The potential to create large vision models~(LVMs) capable of generalizing across multiple vision tasks represents a promising step toward a more versatile, scalable, and efficient approach to vision-based AI.
+ Large language models (LLMs) like GPT and LLaMA have rapidly gained widespread attention and transformed the field, demonstrating the strong capability to handle a wide range of language tasks within a unified framework. This breakthrough of integrating diverse language tasks into a single large model has sparked momentum to develop similar large models for computer vision. The potential to create large vision models (LVMs) capable of generalizing across multiple vision tasks represents a promising step toward a more versatile, scalable, and efficient approach to vision-based AI.
</p>
<p>
However, constructing LVMs presents greater complexity than LLMs due to the inherently diverse and high-dimensional nature of vision data, as well as the need to handle variations in scale, perspective, and lighting across tasks. To handle the problem, recent work has developed a sequential modeling method that learns from purely vision data by representing images, videos, and annotations in a unified "visual sentence" format. This method enables the model to predict sequential vision tokens from a vast dataset, entirely independent of language-based inputs. Although this method has shown promising results in diverse vision tasks, it faces two primary challenges. Specifically, the first issue concerns the efficiency limitations inherent in autoregressive sequence modeling, as it demands token-by-token prediction, which is computationally intensive for high-dimensional vision data. The second issue is the disruption of spatial coherence when converting vision data into a sequential format, which compromises the preservation of spatial dependencies crucial for performance in vision tasks.
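The token-by-token cost described in the paragraph above can be illustrated with a short back-of-the-envelope sketch (the resolution and patch size below are assumptions for illustration, not values from the paper):

```python
# Illustrative sketch: why autoregressive decoding is expensive for
# high-dimensional vision data. Resolution and patch size are assumed
# for illustration only.

def autoregressive_steps(height: int, width: int, patch: int) -> int:
    """Sequential forward passes needed to decode one image token-by-token."""
    return (height // patch) * (width // patch)

def attention_pairs(n_tokens: int) -> int:
    """Total causal-attention token pairs summed over all decode steps."""
    return sum(t + 1 for t in range(n_tokens))  # step t attends to t + 1 tokens

if __name__ == "__main__":
    n = autoregressive_steps(256, 256, 16)  # 256 sequential steps per image
    print(n, attention_pairs(n))            # total attention work grows quadratically with n
```

Each generated token requires a full forward pass, so decoding is inherently sequential, and total attention work grows quadratically with token count; a diffusion transformer instead denoises all tokens in parallel at each step, which is one motivation for diffusion-based designs such as LaVin-DiT.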
@@ -462,7 +462,7 @@ <h1 class="title is-1 mathvista_other" id="citation">Citation</h1>
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<pre><code>
- @article{run2024mmevol,
+ @article{wang2024lavin,
title={LaVin-DiT: Large Vision Diffusion Transformer},
author={Zhaoqing Wang, Xiaobo Xia, Runnan Chen, Dongdong Yu, Changhu Wang, Mingming Gong, Tongliang Liu},
journal={arXiv preprint arXiv:2411.11505},
Binary file modified static/figs/exp5.png
Binary file modified static/figs/logo.png

0 comments on commit 7476363