This repo is the official implementation of our AAAI 2025 paper "TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings" (arXiv).
Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, in this paper we propose Text Guided LLaVA (TG-LLaVA), which optimizes VLMs by guiding the vision encoder with text, offering a new and orthogonal optimization direction. Specifically, inspired by the purpose-driven logic inherent in human behavior, we use learnable latent embeddings as a bridge to analyze the textual instruction and add the analysis results to the vision encoder as guidance, refining its features. Subsequently, another set of latent embeddings extracts additional detailed text-guided information from high-resolution local patches as auxiliary information. With this textual guidance, the vision encoder extracts text-related features, much as humans focus on the most relevant parts of an image when considering a question, which leads to better answers. Experiments on various datasets validate the effectiveness of the proposed method. Remarkably, without the need for additional training data, our method brings larger gains to the baseline (LLaVA-1.5) than other concurrent methods, and it consistently yields improvements across different settings.
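To make the idea concrete, below is a minimal conceptual sketch (not the official implementation) of how a set of learnable latent embeddings could cross-attend to text features and have the result added to the vision encoder's features as guidance; all module names, shapes, and the pooling choice are illustrative assumptions.

```python
# Conceptual sketch only: learnable latent embeddings attend to text features,
# and the aggregated result is added to the visual features as text guidance.
# Dimensions, pooling, and module structure are illustrative assumptions.
import torch
import torch.nn as nn

class TextGuidanceBlock(nn.Module):
    def __init__(self, dim=1024, num_latents=32, num_heads=8):
        super().__init__()
        # Learnable latent embeddings acting as a bridge to the instruction.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, vision_feats, text_feats):
        # vision_feats: (B, N_img, dim), text_feats: (B, N_txt, dim)
        batch = text_feats.size(0)
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Latents query the text features to distill instruction-relevant cues.
        guidance, _ = self.cross_attn(latents, text_feats, text_feats)
        # Pool the guidance and add it to every visual token (residual-style).
        guidance = self.proj(guidance.mean(dim=1, keepdim=True))
        return vision_feats + guidance

# Toy usage: guide 576 visual tokens with a 20-token instruction embedding.
block = TextGuidanceBlock()
vision = torch.randn(2, 576, 1024)
text = torch.randn(2, 20, 1024)
print(block(vision, text).shape)  # torch.Size([2, 576, 1024])
```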
- The 558K subset of the LAION-CC-SBU dataset and llava_v1_5_mix665k
You can follow LLaVA 1.5 to download them.
Our installation process is the same as that of LLaVA 1.5.
- Clone this repo:
```bash
git clone https://github.com/AIDC-AI/TG-LLaVA.git
cd TG-LLaVA
```
- Install requirements:
```bash
conda create -n tg_llava python=3.10 -y
conda activate tg_llava
pip install --upgrade pip
pip install -e .
```
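After installation, a quick sanity check (assuming a CUDA-capable machine) is to confirm that PyTorch can see your GPUs:

```python
# Quick post-install sanity check: verify the PyTorch build and GPU visibility.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())
```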
Because our improvements to LLaVA 1.5 are purely architectural, the training paradigm is identical to its two stages: (1) Feature alignment stage: using the 558K subset of the LAION-CC-SBU dataset we selected, we connect a frozen pre-trained vision encoder to a frozen language model (LLM); (2) Visual instruction tuning stage: using 150K GPT-generated multimodal instruction-following data, as well as approximately 515K VQA data from academic-oriented tasks, we train the model to follow multimodal instructions.
We train on 8 H100 GPUs with 80GB of memory each, the same memory capacity as the A100 80GB.
We use the same set of hyperparameters as LLaVA 1.5 in finetuning. The hyperparameters used in both pretraining and finetuning are provided below.
- Pretraining
| Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| LLaVA-v1.5-13B | 256 | 1e-3 | 1 | 2048 | 0 |
| Meta-Llama-3-8B-Instruct | 256 | 1e-3 | 1 | 2048 | 0 |
| Qwen2-7B-Instruct | 256 | 1e-3 | 1 | 2048 | 0 |
- Finetuning
| Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| LLaVA-v1.5-13B | 128 | 2e-5 | 1 | 2048 | 0 |
| Meta-Llama-3-8B-Instruct | 128 | 2e-5 | 1 | 2048 | 0 |
| Qwen2-7B-Instruct | 128 | 2e-5 | 1 | 2048 | 0 |
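As in LLaVA 1.5, the global batch size equals the per-device batch size times the gradient accumulation steps times the number of GPUs. The snippet below only sanity-checks that arithmetic for the 8-GPU setup above; the per-device values are illustrative assumptions, not necessarily the exact values in our scripts.

```python
# Sanity-check the effective (global) batch size for an 8-GPU run.
# The per-device batch sizes below are illustrative assumptions.
def global_batch_size(per_device: int, grad_accum: int, num_gpus: int = 8) -> int:
    return per_device * grad_accum * num_gpus

print(global_batch_size(per_device=32, grad_accum=1))  # 256 (pretraining)
print(global_batch_size(per_device=16, grad_accum=1))  # 128 (finetuning)
```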
- Pretraining
```bash
./scripts/train/pretrain_cross_zeroAdd.sh
```
Note that we have introduced an additional `--text_tower` field:

```bash
--text_tower=openai/clip-vit-large-patch14-336
```

You can change it to `google/siglip-so400m-patch14-384` for training. Change `--model_name_or_path` to specify the LLM to be used.
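For intuition about what the text tower provides, the sketch below loads the CLIP text encoder named above with Hugging Face Transformers and embeds an instruction; this is an illustrative sketch, not the repo's actual text-tower code.

```python
# Sketch: embed a textual instruction with the CLIP text encoder passed as
# --text_tower. Illustrative only; not the repo's actual text-tower module.
from transformers import CLIPTextModel, CLIPTokenizer

name = "openai/clip-vit-large-patch14-336"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_model = CLIPTextModel.from_pretrained(name)

inputs = tokenizer(["What is the man holding?"], return_tensors="pt", padding=True)
outputs = text_model(**inputs)
print(outputs.last_hidden_state.shape)  # per-token text features, hidden size 768
print(outputs.pooler_output.shape)      # pooled instruction feature
```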
- Finetuning
```bash
./scripts/train/finetune_cross_zeroAdd.sh
```
- Evaluation
Our evaluation results are obtained with VLMEvalKit; you only need to replace the LLaVA-1.5 model file to run the same evaluation pipeline. If you wish to run a separate evaluation, please refer to the Evaluation section of LLaVA 1.5, as our evaluation process is identical to theirs; simply modify the corresponding model file.
- Inference
We provide an inference wrapper in `tg_llava/deploy/runner.py`, which can be used as:

```bash
python tg_llava/deploy/runner.py
```
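If you prefer to trigger inference from your own Python script rather than the command line, one simple option is to invoke the runner as a subprocess; any additional CLI arguments depend on `runner.py` itself, so check that file for its options.

```python
# Launch the inference runner from another script via subprocess.
# Extra CLI arguments, if any, are defined in tg_llava/deploy/runner.py.
import subprocess

subprocess.run(["python", "tg_llava/deploy/runner.py"], check=True)
```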
If this project is helpful to you, please cite our paper:
```bibtex
@article{yan2024tg,
  title={TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings},
  author={Yan, Dawei and Li, Pengcheng and Li, Yang and Chen, Hao and Chen, Qingguo and Luo, Weihua and Dong, Wei and Yan, Qingsen and Zhang, Haokui and Shen, Chunhua},
  journal={arXiv preprint arXiv:2409.09564},
  year={2024}
}
```
The model is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).
We used compliance-checking algorithms during training to ensure the compliance of the trained model to the best of our ability. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.