This repo is the official implementation of our AAAI 2025 paper "TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings" (arXiv).
Currently, inspired by the success of vision-language models (VLMs), an increasing number of researchers are focusing on improving VLMs and have achieved promising results. However, most existing methods concentrate on optimizing the connector and enhancing the language model component, while neglecting improvements to the vision encoder itself. In contrast, in this paper we propose Text Guided LLaVA (TG-LLaVA), which optimizes VLMs by guiding the vision encoder with text, offering a new and orthogonal optimization direction. Specifically, inspired by the purpose-driven logic inherent in human behavior, we use learnable latent embeddings as a bridge to analyze the textual instruction and add the analysis results to the vision encoder as guidance, refining its features. Subsequently, another set of latent embeddings extracts additional detailed text-guided information from high-resolution local patches as auxiliary information. With this textual guidance, the vision encoder extracts text-related features, much as humans focus on the most relevant parts of an image when considering a question, which leads to better answers. Experiments on various datasets validate the effectiveness of the proposed method. Remarkably, without the need for additional training data, our method brings larger gains to the baseline (LLaVA-1.5) than other concurrent methods, and it consistently yields improvements across different settings.
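To make the idea concrete, below is a minimal conceptual sketch (not the official implementation) of how a set of learnable latent embeddings could cross-attend to text features and have the result added to the vision encoder's features as guidance; all module names, shapes, and the pooling choice are illustrative assumptions.

```python
# Conceptual sketch only: learnable latent embeddings attend to text features,
# and the aggregated result is added to the visual features as text guidance.
# Dimensions, pooling, and module structure are illustrative assumptions.
import torch
import torch.nn as nn

class TextGuidanceBlock(nn.Module):
    def __init__(self, dim=1024, num_latents=32, num_heads=8):
        super().__init__()
        # Learnable latent embeddings acting as a bridge to the instruction.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, vision_feats, text_feats):
        # vision_feats: (B, N_img, dim), text_feats: (B, N_txt, dim)
        batch = text_feats.size(0)
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Latents query the text features to distill instruction-relevant cues.
        guidance, _ = self.cross_attn(latents, text_feats, text_feats)
        # Pool the guidance and add it to every visual token (residual-style).
        guidance = self.proj(guidance.mean(dim=1, keepdim=True))
        return vision_feats + guidance

# Toy usage: guide 576 visual tokens with a 20-token instruction embedding.
block = TextGuidanceBlock()
vision = torch.randn(2, 576, 1024)
text = torch.randn(2, 20, 1024)
print(block(vision, text).shape)  # torch.Size([2, 576, 1024])
```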
- The 558K subset of the LAION-CC-SBU dataset and llava_v1_5_mix665k
You can follow LLaVA 1.5 to download them.
Our installation process is the same as that of LLaVA 1.5.
- Clone this repo:
```bash
git clone https://github.com/AIDC-AI/TG-LLaVA.git
cd TG-LLaVA
```
- Install requirements:
```bash
conda create -n tg_llava python=3.10 -y
conda activate tg_llava
pip install --upgrade pip
pip install -e .
```
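After installation, a quick sanity check (assuming a CUDA-capable machine) is to confirm that PyTorch can see your GPUs:

```python
# Quick post-install sanity check: verify the PyTorch build and GPU visibility.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())
```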
Because our improvements to LLaVA 1.5 are purely architectural, the training paradigm is identical to its two stages: (1) Feature alignment stage: using the 558K subset of the LAION-CC-SBU dataset we selected, we connect a frozen pre-trained vision encoder to a frozen language model (LLM); (2) Visual instruction tuning stage: using 150K GPT-generated multimodal instruction-following data, as well as approximately 515K VQA data from academic-oriented tasks, we train the model to follow multimodal instructions.
We train on 8 H100 GPUs with 80GB of memory each, the same memory capacity as the A100 80GB.
We use the same set of hyperparameters as LLaVA 1.5 in finetuning. The hyperparameters used in both pretraining and finetuning are provided below.
- Pretraining
| Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| LLaVA-v1.5-13B | 256 | 1e-3 | 1 | 2048 | 0 |
| Meta-Llama-3-8B-Instruct | 256 | 1e-3 | 1 | 2048 | 0 |
| Qwen2-7B-Instruct | 256 | 1e-3 | 1 | 2048 | 0 |
- Finetuning
| Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| LLaVA-v1.5-13B | 128 | 2e-5 | 1 | 2048 | 0 |
| Meta-Llama-3-8B-Instruct | 128 | 2e-5 | 1 | 2048 | 0 |
| Qwen2-7B-Instruct | 128 | 2e-5 | 1 | 2048 | 0 |
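As in LLaVA 1.5, the global batch size equals the per-device batch size times the gradient accumulation steps times the number of GPUs. The snippet below only sanity-checks that arithmetic for the 8-GPU setup above; the per-device values are illustrative assumptions, not necessarily the exact values in our scripts.

```python
# Sanity-check the effective (global) batch size for an 8-GPU run.
# The per-device batch sizes below are illustrative assumptions.
def global_batch_size(per_device: int, grad_accum: int, num_gpus: int = 8) -> int:
    return per_device * grad_accum * num_gpus

print(global_batch_size(per_device=32, grad_accum=1))  # 256 (pretraining)
print(global_batch_size(per_device=16, grad_accum=1))  # 128 (finetuning)
```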
- Pretraining
```bash
./scripts/train/pretrain_cross_zeroAdd.sh
```
Note that we have introduced an additional `--text_tower` field:

```bash
--text_tower=openai/clip-vit-large-patch14-336
```

You can change it to `google/siglip-so400m-patch14-384` for training. Change `--model_name_or_path` to specify the LLM to be used.
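For intuition about what the text tower provides, the sketch below loads the CLIP text encoder named above with Hugging Face Transformers and embeds an instruction; this is an illustrative sketch, not the repo's actual text-tower code.

```python
# Sketch: embed a textual instruction with the CLIP text encoder passed as
# --text_tower. Illustrative only; not the repo's actual text-tower module.
from transformers import CLIPTextModel, CLIPTokenizer

name = "openai/clip-vit-large-patch14-336"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_model = CLIPTextModel.from_pretrained(name)

inputs = tokenizer(["What is the man holding?"], return_tensors="pt", padding=True)
outputs = text_model(**inputs)
print(outputs.last_hidden_state.shape)  # per-token text features, hidden size 768
print(outputs.pooler_output.shape)      # pooled instruction feature
```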
- Finetuning
```bash
./scripts/train/finetune_cross_zeroAdd.sh
```
- Evaluation
Our evaluation results are obtained with VLMEvalKit; you only need to replace the LLaVA-1.5 model file to run the same evaluation pipeline. If you wish to run a separate evaluation, please refer to the Evaluation section of LLaVA 1.5, as our evaluation process is identical to theirs; simply modify the corresponding model file.
- Inference
We provide an inference wrapper in `tg_llava/deploy/runner.py`, which can be used as:

```bash
python tg_llava/deploy/runner.py
```
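If you prefer to trigger inference from your own Python script rather than the command line, one simple option is to invoke the runner as a subprocess; any additional CLI arguments depend on `runner.py` itself, so check that file for its options.

```python
# Launch the inference runner from another script via subprocess.
# Extra CLI arguments, if any, are defined in tg_llava/deploy/runner.py.
import subprocess

subprocess.run(["python", "tg_llava/deploy/runner.py"], check=True)
```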
If this project is helpful to you, please cite our paper:
```bibtex
@article{yan2024tg,
  title={TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings},
  author={Yan, Dawei and Li, Pengcheng and Li, Yang and Chen, Hao and Chen, Qingguo and Luo, Weihua and Dong, Wei and Yan, Qingsen and Zhang, Haokui and Shen, Chunhua},
  journal={arXiv preprint arXiv:2409.09564},
  year={2024}
}
```
The model is licensed under the Apache License, Version 2.0 (SPDX-License-Identifier: Apache-2.0).
We used compliance-checking algorithms during training to ensure the compliance of the trained model to the best of our ability. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.