We provide two multimodal large language models, Hunyuan-Captioner and Dialogen. The former provides fine-grained text descriptions for training data, while the latter enhances the user's prompt input during inference and supports multi-turn text-to-image generation.
Hunyuan-Captioner meets the need of text-to-image techniques by maintaining a high degree of image-text consistency. It can generate high-quality image descriptions from a variety of angles, including object description, objects relationships, background information, image style, etc. Our code is based on LLaVA implementation.
a. Install dependencies
The dependencies and installation are basically the same as the base model.
b. Model download
# Use the huggingface-cli tool to download the model.
huggingface-cli download Tencent-Hunyuan/HunyuanCaptioner --local-dir ./ckpts/captioner
Our model supports three different modes including: directly generating Chinese caption, generating Chinese caption based on specific knowledge, and directly generating English caption. The injected information can be either accurate cues or noisy labels (e.g., raw descriptions crawled from the internet). The model is capable of generating reliable and accurate descriptions based on both the inserted information and the image content.
Mode | Prompt Template | Description |
---|---|---|
caption_zh | 描述这张图片 | Caption in Chinese |
insert_content | 根据提示词“{}”,描述这张图片 | Caption with inserted knowledge |
caption_en | Please describe the content of this image | Caption in English |
a. Single picture inference in Chinese
python mllm/caption_demo.py --mode "caption_zh" --image_file "mllm/images/demo1.png" --model_path "./ckpts/captioner"
b. Insert specific knowledge into caption
python mllm/caption_demo.py --mode "insert_content" --content "宫保鸡丁" --image_file "mllm/images/demo2.png" --model_path "./ckpts/captioner"
c. Single picture inference in English
python mllm/caption_demo.py --mode "caption_en" --image_file "mllm/images/demo3.png" --model_path "./ckpts/captioner"
d. Multiple pictures inference in Chinese
### Convert multiple pictures to csv file.
python mllm/make_csv.py --img_dir "mllm/images" --input_file "mllm/images/demo.csv"
### Multiple pictures inference
python mllm/caption_demo.py --mode "caption_zh" --input_file "mllm/images/demo.csv" --output_file "mllm/images/demo_res.csv" --model_path "./ckpts/captioner"
(Optional) To convert the output csv file to Arrow format, please refer to Data Preparation #3 for detailed instructions.
To launch a Gradio demo locally, please execute the following commands sequentially. Ensure each command is running in the background. For more detailed instructions, please refer to LLaVA.
cd mllm
python -m llava.serve.controller --host 0.0.0.0 --port 10000
python -m llava.serve.gradio_web_server --controller http://0.0.0.0:10000 --model-list-mode reload --port 443
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://0.0.0.0:10000 --port 40000 --worker http://0.0.0.0:40000 --model-path "../ckpts/captioner" --model-name LlavaMistral
Then the demo can be accessed through http://0.0.0.0:443. It should be noted that the 0.0.0.0 here needs to be X.X.X.X with your server IP.
We additionally provide inference commands for DialogGen.
cd HunyuanDiT
python mllm/dialoggen_demo.py --prompt "画一只小猫"
# Start a multi-turn T2I generation UI.
# If your GPU memory is less than 32GB, use '--load-4bit' to enable 4-bit quantization, which requires at least 22GB of memory.
python app/multiTurnT2I_app.py
We provide TensorRT-LLM (precision: int8 weight-only) version of Hunyuan-Captioner for inference acceleration(for Linux). See: Acceleration