HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction
Chen Bao, Jiarui Xu, Xiaolong Wang†, Abhinav Gupta†, Homanga Bharadhwaj†
† Equal Advising
HandsOnVLM is a novel vision-language model for hand-object interaction prediction. This repo contains the training and inference code for HandsOnVLM.
- Clone the repo and create a conda env with all the Python dependencies.
git clone [email protected]:Kami-code/HandsOnVLM-release.git
cd HandsOnVLM-release
conda create -n handsonvlm python=3.10
conda activate handsonvlm
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.4 cuda -c pytorch -c nvidia
pip install -e .
pip install flash-attn==2.6.3 --no-build-isolation
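After installation, you can optionally sanity-check the environment (assuming the conda env above is active; the printed versions depend on your setup):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"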
The file structure is listed as follows:
- `hoi_forecast/`: dataset structure and helper functions for handsonvlm and hoi-forecast
- `handsonvlm/`: model and training code for handsonvlm
- `lita/`: model and related code for lita
- `llava/`: model and related code for llava
See Preparing Datasets for HandsOnVLM.
Model Name | LLM version | Weights |
---|---|---|
HandsOnVLM-7B | Vicuna-7B-v1.3 | Link |
HandsOnVLM-13B | Vicuna-13B-v1.3 | Link |
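The released weights can be fetched from the linked Hugging Face pages, for example via `git clone` (the repository id below is a placeholder; substitute the actual path behind the corresponding Link):
git clone https://huggingface.co/<HandsOnVLM-7B-repo-id> ./checkpoints/handsonvlm-7b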
The HandsOnVLM model uses only one-stage supervised fine-tuning. The linear projection is initialized from the LLaVA pretrained weights. The training uses 8 H100 GPUs with 80GB memory.
git clone https://huggingface.co/lmsys/vicuna-13b-v1.3
git clone https://huggingface.co/liuhaotian/llava-pretrain-vicuna-13b-v1.3
mv vicuna-13b-v1.3 vicuna-v1-3-13b
mv llava-pretrain-vicuna-13b-v1.3 llava-vicuna-v1-3-13b-pretrain
Similarly, for the 7B checkpoints, replace `13b` with `7b` in the above commands.
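Spelled out, the 7B variant of these commands is:
git clone https://huggingface.co/lmsys/vicuna-7b-v1.3
git clone https://huggingface.co/liuhaotian/llava-pretrain-vicuna-7b-v1.3
mv vicuna-7b-v1.3 vicuna-v1-3-7b
mv llava-pretrain-vicuna-7b-v1.3 llava-vicuna-v1-3-7b-pretrain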
The HandsOnVLM model can be trained using the supervised fine-tuning script `scripts/finetune.sh`. To co-train with LITA's tasks, also update the LITA dataset directory (`--data_path`) and the checkpoint directory (`./checkpoints`).
sh scripts/finetune.sh
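For example, to keep a log of the training run (the `tee` redirection is just a convenience and is not required by the script):
sh scripts/finetune.sh 2>&1 | tee finetune.log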
We provide the evaluation pipeline for the Reasoning-based EPIC-KITCHENS-100 dataset. To evaluate on the vanilla hand prediction task:
python -m handsonvlm.evaluation.evaluate --model-path ./checkpoints/handsonvlm-7b
To evaluate on the reasoning-based task, add the `--use_reason` flag:
python -m handsonvlm.evaluation.evaluate --model-path ./checkpoints/handsonvlm-7b --use_reason
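The same pipeline applies to the 13B checkpoint by pointing `--model-path` at it (assuming it is stored under `./checkpoints/handsonvlm-13b`):
python -m handsonvlm.evaluation.evaluate --model-path ./checkpoints/handsonvlm-13b
python -m handsonvlm.evaluation.evaluate --model-path ./checkpoints/handsonvlm-13b --use_reason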
We provide a script for chatting with HandsOnVLM using images or videos.
python -m handsonvlm.evaluation.chat --model-path ./checkpoints/handsonvlm-7b --visual-path <path-to-HandsOnVLM-release>/docs/epic_kitchen.jpg
Human: Where should my hand move to if I want to reach the oven?
The response could be:
Agent: response: Certainly! The hand trajectory for reach the oven is as follows: <hand_traj> <hand_traj> <hand_traj> <hand_traj> .</s>
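Chatting over a video works the same way: pass a video file to `--visual-path` (the path below is a placeholder for your own clip):
python -m handsonvlm.evaluation.chat --model-path ./checkpoints/handsonvlm-7b --visual-path <path-to-your-video>.mp4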
If you find the contents of this repository helpful, please consider citing our paper:
@misc{bao2024handsonvlmvisionlanguagemodelshandobject,
title={HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction},
author={Chen Bao and Jiarui Xu and Xiaolong Wang and Abhinav Gupta and Homanga Bharadhwaj},
year={2024},
eprint={2412.13187},
archivePrefix={arXiv},
primaryClass={cs.CV},
}
This repository builds upon code from LITA and hoi-forecast. We thank the authors of these papers for open-sourcing their code.