HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction
Chen Bao, Jiarui Xu, Xiaolong Wang†, Abhinav Gupta†, Homanga Bharadhwaj†
† Equal Advising
HandsOnVLM is a novel vision-language model for hand-object interaction prediction. This repo contains the training and inference code for HandsOnVLM.
- Clone the repo and create a conda env with all the Python dependencies.
git clone [email protected]:Kami-code/HandsOnVLM-release.git
cd HandsOnVLM-release
conda create -n handsonvlm python=3.10
conda activate handsonvlm
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.4 cuda -c pytorch -c nvidia
pip install -e .
pip install flash-attn==2.6.3 --no-build-isolation
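After installation, you can optionally sanity-check the environment (assuming the conda env above is active; the printed versions depend on your setup):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"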
The file structure is listed as follows:
- `hoi_forecast/`: dataset structure and helper functions for handsonvlm and hoi-forecast
- `handsonvlm/`: model and training code for handsonvlm
- `lita/`: model and related code for lita
- `llava/`: model and related code for llava
See Preparing Datasets for HandsOnVLM.
Model Name | LLM version | Weights |
---|---|---|
HandsOnVLM-7B | Vicuna-7B-v1.3 | Link |
HandsOnVLM-13B | Vicuna-13B-v1.3 | Link |
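The released weights can be fetched from the linked Hugging Face pages, for example via `git clone` (the repository id below is a placeholder; substitute the actual path behind the corresponding Link):
git clone https://huggingface.co/<HandsOnVLM-7B-repo-id> ./checkpoints/handsonvlm-7b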
The HandsOnVLM model uses only one-stage supervised fine-tuning. The linear projection is initialized from the LLaVA pretrained weights. The training uses 8 H100 GPUs with 80GB memory.
git clone https://huggingface.co/lmsys/vicuna-13b-v1.3
git clone https://huggingface.co/liuhaotian/llava-pretrain-vicuna-13b-v1.3
mv vicuna-13b-v1.3 vicuna-v1-3-13b
mv llava-pretrain-vicuna-13b-v1.3 llava-vicuna-v1-3-13b-pretrain
Similarly, for the 7B checkpoints, replace `13b` with `7b` in the above commands.
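Spelled out, the 7B variant of these commands is:
git clone https://huggingface.co/lmsys/vicuna-7b-v1.3
git clone https://huggingface.co/liuhaotian/llava-pretrain-vicuna-7b-v1.3
mv vicuna-7b-v1.3 vicuna-v1-3-7b
mv llava-pretrain-vicuna-7b-v1.3 llava-vicuna-v1-3-7b-pretrain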
The HandsOnVLM model can be trained using the supervised fine-tuning script `scripts/finetune.sh`. To co-train with LITA's tasks, also update the LITA dataset directory (`--data_path`) and the checkpoint directory (`./checkpoints`).
sh scripts/finetune.sh
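For example, to keep a log of the training run (the `tee` redirection is just a convenience and is not required by the script):
sh scripts/finetune.sh 2>&1 | tee finetune.log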
We provide the evaluation pipeline for the Reasoning-based EPIC-KITCHENS-100 dataset. To evaluate on the vanilla hand prediction task:
python -m handsonvlm.evaluation.evaluate --model-path ./checkpoints/handsonvlm-7b
To evaluate on the reasoning-based task, add the `--use_reason` flag:
python -m handsonvlm.evaluation.evaluate --model-path ./checkpoints/handsonvlm-7b --use_reason
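The same pipeline applies to the 13B checkpoint by pointing `--model-path` at it (assuming it is stored under `./checkpoints/handsonvlm-13b`):
python -m handsonvlm.evaluation.evaluate --model-path ./checkpoints/handsonvlm-13b
python -m handsonvlm.evaluation.evaluate --model-path ./checkpoints/handsonvlm-13b --use_reason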
We provide a script for chatting with HandsOnVLM using images or videos.
python -m handsonvlm.evaluation.chat --model-path ./checkpoints/handsonvlm-7b --visual-path <path-to-HandsOnVLM-release>/docs/epic_kitchen.jpg
Human: Where should my hand move to if I want to reach the oven?
The response could be:
Agent: response: Certainly! The hand trajectory for reach the oven is as follows: <hand_traj> <hand_traj> <hand_traj> <hand_traj> .</s>
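Chatting over a video works the same way: pass a video file to `--visual-path` (the path below is a placeholder for your own clip):
python -m handsonvlm.evaluation.chat --model-path ./checkpoints/handsonvlm-7b --visual-path <path-to-your-video>.mp4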
If you find the contents of this repository helpful, please consider citing our paper:
@misc{bao2024handsonvlmvisionlanguagemodelshandobject,
title={HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction},
author={Chen Bao and Jiarui Xu and Xiaolong Wang and Abhinav Gupta and Homanga Bharadhwaj},
year={2024},
eprint={2412.13187},
archivePrefix={arXiv},
primaryClass={cs.CV},
}
This repository builds upon code from LITA and hoi-forecast. We thank the authors of these papers for open-sourcing their code.