HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction

Chen Bao, Jiarui Xu, Xiaolong Wang†, Abhinav Gupta†, Homanga Bharadhwaj

† Equal Advising

HandsOnVLM is a novel vision-language model for hand-object interaction prediction. This repo contains the training and inference code for HandsOnVLM.

HandsOnVLM Teaser

Installation

  1. Clone the repo and create a conda env with all the Python dependencies.
git clone [email protected]:Kami-code/HandsOnVLM-release.git
cd HandsOnVLM-release
conda create -n handsonvlm python=3.10
conda activate handsonvlm
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.4 cuda -c pytorch -c nvidia
pip install -e .
pip install flash-attn==2.6.3 --no-build-isolation
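
After installation, an optional sanity check can confirm that PyTorch sees the GPU and that flash-attn imports correctly. This is a minimal sketch and not part of the official setup:

# Optional environment sanity check (not part of the official setup).
import torch

print("torch:", torch.__version__)                 # expected 2.4.1
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)   # expected 2.6.3
except ImportError:
    print("flash-attn missing; run: pip install flash-attn==2.6.3 --no-build-isolation")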

File Structure

The file structure is listed as follows:

hoi_forecast/: dataset structure and helper functions for handsonvlm and hoi-forecast

handsonvlm/: model and training code for handsonvlm

lita/: model and related code for lita

llava/: model and related code for llava

Dataset

See Preparing Datasets for HandsOnVLM.

Weights

Model Name     | LLM version     | Weights
HandsOnVLM-7B  | Vicuna-7B-v1.3  | Link
HandsOnVLM-13B | Vicuna-13B-v1.3 | Link
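
Assuming the released weights are hosted on Hugging Face (the Link targets are not shown here), they could be fetched programmatically with huggingface_hub; the repo id below is a placeholder, not a verified id:

# Hedged sketch: download released weights, assuming a Hugging Face hosting.
# The repo_id is a placeholder -- replace it with the actual id behind "Link".
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ORG/handsonvlm-7b",              # placeholder repo id
    local_dir="./checkpoints/handsonvlm-7b",  # matches the paths used below
)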

Train

The HandsOnVLM model uses only one-stage supervised fine-tuning. The linear projection is initialized from the LLaVA pretrained weights. Training uses 8 H100 GPUs with 80GB memory each.

Prepare public checkpoints from Vicuna and LLaVA:

git clone https://huggingface.co/lmsys/vicuna-13b-v1.3
git clone https://huggingface.co/liuhaotian/llava-pretrain-vicuna-13b-v1.3
mv vicuna-13b-v1.3 vicuna-v1-3-13b
mv llava-pretrain-vicuna-13b-v1.3 llava-vicuna-v1-3-13b-pretrain

Similarly for the 7B checkpoints, replace 13b with 7b in the above commands.
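
As a quick check that the renamed directories are in place before training, a minimal sketch (the expected location relative to the repo root is an assumption based on the commands above):

# Verify the renamed checkpoint directories exist; adjust the base path if you
# cloned the checkpoints elsewhere.
import os

for name in ("vicuna-v1-3-13b", "llava-vicuna-v1-3-13b-pretrain"):
    print(name, "found" if os.path.isdir(name) else "MISSING")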

Supervised Fine-tuning

The HandsOnVLM model can be trained with the supervised fine-tuning script below. To co-train on LITA's tasks, also update the LITA dataset directory (--data_path) and the checkpoint directory (./checkpoints).

sh scripts/finetune.sh

Evaluation

We provide the evaluation pipeline for the Reasoning-based EPIC-KITCHENS-100 dataset.

python -m handsonvlm.evaluation.evaluate --model-path ./checkpoints/handsonvlm-7b
python -m handsonvlm.evaluation.evaluate --model-path ./checkpoints/handsonvlm-7b --use_reason
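
For reference, trajectory forecasting is commonly scored with displacement errors over the predicted waypoints; the sketch below is a generic ADE/FDE computation, not the metric code inside handsonvlm.evaluation.evaluate, and the exact metric choice here is an assumption:

# Generic displacement-error sketch for 2D hand waypoints (illustration only).
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: arrays of shape (T, 2) with predicted / ground-truth waypoints."""
    dists = np.linalg.norm(pred - gt, axis=-1)   # per-step Euclidean error
    return dists.mean(), dists[-1]               # ADE = mean error, FDE = final-step error

pred = np.array([[0.40, 0.52], [0.43, 0.55], [0.47, 0.60]])
gt   = np.array([[0.41, 0.50], [0.45, 0.56], [0.50, 0.62]])
print(ade_fde(pred, gt))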

CLI Inference

We provide a script for chatting with HandsOnVLM using images or videos.

python -m handsonvlm.evaluation.chat --model-path ./checkpoints/handsonvlm-7b --visual-path <path-to-HandsOnVLM-release>/docs/epic_kitchen.jpg
Human: Where should my hand move to if I want to reach the oven?

The response could be:

Agent: response: Certainly! The hand trajectory for reach the oven is as follows: <hand_traj> <hand_traj> <hand_traj> <hand_traj> .</s>
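
The <hand_traj> tokens in the response correspond to the predicted future hand trajectory. The sketch below illustrates one plausible way such special-token hidden states could be decoded into 2D waypoints; the decoder head, shapes, and normalization are assumptions for illustration, not the actual HandsOnVLM implementation:

# Hypothetical illustration: map <hand_traj> token hidden states to 2D waypoints
# with a small MLP head. Not the code used in this repository.
import torch
import torch.nn as nn

hidden_dim = 4096                       # assumed LLM hidden size (Vicuna-7B)
traj_head = nn.Sequential(              # hypothetical trajectory decoder head
    nn.Linear(hidden_dim, 256),
    nn.GELU(),
    nn.Linear(256, 2),                  # (x, y) per token
)

hand_traj_hidden = torch.randn(4, hidden_dim)       # hidden states of 4 <hand_traj> tokens
waypoints = traj_head(hand_traj_hidden).sigmoid()   # 4 predicted (x, y) points in [0, 1]
print(waypoints.shape)                               # torch.Size([4, 2])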

Bibtex

If the contents of this repository are helpful, please consider citing our paper:

@misc{bao2024handsonvlmvisionlanguagemodelshandobject,
      title={HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction}, 
      author={Chen Bao and Jiarui Xu and Xiaolong Wang and Abhinav Gupta and Homanga Bharadhwaj},
      year={2024},
      eprint={2412.13187},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}

Acknowledgements

This repository builds upon code from LITA and hoi-forecast. We thank the authors of these papers for open-sourcing their code.
