Wenlong Huang1, Chen Wang1, Ruohan Zhang1, Yunzhu Li1,2, Jiajun Wu1, Li Fei-Fei1
1Stanford University, 2University of Illinois Urbana-Champaign
This is the official demo code for VoxPoser, a method that uses large language models and vision-language models to zero-shot synthesize trajectories for manipulation tasks.
In this repo, we provide the implementation of VoxPoser in RLBench as its task diversity best resembles our real-world setup. Note that VoxPoser is a zero-shot method that does not require any training data. Therefore, the main purpose of this repo is to provide a demo implementation rather than an evaluation benchmark.
Note: This codebase currently does not contain the perception pipeline used in our real-world experiments, which produces a real-time mapping from object names to object masks. Instead, it uses the object masks provided as part of RLBench's `get_observation` function. If you are interested in deploying the code on a real robot, you may find more information in the section Real World Deployment.
If you find this work useful in your research, please cite using the following BibTeX:
@article{huang2023voxposer,
title={VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models},
author={Huang, Wenlong and Wang, Chen and Zhang, Ruohan and Li, Yunzhu and Wu, Jiajun and Fei-Fei, Li},
journal={arXiv preprint arXiv:2307.05973},
year={2023}
}
Note that this codebase is best run with a display. For running in headless mode, refer to the instructions in RLBench.
- Create a conda environment:
  conda create -n voxposer-env python=3.9
  conda activate voxposer-env
- See Instructions to install PyRep and RLBench (Note: install these inside the created conda environment).
- Install other dependencies:
  pip install -r requirements.txt
- Obtain an OpenAI API key, and put it inside the first cell of the demo notebook (a minimal sketch of that cell is shown after this list).
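For reference, a minimal first cell might look like the sketch below. The exact variable name the notebook expects may differ, so check `src/playground.ipynb`; this sketch assumes the 2023-era `openai` package interface.

```python
# Illustrative sketch of the notebook's first cell; the actual cell may use a
# different variable name -- check src/playground.ipynb.
import openai

openai.api_key = "sk-..."  # paste your OpenAI API key here; keep it out of version control
```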
Demo code is at `src/playground.ipynb`. Instructions can be found in the notebook.
Core to VoxPoser:
- `playground.ipynb`: Playground for VoxPoser.
- `LMP.py`: Implementation of Language Model Programs (LMPs) that recursively generate code to decompose instructions and compose value maps for each sub-task.
- `interfaces.py`: Interface that provides necessary APIs for language models (i.e., LMPs) to operate in voxel space and to invoke the motion planner.
- `planners.py`: Implementation of a greedy planner that plans a trajectory (represented as a series of waypoints) for an entity/movable given a value map (a sketch follows this list).
- `controllers.py`: Given a waypoint for an entity/movable, the controller applies (a series of) robot actions to achieve the waypoint.
- `dynamics_models.py`: Environment dynamics model for the case where the entity/movable is an object or object part. This is used in `controllers.py` to perform MPC (a sketch follows this list).
- `prompts/rlbench`: Prompts used by the different Language Model Programs (LMPs) in VoxPoser.
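To give a concrete feel for what `planners.py` does, below is a minimal, hypothetical sketch of greedy planning over a 3D value map: starting from the entity's current voxel, repeatedly step to the lowest-cost neighboring voxel until no neighbor improves. Function and variable names are illustrative, not the repo's actual interface.

```python
import numpy as np

def greedy_plan(value_map: np.ndarray, start: tuple, max_steps: int = 300):
    """Greedily descend a 3D cost map (lower = better) starting from `start`.

    Returns a list of voxel coordinates serving as trajectory waypoints.
    Illustrative sketch only; the actual planner additionally handles
    smoothing, collision costs, and closed-loop re-planning.
    """
    # 26-connected neighborhood offsets.
    offsets = np.array([(dx, dy, dz)
                        for dx in (-1, 0, 1)
                        for dy in (-1, 0, 1)
                        for dz in (-1, 0, 1)
                        if (dx, dy, dz) != (0, 0, 0)])
    path = [tuple(start)]
    current = np.array(start)
    for _ in range(max_steps):
        neighbors = current + offsets
        # Keep neighbors that fall inside the voxel grid.
        valid = np.all((neighbors >= 0) & (neighbors < value_map.shape), axis=1)
        neighbors = neighbors[valid]
        best = neighbors[np.argmin(value_map[tuple(neighbors.T)])]
        # Stop when no neighbor improves on the current voxel.
        if value_map[tuple(best)] >= value_map[tuple(current)]:
            break
        current = best
        path.append(tuple(current))
    return path
```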
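Similarly, the relationship between `controllers.py` and `dynamics_models.py` can be summarized as a sampling-based MPC loop: sample candidate action sequences, roll each out through the dynamics model, score by distance to the target waypoint, and execute only the best first action before re-planning. This is a simplified sketch under assumed names (`dynamics_model`, `mpc_step`), not the repo's actual interface.

```python
import numpy as np

def mpc_step(dynamics_model, state, target_waypoint,
             num_samples: int = 64, horizon: int = 5, action_scale: float = 0.02):
    """One step of random-shooting MPC toward `target_waypoint`.

    `dynamics_model(state, action)` is assumed to return the predicted next
    state (e.g., the object pose after a push); all names are illustrative.
    """
    best_cost, best_first_action = np.inf, None
    for _ in range(num_samples):
        # Sample a random action sequence (e.g., small end-effector displacements).
        actions = np.random.uniform(-action_scale, action_scale, size=(horizon, 3))
        sim_state = state
        for a in actions:
            sim_state = dynamics_model(sim_state, a)
        cost = np.linalg.norm(np.asarray(sim_state) - np.asarray(target_waypoint))
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    # Execute only the first action, then re-plan at the next step (MPC).
    return best_first_action
```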
Environment and utilities:
- `envs/`
  - `rlbench_env.py`: Wrapper of the RLBench env that exposes useful functions for VoxPoser.
  - `task_object_names.json`: Mapping of object names exposed to VoxPoser to their corresponding scene object names for each individual task.
- `configs/rlbench_config.yaml`: Config file for all the involved modules in the RLBench environment.
- `arguments.py`: Argument parser for the config file.
- `LLM_cache.py`: Caching of language model outputs that writes to disk to save cost and time (a sketch follows this list).
- `utils.py`: Utility functions.
- `visualizers.py`: A Plotly-based visualizer for value maps and planned trajectories (a sketch follows this list).
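As an illustration of the disk-backed caching idea behind `LLM_cache.py` (the actual implementation may differ), a minimal version keys each response by a hash of the full request and persists it with `pickle`:

```python
import hashlib
import os
import pickle

class DiskCache:
    """Minimal sketch of a disk-backed cache for LLM responses.

    Keys are a hash of the full request (model, prompt, sampling parameters),
    so repeated queries hit the disk instead of the API.
    """

    def __init__(self, cache_dir: str = "cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, request: dict) -> str:
        key = hashlib.sha256(pickle.dumps(sorted(request.items()))).hexdigest()
        return os.path.join(self.cache_dir, key + ".pkl")

    def get(self, request: dict):
        path = self._path(request)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        return None

    def put(self, request: dict, response) -> None:
        with open(self._path(request), "wb") as f:
            pickle.dump(response, f)
```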
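And for a rough idea of how a value map can be rendered with Plotly (the repo's `visualizers.py` is more elaborate and also draws planned trajectories), a translucent volume plot of a 3D cost array looks roughly like this:

```python
import numpy as np
import plotly.graph_objects as go

def show_value_map(value_map: np.ndarray):
    """Render a 3D value map as a translucent Plotly volume (illustrative sketch)."""
    x, y, z = np.meshgrid(np.arange(value_map.shape[0]),
                          np.arange(value_map.shape[1]),
                          np.arange(value_map.shape[2]),
                          indexing="ij")
    fig = go.Figure(data=go.Volume(
        x=x.flatten(), y=y.flatten(), z=z.flatten(),
        value=value_map.flatten(),
        opacity=0.1,          # translucent so nested isosurfaces stay visible
        surface_count=15,     # number of isosurfaces to draw
    ))
    fig.show()
```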
To adapt the code to deploy on a real robot, most changes should only happen in the environment file (e.g., you can consider making a copy of `rlbench_env.py` and implementing the same APIs based on your perception and controller modules).
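As a starting point, such an environment file might follow the skeleton below. Every class and method name here is hypothetical; mirror whatever `rlbench_env.py` actually exposes and back each method with your own perception and control stack.

```python
import numpy as np

class RealWorldEnv:
    """Hypothetical skeleton of a real-robot environment for VoxPoser.

    Method names are placeholders; match them to the functions actually
    used by the rest of the codebase (see rlbench_env.py).
    """

    def __init__(self, perception, controller):
        self.perception = perception    # e.g., detector + segmenter + tracker
        self.controller = controller    # e.g., an OSC-based low-level controller

    def reset(self):
        """Reset the scene/controller and return an initial observation."""
        raise NotImplementedError

    def get_object_names(self) -> list:
        """Return the names of objects the LMPs are allowed to refer to."""
        raise NotImplementedError

    def get_obs_by_name(self, name: str) -> np.ndarray:
        """Return a point cloud (or mask) for the named object from perception."""
        raise NotImplementedError

    def apply_action(self, action: np.ndarray):
        """Send one action (e.g., an end-effector target) to the controller."""
        raise NotImplementedError
```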
Our perception pipeline consists of the following modules: OWL-ViT for open-vocabulary detection in the first frame, SAM for converting the produced bounding boxes to masks in the first frame, and XMem for tracking the masks over the subsequent frames. Nowadays, you may consider simplifying the pipeline by using only an open-vocabulary detector together with SAM 2 for segmentation and tracking. Our controller is based on the OSC (operational space control) implementation from Deoxys. More details can be found in the paper.
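For the first-frame detection and segmentation steps, a hedged sketch using the Hugging Face OWL-ViT checkpoint and the `segment_anything` package might look like the following; the checkpoint names and threshold are assumptions, and the XMem tracking step is omitted.

```python
import numpy as np
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection
from segment_anything import sam_model_registry, SamPredictor

def first_frame_masks(image: Image.Image, object_names,
                      sam_checkpoint: str = "sam_vit_h_4b8939.pth"):
    """Detect objects by name with OWL-ViT, then convert boxes to masks with SAM.

    Illustrative sketch only; checkpoints and thresholds are assumptions.
    """
    # Open-vocabulary detection with OWL-ViT.
    processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
    detector = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
    inputs = processor(text=[object_names], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = detector(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    detections = processor.post_process_object_detection(
        outputs, threshold=0.1, target_sizes=target_sizes)[0]

    # Box-prompted segmentation with SAM.
    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(np.array(image))

    masks = {}
    for box, label in zip(detections["boxes"], detections["labels"]):
        mask, _, _ = predictor.predict(box=box.numpy(), multimask_output=False)
        masks[object_names[int(label)]] = mask[0]  # boolean HxW mask
    return masks
```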
To avoid the compounding latency introduced by the different modules (especially the perception pipeline), you may also consider running a concurrent process that only performs tracking (a minimal sketch is shown below).
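One simple way to structure this is a dedicated tracking process that consumes camera frames from a queue and always publishes only the most recent masks. The sketch below uses a placeholder `track` callable standing in for the tracker (e.g., an XMem wrapper); it is an assumption, not the repo's interface.

```python
import multiprocessing as mp
import queue

def tracking_worker(track, frame_queue: mp.Queue, mask_queue: mp.Queue):
    """Run mask tracking in its own process so perception latency does not block
    planning and control. `track(frame) -> masks` is a placeholder for your
    tracker (e.g., a wrapper around XMem)."""
    while True:
        frame = frame_queue.get()
        if frame is None:  # sentinel: shut the worker down
            break
        masks = track(frame)
        # Drop any stale result so the consumer always reads the latest masks.
        try:
            while True:
                mask_queue.get_nowait()
        except queue.Empty:
            pass
        mask_queue.put(masks)
```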
- Environment is based on RLBench.
- Implementation of Language Model Programs (LMPs) is based on Code as Policies.
- Some code snippets are from Where2Act.