[Feature] SAC + reverse forward curriculum learning (#365)
* support using first env state and replaying actions on the GPU for CPU-GPU transfer

* work

* add offline buffer, better logging

* work

* Update sac.py

* fixes

* working baseline for difficult pick cube task (due to static requirement)

* partial reset in GPU sim support + partial reset support for RFCL + SAC

* bug fix

* bug fixes

* bug fixes

* bug fixes

* Update README.md

* add citations, docs

* docs

* docs

* reorganize code, note forward curr not done yet.

* fixes

* weighted trajectory sampling
StoneT2000 authored Jun 4, 2024
1 parent 993c56f commit 7033cb2
Showing 21 changed files with 1,083 additions and 57 deletions.
60 changes: 42 additions & 18 deletions docs/source/user_guide/datasets/replay.md
@@ -19,44 +19,68 @@ python -m mani_skill.trajectory.replay_trajectory -h
The script requires `trajectory.h5` and `trajectory.json` to be both under the same directory.
:::

The raw demonstration files contain all the necessary information (e.g. initial states, actions, seeds) to reproduce a trajectory. Observations are not included since they can lead to large file sizes without postprocessing. In addition, actions in these files do not cover all control modes. Therefore, you need to convert the raw files into your desired observation and control modes. We provide a utility script that works as follows:
By default, raw demonstration files contain all the necessary information (e.g. initial states, actions, seeds) to reproduce a trajectory. Observations are not included since they can lead to large file sizes without postprocessing. In addition, actions in these files do not cover all control modes. Therefore, you need to convert the raw files into your desired observation and control modes. We provide a utility script that works as follows:

```bash
# Replay demonstrations with control_mode=pd_joint_delta_pos
python -m mani_skill.trajectory.replay_trajectory \
--traj-path demos/rigid_body/PickCube-v1/trajectory.h5 \
--save-traj --target-control-mode pd_joint_delta_pos --obs-mode none --num-procs 10
--save-traj --target-control-mode pd_joint_delta_pos \
--obs-mode none --num-procs 10
```

<details>

<summary><b>Click here</b> for important notes about the script arguments.</summary>
:::{dropdown} Click here to see the replay trajectory tool options

Command Line Options:

- `--save-traj`: save the replayed trajectory to the same folder as the original trajectory file.
- `--num-procs=10`: split trajectories to multiple processes (e.g., 10 processes) for acceleration.
- `--target-control-mode`: The target control mode / action space to save into the trajectory file.
- `--save-video`: Whether to save a video of the replayed trajectories
- `--max-retry`: Max number of times to try and replay each trajectory
- `--discard-timeout`: Whether to discard trajectories that time out due to the default environment's max episode steps config
- `--allow-failure`: Whether to permit saving failed trajectories
- `--vis`: Whether to open the GUI and show the replayed trajectories on a display
- `--use-first-env-state`: Whether to use the first environment state of the given trajectory to initialize the environment
- `--num-procs=10`: split trajectories to multiple processes (e.g., 10 processes) for acceleration. Note this is done via CPU parallelization, not GPU. This argument is also currently incompatible with using the GPU simulation to replay trajectories.
- `--obs-mode=none`: specify the observation mode as `none`, i.e. not saving any observations.
- `--obs-mode=rgbd`: (not included in the script above) specify the observation mode as `rgbd` to replay the trajectory. If `--save-traj`, the saved trajectory will contain the RGBD observations. RGB images are saved as uint8 and depth images (multiplied by 1024) are saved as uint16.
- `--obs-mode=pointcloud`: (not included in the script above) specify the observation mode as `pointcloud`. We encourage you to further process the point cloud instead of using this point cloud directly.
- `--obs-mode=state`: (not included in the script above) specify the observation mode as `state`. Note that the `state` observation mode is not allowed for challenge submission.
- `--obs-mode=rgbd`: (not included in the script above) specify the observation mode as `rgbd` to replay the trajectory. If `--save-traj`, the saved trajectory will contain the RGBD observations.
- `--obs-mode=pointcloud`: (not included in the script above) specify the observation mode as `pointcloud`. We encourage you to further process the point cloud instead of using it directly (e.g. by sub-sampling the point cloud).
- `--obs-mode=state`: (not included in the script above) specify the observation mode as `state`
- `--use-env-states`: For each time step $t$, after replaying the action at this time step and obtaining a new observation at $t+1$, set the environment state at time $t+1$ to the recorded environment state at time $t+1$. This is necessary for successfully replaying trajectories for the tasks migrated from ManiSkill1 (a conceptual sketch of this is shown right after this options list).
</details>

<br>

- `--count`: Number of demonstrations to replay before exiting. By default all demonstrations are replayed
- `--shader`: Change the shader used for rendering. The default is 'default', which is very fast; 'rt' enables ray tracing for photo-realistic renders, and 'rt-fast' is a faster but lower-quality ray-traced renderer
- `--render-mode`: The render mode used when saving videos
- `-b, --sim-backend`: Which simulation backend to use. Can be 'auto', 'cpu', or 'gpu'
:::
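
For reference, below is a minimal conceptual sketch of what `--use-env-states` does during replay. The environment object, the `set_state` call, and the trajectory dictionary layout are illustrative assumptions for exposition, not the replay tool's internals.

```python
# Conceptual sketch of --use-env-states (illustrative only; not the replay tool's actual code).
# Assumes a gym-style env exposing set_state(), and a trajectory dict holding the recorded
# "actions", per-step "env_states", and reset "seed" (hypothetical layout).
def replay_with_env_states(env, trajectory):
    env.reset(seed=trajectory["seed"])  # reproduce the recorded initial conditions
    for t, action in enumerate(trajectory["actions"]):
        # Step with the recorded action to obtain the observation at t+1 ...
        obs, reward, terminated, truncated, info = env.step(action)
        # ... then overwrite the simulator with the recorded state at t+1 so that small
        # physics divergences cannot accumulate over the episode.
        env.set_state(trajectory["env_states"][t + 1])
    return info
```
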
<!--
:::{note}
For soft-body tasks, please compile and generate caches (`python -m mani_skill.utils.precompile_mpm`) before running the script with multiple processes (with `--num-procs`).
:::

::: -->
<!--
:::{caution}
The conversion between controllers (or action spaces) is not yet supported for mobile manipulators (e.g., used in tasks migrated from ManiSkill1).
:::
::: -->

:::{caution}
Since some demonstrations are collected in a non-quasi-static way (objects are not fixed relative to the manipulator during manipulation) for some challenging tasks (e.g., `TurnFaucet` and tasks migrated from ManiSkill1), replaying actions can fail due to non-determinism in simulation. Thus, replaying trajectories by environment states is required (passing `--use-env-states`).
:::

---
## Example Usages

We recommend using our script only for converting actions into different control modes without recording any observation information (i.e. passing `--obs-mode=none`). The reason is that (1) some observation modes, e.g. point cloud, can take too much space without any post-processing, e.g., point cloud downsampling; in addition, the `state` mode for soft-body tasks also has a similar issue, since the states of those tasks are particles. (2) Some algorithms (e.g. GAIL) require custom keys stored in the demonstration files, e.g. next-observation.
As the replay trajectory tool is fairly complex and feature-rich, we suggest a few example workflows below that may be useful for various use cases.

Thus we recommend that, after you convert actions into different control modes, implement your custom environment wrappers for observation processing. After this, use another script to render and save the corresponding post-processed visual demonstrations. [ManiSkill-Learn](https://github.com/haosulab/ManiSkill-Learn) has included such observation processing wrappers and demonstration conversion script (with multi-processing), so we recommend referring to the repo for more details.

### Replaying Trajectories collected in CPU/GPU sim to GPU/CPU sim

Some demonstrations may have been collected in the CPU simulation, but you may want data that works in the GPU simulation, or vice versa. CPU and GPU simulation inherently behave slightly differently given the same actions and the same start state.

For example, demos collected via teleoperation are often recorded in the CPU sim for flexibility and single-thread speed, whereas imitation/reinforcement learning workflows might train in the GPU simulation. To ensure the demos can be learned from, we can replay them in the GPU simulation and save the ones that replay successfully. This is done by using the first environment state, forcing the GPU simulation with `-b "gpu"`, and setting the desired control and observation modes.

```bash
python -m mani_skill.trajectory.replay_trajectory \
--traj-path path/to/trajectory.h5 \
--use-first-env-state -b "gpu" \
-c pd_joint_delta_pos -o state \
--save-traj
```
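
After conversion, you can quickly sanity check the saved file. The sketch below only assumes the HDF5 file contains one group per episode named `traj_<i>` with an `actions` dataset (the file name follows the convention used later in this commit; adjust the path and keys to match your output):

```python
# Minimal sanity check of a converted trajectory file (assumes a traj_<i>/actions layout).
import h5py

with h5py.File("path/to/trajectory.state.pd_joint_delta_pos.h5", "r") as f:
    episodes = [k for k in f.keys() if k.startswith("traj_")]
    print(f"{len(episodes)} episodes found")
    for k in episodes[:3]:
        actions = f[k]["actions"]
        print(k, "actions shape:", actions.shape)  # (episode_length, action_dim)
```
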
4 changes: 0 additions & 4 deletions docs/source/user_guide/index.md
@@ -40,11 +40,7 @@ tutorials/index
concepts/index
datasets/index
data_collection/index
```
<!-- algorithms_and_models/index
workflows/index -->


```{toctree}
:maxdepth: 2
9 changes: 7 additions & 2 deletions docs/source/user_guide/workflows/index.md
@@ -1,7 +1,12 @@
# Workflows

We provide a number of tuned baselines/workflows for robot learning and training autonomous policies to solve robotics tasks. These span learning from demonstrations/imitation learning and reinforcement learning.

This is still a WIP, but we plan to upload pretrained checkpoints, training curves, and more for all solvable tasks (some need much more advanced techniques to solve) so that people can research and compare against them.

```{toctree}
:titlesonly:
:glob:
*
learning_from_demos/index
reinforcement_learning/index
```
1 change: 0 additions & 1 deletion docs/source/user_guide/workflows/learning_from_demos.md

This file was deleted.

24 changes: 24 additions & 0 deletions docs/source/user_guide/workflows/learning_from_demos/index.md
@@ -0,0 +1,24 @@
# Learning from Demonstrations / Imitation Learning

We provide a number of different baselines spanning different categories of learning from demonstrations research: Behavior Cloning / Supervised Learning, Offline Reinforcement Learning, and Online Learning from Demonstrations.

As part of these baselines, we establish a few standard learning-from-demonstrations benchmarks that cover a wide range of difficulty (easy enough to solve for verification but not saturated) and a diversity of demonstration types (human-collected, motion-planning-collected, and neural-net-policy-generated).

**Behavior Cloning Baselines**
| Baseline | Code | Results |
| ---------------------------------- | ---- | ------- |
| Standard Behavior Cloning (BC) | WIP | WIP |
| Diffusion Policy (DP) | WIP | WIP |
| Action Chunk Transformers (ACT) | WIP | WIP |
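
Since the behavior cloning entries above are still WIP, here is a minimal sketch of what standard BC boils down to: regress demonstration actions from observations with supervised learning. This is plain PyTorch for exposition, not the upcoming baseline code.

```python
# Minimal behavior cloning sketch (illustrative; not the WIP baseline implementation).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def behavior_cloning(policy: nn.Module, demo_obs: torch.Tensor, demo_actions: torch.Tensor,
                     epochs: int = 100, batch_size: int = 256, lr: float = 3e-4) -> nn.Module:
    """Fit policy(obs) -> action on (obs, action) pairs extracted from demonstrations."""
    loader = DataLoader(TensorDataset(demo_obs, demo_actions), batch_size=batch_size, shuffle=True)
    optim = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, action in loader:
            loss = nn.functional.mse_loss(policy(obs), action)  # regress demo actions
            optim.zero_grad()
            loss.backward()
            optim.step()
    return policy
```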


**Online Learning from Demonstrations Baselines**

| Baseline | Code | Results | Paper |
| --------------------------------------------------- | ----------------------------------------------------------------------------------- | ------- | ---------------------------------------- |
| SAC+Reverse Forward Curriculum Learning (SAC+RFCL)* | [Link](https://github.com/haosulab/ManiSkill/blob/main/examples/baselines/sac-rfcl) | WIP | [Link](https://arxiv.org/abs/2405.03379) |
| Reinforcement Learning from Prior Data (RLPD) | WIP | WIP | [Link](https://arxiv.org/abs/2302.02948) |
| SAC + Demos (SAC+Demos) | WIP | N/A | |


\* - This indicates the baseline uses environment state reset
1 change: 0 additions & 1 deletion docs/source/user_guide/workflows/reinforcement_learning.md

This file was deleted.

15 changes: 15 additions & 0 deletions docs/source/user_guide/workflows/reinforcement_learning/index.md
@@ -0,0 +1,15 @@
# Reinforcement Learning (WIP)

We provide a number of different baselines that learn from rewards. For RL baselines that leverage demonstrations, see the [learning from demos section](../learning_from_demos/).

As part of these baselines, we establish a few reinforcement learning benchmarks that cover a wide range of difficulties (easy enough to solve for verification but not saturated) and a diversity of robotics tasks, including but not limited to classic control, dexterous manipulation, table-top manipulation, and mobile manipulation.
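
As a point of orientation, most of these baselines start from a vectorized ManiSkill environment created through gymnasium. The sketch below is an assumption-level example: the kwargs (`num_envs`, `obs_mode`, `control_mode`) mirror the CLI flags used elsewhere in these docs, but check each baseline script for the exact environment setup it uses.

```python
# Rough sketch of environment creation for RL training. The gym.make kwargs here are
# assumptions mirroring the CLI flags used in these docs; consult the baseline scripts
# for the exact setup.
import gymnasium as gym
import mani_skill.envs  # noqa: F401  (importing registers ManiSkill tasks with gymnasium)

env = gym.make(
    "PickCube-v1",
    num_envs=16,                 # parallel environments when using the GPU simulation
    obs_mode="state",
    control_mode="pd_joint_delta_pos",
)
obs, _ = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()  # replace with a trained policy
    obs, reward, terminated, truncated, info = env.step(action)
```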


**Online Reinforcement Learning Baselines**

| Baseline | Code | Results | Paper |
| ------------------------------------------------------------------- | ------------------------------------------------------------------------------ | ------- | ---------------------------------------- |
| Proximal Policy Optimization (PPO) | [Link](https://github.com/haosulab/ManiSkill/blob/main/examples/baselines/ppo) | WIP | [Link](http://arxiv.org/abs/1707.06347) |
| Soft Actor Critic (SAC) | [Link](https://github.com/haosulab/ManiSkill/blob/main/examples/baselines/sac) | WIP | [Link](https://arxiv.org/abs/1801.01290) |
| Temporal Difference Learning for Model Predictive Control (TD-MPC2) | WIP | WIP | [Link](https://arxiv.org/abs/2310.16828) |

25 changes: 24 additions & 1 deletion examples/baselines/ppo/README.md
@@ -144,4 +144,27 @@ This will use environment states to replay trajectories, turn on the ray-tracer
## Some Notes

- The code currently does not evaluate agents in the most accurate way: during GPU simulation, all assets are frozen per parallel environment (changing them slows training down). Thus, even though evaluation runs on multiple environments at once (8 by default), each environment always features the same set of geometry. This only affects tasks with geometry variation (e.g. PickClutterYCB, OpenCabinetDrawer). You can make evaluation more accurate by increasing the number of evaluation environments. Our team is still discussing the best way to evaluate trained agents properly without hindering performance.
- Many tasks support visual observations; however, we have not carefully verified whether the camera poses for these tasks are set up in a way that makes it possible to solve the task from visual observations.

## Citation

If you use this baseline, please cite the following:
```
@article{DBLP:journals/corr/SchulmanWDRK17,
author = {John Schulman and
Filip Wolski and
Prafulla Dhariwal and
Alec Radford and
Oleg Klimov},
title = {Proximal Policy Optimization Algorithms},
journal = {CoRR},
volume = {abs/1707.06347},
year = {2017},
url = {http://arxiv.org/abs/1707.06347},
eprinttype = {arXiv},
eprint = {1707.06347},
timestamp = {Mon, 13 Aug 2018 16:47:34 +0200},
biburl = {https://dblp.org/rec/journals/corr/SchulmanWDRK17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
2 changes: 1 addition & 1 deletion examples/baselines/ppo/ppo.py
@@ -60,7 +60,7 @@ class Args:
num_eval_envs: int = 8
"""the number of parallel evaluation environments"""
partial_reset: bool = True
"""toggle if the environments should perform partial resets"""
"""whether to let parallel environments reset upon termination instead of truncation"""
num_steps: int = 50
"""the number of steps to run in each environment per policy rollout"""
num_eval_steps: int = 50
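
The `partial_reset` docstring above refers to a common pattern in vectorized rollouts: sub-environments that terminate (task success/failure) reset immediately mid-rollout, while truncated ones (time limit) keep their final observation so the value estimate can still be bootstrapped. A generic, non-ManiSkill-specific sketch of the idea, with a hypothetical `reset_subset` helper:

```python
# Generic sketch of partial resets in a vectorized rollout (illustrative; the reset_subset
# helper is hypothetical and not part of any specific library API).
import numpy as np

def rollout_step(envs, policy, obs):
    actions = policy(obs)
    next_obs, rewards, terminated, truncated, infos = envs.step(actions)
    # Only terminated environments restart right away; truncated ones are handled at the
    # end of the rollout so their returns can be bootstrapped from the final observation.
    done_idx = np.flatnonzero(terminated)
    if done_idx.size > 0:
        next_obs[done_idx] = envs.reset_subset(done_idx)  # hypothetical partial-reset helper
    return next_obs, rewards, terminated, truncated, infos
```
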
1 change: 1 addition & 0 deletions examples/baselines/sac-rfcl/.gitignore
@@ -0,0 +1 @@
runs
52 changes: 52 additions & 0 deletions examples/baselines/sac-rfcl/README.md
@@ -0,0 +1,52 @@
# Reverse Forward Curriculum Learning

Fast offline/online imitation learning in simulation based on "Reverse Forward Curriculum Learning for Extreme Sample and Demo Efficiency in Reinforcement Learning (ICLR 2024)". Code adapted from https://github.com/StoneT2000/rfcl/

Currently this code only works with environments that do not have geometry variations between parallel environments (e.g. PickCube).

The code has been tested and is working on the following environments: PickCube-v1

This implementation currently does not include the forward curriculum.
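
To give a feel for the reverse curriculum, below is a small conceptual sketch with hypothetical names (the actual logic in this codebase differs): each episode starts from a demonstration state near the end of a demo, and the start point moves earlier as the agent reliably succeeds from the current start point.

```python
# Conceptual sketch of a reverse curriculum over demonstration states (hypothetical, simplified).
class ReverseCurriculum:
    def __init__(self, demo_env_states, start_offset=1, success_threshold=0.9):
        self.demo_env_states = demo_env_states          # one list of recorded states per demo
        # Per-demo curriculum pointer: how many steps before the demo's end episodes start from.
        self.offsets = [start_offset] * len(demo_env_states)
        self.success_threshold = success_threshold

    def sample_initial_state(self, demo_idx):
        states = self.demo_env_states[demo_idx]
        t = max(len(states) - 1 - self.offsets[demo_idx], 0)
        return states[t]                                # reset the env to this recorded state

    def update(self, demo_idx, recent_success_rate):
        # Once the agent reliably succeeds from the current start point, move it earlier.
        if recent_success_rate >= self.success_threshold:
            self.offsets[demo_idx] = min(self.offsets[demo_idx] + 1,
                                         len(self.demo_env_states[demo_idx]) - 1)
```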

## Download and Process Dataset

Download demonstrations for a desired task e.g. PickCube-v1
```bash
python -m mani_skill.utils.download_demo "PickCube-v1"
```

Process the demonstrations in preparation for the imitation learning workflow
```bash
python -m mani_skill.trajectory.replay_trajectory \
--traj-path ~/.maniskill/demos/PickCube-v1/teleop/trajectory.h5 \
--use-first-env-state -b "gpu" \
-c pd_joint_delta_pos -o state \
--save-traj
```

## Train

```bash
python sac_rfcl.py --env_id="PickCube-v1" \
--num_envs=16 --training_freq=32 --utd=0.5 --buffer_size=1_000_000 \
--total_timesteps=1_000_000 --eval_freq=25_000 \
--dataset_path=~/.maniskill/demos/PickCube-v1/teleop/trajectory.state.pd_joint_delta_pos.h5 \
--num-demos=5 --seed=2 --save_train_video_freq=15
```


## Additional Notes about Implementation

For SAC with RFCL, we always bootstrap on truncated/done.
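
Concretely, this means the `(1 - done)` mask that usually zeroes the next-state value in the SAC critic target is dropped. Below is a rough PyTorch-style sketch of that target computation, assuming an actor that returns `(action, log_prob)`; it is illustrative only, not the exact code in this baseline.

```python
# SAC critic target that always bootstraps (illustrative sketch, not the exact baseline code).
import torch

@torch.no_grad()
def critic_target(reward, next_obs, actor, q1_target, q2_target, alpha, gamma=0.99):
    next_action, next_log_prob = actor(next_obs)              # sample a' ~ pi(.|s')
    next_q = torch.min(q1_target(next_obs, next_action),
                       q2_target(next_obs, next_action))
    next_value = next_q - alpha * next_log_prob               # soft value of s'
    # Standard SAC would compute: reward + gamma * (1 - done) * next_value.
    # Here we always bootstrap, i.e. the (1 - done) mask is omitted:
    return reward + gamma * next_value
```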

## Citation

If you use this baseline, please cite the following:
```
@inproceedings{tao2024rfcl,
title={Reverse Forward Curriculum Learning for Extreme Sample and Demonstration Efficiency in RL},
author={Tao, Stone and Shukla, Arth and Chan, Tse-kai and Su, Hao},
booktitle = {International Conference on Learning Representations (ICLR)},
year={2024}
}
```