[Feature] SAC + reverse forward curriculum learning (#365)
* support using first env state and replaying actions on the GPU for CPU-GPU transfer

* work

* add offline buffer, better logging

* work

* Update sac.py

* fixes

* working baseline for difficult pick cube task (due to static requirement)

* partial reset in GPU sim support + partial reset support for RFCL + SAC

* bug fix

* bug fixes

* bug fixes

* bug fixes

* Update README.md

* add citations, docs

* docs

* docs

* reorganize code, note forward curr not done yet.

* fixes

* weighted trajectory sampling
StoneT2000 authored Jun 4, 2024
1 parent 993c56f commit 7033cb2
Showing 21 changed files with 1,083 additions and 57 deletions.
60 changes: 42 additions & 18 deletions docs/source/user_guide/datasets/replay.md
@@ -19,44 +19,68 @@ python -m mani_skill.trajectory.replay_trajectory -h
The script requires `trajectory.h5` and `trajectory.json` to be both under the same directory.
:::

The raw demonstration files contain all the necessary information (e.g. initial states, actions, seeds) to reproduce a trajectory. Observations are not included since they can lead to large file sizes without postprocessing. In addition, actions in these files do not cover all control modes. Therefore, you need to convert the raw files into your desired observation and control modes. We provide a utility script that works as follows:
By default, raw demonstration files contain all the necessary information (e.g. initial states, actions, seeds) to reproduce a trajectory. Observations are not included since they can lead to large file sizes without postprocessing. In addition, actions in these files do not cover all control modes. Therefore, you need to convert the raw files into your desired observation and control modes. We provide a utility script that works as follows:

```bash
# Replay demonstrations with control_mode=pd_joint_delta_pos
python -m mani_skill.trajectory.replay_trajectory \
--traj-path demos/rigid_body/PickCube-v1/trajectory.h5 \
--save-traj --target-control-mode pd_joint_delta_pos --obs-mode none --num-procs 10
--save-traj --target-control-mode pd_joint_delta_pos \
--obs-mode none --num-procs 10
```

<details>

<summary><b>Click here</b> for important notes about the script arguments.</summary>
:::{dropdown} Click here to see the replay trajectory tool options

Command Line Options:

- `--save-traj`: save the replayed trajectory to the same folder as the original trajectory file.
- `--num-procs=10`: split trajectories to multiple processes (e.g., 10 processes) for acceleration.
- `--target-control-mode`: The target control mode / action space to save into the trajectory file.
- `--save-video`: Whether to save a video of the replayed trajectories
- `--max-retry`: Max number of times to try and replay each trajectory
- `--discard-timeout`: Whether to discard trajectories that time out due to the default environment's max episode steps config
- `--allow-failure`: Whether to permit saving failed trajectories
- `--vis`: Whether to open the GUI and show the replayed trajectories on a display
- `--use-first-env-state`: Whether to use the first environment state of the given trajectory to initialize the environment
- `--num-procs=10`: split trajectories to multiple processes (e.g., 10 processes) for acceleration. Note this is done via CPU parallelization, not GPU. This argument is also currently incompatible with using the GPU simulation to replay trajectories.
- `--obs-mode=none`: specify the observation mode as `none`, i.e. not saving any observations.
- `--obs-mode=rgbd`: (not included in the script above) specify the observation mode as `rgbd` to replay the trajectory. If `--save-traj`, the saved trajectory will contain the RGBD observations. RGB images are saved as uint8 and depth images (multiplied by 1024) are saved as uint16.
- `--obs-mode=pointcloud`: (not included in the script above) specify the observation mode as `pointcloud`. We encourage you to further process the point cloud instead of using this point cloud directly.
- `--obs-mode=state`: (not included in the script above) specify the observation mode as `state`. Note that the `state` observation mode is not allowed for challenge submission.
- `--obs-mode=rgbd`: (not included in the script above) specify the observation mode as `rgbd` to replay the trajectory. If `--save-traj`, the saved trajectory will contain the RGBD observations.
- `--obs-mode=pointcloud`: (not included in the script above) specify the observation mode as `pointcloud`. We encourage you to further process the point cloud instead of using it directly (e.g. by sub-sampling the point cloud).
- `--obs-mode=state`: (not included in the script above) specify the observation mode as `state`
- `--use-env-states`: For each time step $t$, after replaying the action at this time step and obtaining a new observation at $t+1$, set the environment state at time $t+1$ to the recorded environment state at time $t+1$. This is necessary for successfully replaying trajectories for the tasks migrated from ManiSkill1 (a conceptual sketch of this is shown right after this options list).
</details>

<br>

- `--count`: Number of demonstrations to replay before exiting. By default all demonstrations are replayed
- `--shader`: Change the shader used for rendering. The default is 'default', which is very fast; 'rt' enables ray tracing for photo-realistic renders, and 'rt-fast' is a faster but lower-quality ray-traced renderer
- `--render-mode`: The render mode used when saving videos
- `-b, --sim-backend`: Which simulation backend to use. Can be 'auto', 'cpu', or 'gpu'
:::
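
For reference, below is a minimal conceptual sketch of what `--use-env-states` does during replay. The environment object, the `set_state` call, and the trajectory dictionary layout are illustrative assumptions for exposition, not the replay tool's internals.

```python
# Conceptual sketch of --use-env-states (illustrative only; not the replay tool's actual code).
# Assumes a gym-style env exposing set_state(), and a trajectory dict holding the recorded
# "actions", per-step "env_states", and reset "seed" (hypothetical layout).
def replay_with_env_states(env, trajectory):
    env.reset(seed=trajectory["seed"])  # reproduce the recorded initial conditions
    for t, action in enumerate(trajectory["actions"]):
        # Step with the recorded action to obtain the observation at t+1 ...
        obs, reward, terminated, truncated, info = env.step(action)
        # ... then overwrite the simulator with the recorded state at t+1 so that small
        # physics divergences cannot accumulate over the episode.
        env.set_state(trajectory["env_states"][t + 1])
    return info
```
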
<!--
:::{note}
For soft-body tasks, please compile and generate caches (`python -m mani_skill.utils.precompile_mpm`) before running the script with multiple processes (with `--num-procs`).
:::

::: -->
<!--
:::{caution}
The conversion between controllers (or action spaces) is not yet supported for mobile manipulators (e.g., used in tasks migrated from ManiSkill1).
:::
::: -->

:::{caution}
Since some demonstrations are collected in a non-quasi-static way (objects are not fixed relative to the manipulator during manipulation) for some challenging tasks (e.g., `TurnFaucet` and tasks migrated from ManiSkill1), replaying actions can fail due to non-determinism in simulation. Thus, replaying trajectories by environment states is required (passing `--use-env-states`).
:::

---
## Example Usages

We recommend using our script only for converting actions into different control modes without recording any observation information (i.e. passing `--obs-mode=none`). The reason is that (1) some observation modes, e.g. point cloud, can take too much space without any post-processing, e.g., point cloud downsampling; in addition, the `state` mode for soft-body tasks also has a similar issue, since the states of those tasks are particles. (2) Some algorithms (e.g. GAIL) require custom keys stored in the demonstration files, e.g. next-observation.
As the replay trajectory tool is fairly complex and feature-rich, we suggest a few example workflows below that may be useful for various use cases.

Thus we recommend that, after you convert actions into different control modes, implement your custom environment wrappers for observation processing. After this, use another script to render and save the corresponding post-processed visual demonstrations. [ManiSkill-Learn](https://github.com/haosulab/ManiSkill-Learn) has included such observation processing wrappers and demonstration conversion script (with multi-processing), so we recommend referring to the repo for more details.

### Replaying Trajectories collected in CPU/GPU sim to GPU/CPU sim

Some demonstrations may have been collected in the CPU simulation, but you may want data that works in the GPU simulation, or vice versa. CPU and GPU simulation inherently behave slightly differently given the same actions and the same start state.

For example, demos collected via teleoperation are often recorded in the CPU sim for flexibility and single-thread speed, whereas imitation/reinforcement learning workflows might train in the GPU simulation. To ensure the demos can be learned from, we can replay them in the GPU simulation and save the ones that replay successfully. This is done by using the first environment state, forcing the GPU simulation with `-b "gpu"`, and setting the desired control and observation modes.

```bash
python -m mani_skill.trajectory.replay_trajectory \
--traj-path path/to/trajectory.h5 \
--use-first-env-state -b "gpu" \
-c pd_joint_delta_pos -o state \
--save-traj
```
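
After conversion, you can quickly sanity check the saved file. The sketch below only assumes the HDF5 file contains one group per episode named `traj_<i>` with an `actions` dataset (the file name follows the convention used later in this commit; adjust the path and keys to match your output):

```python
# Minimal sanity check of a converted trajectory file (assumes a traj_<i>/actions layout).
import h5py

with h5py.File("path/to/trajectory.state.pd_joint_delta_pos.h5", "r") as f:
    episodes = [k for k in f.keys() if k.startswith("traj_")]
    print(f"{len(episodes)} episodes found")
    for k in episodes[:3]:
        actions = f[k]["actions"]
        print(k, "actions shape:", actions.shape)  # (episode_length, action_dim)
```
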
4 changes: 0 additions & 4 deletions docs/source/user_guide/index.md
@@ -40,11 +40,7 @@ tutorials/index
concepts/index
datasets/index
data_collection/index
```
<!-- algorithms_and_models/index
workflows/index -->


```{toctree}
:maxdepth: 2
9 changes: 7 additions & 2 deletions docs/source/user_guide/workflows/index.md
@@ -1,7 +1,12 @@
# Workflows

We provide a number of tuned baselines/workflows for robot learning and training autonomous policies to solve robotics tasks. These span learning from demonstrations/imitation learning and reinforcement learning.

This is still a WIP, but we plan to upload pretrained checkpoints, training curves, and more for all solvable tasks (some need much more advanced techniques to solve) so that people can research and compare against them.

```{toctree}
:titlesonly:
:glob:
*
learning_from_demos/index
reinforcement_learning/index
```
1 change: 0 additions & 1 deletion docs/source/user_guide/workflows/learning_from_demos.md

This file was deleted.

24 changes: 24 additions & 0 deletions docs/source/user_guide/workflows/learning_from_demos/index.md
@@ -0,0 +1,24 @@
# Learning from Demonstrations / Imitation Learning

We provide a number of different baselines spanning different categories of learning from demonstrations research: Behavior Cloning / Supervised Learning, Offline Reinforcement Learning, and Online Learning from Demonstrations.

As part of these baselines, we establish a few standard learning-from-demonstrations benchmarks that cover a wide range of difficulty (easy enough to solve for verification but not saturated) and a diversity of demonstration types (human-collected, motion-planning-collected, and neural-net-policy-generated).

**Behavior Cloning Baselines**
| Baseline | Code | Results |
| ---------------------------------- | ---- | ------- |
| Standard Behavior Cloning (BC) | WIP | WIP |
| Diffusion Policy (DP) | WIP | WIP |
| Action Chunk Transformers (ACT) | WIP | WIP |
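
Since the behavior cloning entries above are still WIP, here is a minimal sketch of what standard BC boils down to: regress demonstration actions from observations with supervised learning. This is plain PyTorch for exposition, not the upcoming baseline code.

```python
# Minimal behavior cloning sketch (illustrative; not the WIP baseline implementation).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def behavior_cloning(policy: nn.Module, demo_obs: torch.Tensor, demo_actions: torch.Tensor,
                     epochs: int = 100, batch_size: int = 256, lr: float = 3e-4) -> nn.Module:
    """Fit policy(obs) -> action on (obs, action) pairs extracted from demonstrations."""
    loader = DataLoader(TensorDataset(demo_obs, demo_actions), batch_size=batch_size, shuffle=True)
    optim = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for obs, action in loader:
            loss = nn.functional.mse_loss(policy(obs), action)  # regress demo actions
            optim.zero_grad()
            loss.backward()
            optim.step()
    return policy
```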


**Online Learning from Demonstrations Baselines**

| Baseline | Code | Results | Paper |
| --------------------------------------------------- | ----------------------------------------------------------------------------------- | ------- | ---------------------------------------- |
| SAC+Reverse Forward Curriculum Learning (SAC+RFCL)* | [Link](https://github.com/haosulab/ManiSkill/blob/main/examples/baselines/sac-rfcl) | WIP | [Link](https://arxiv.org/abs/2405.03379) |
| Reinforcement Learning from Prior Data (RLPD) | WIP | WIP | [Link](https://arxiv.org/abs/2302.02948) |
| SAC + Demos (SAC+Demos) | WIP | N/A | |


\* - This indicates the baseline uses environment state reset
1 change: 0 additions & 1 deletion docs/source/user_guide/workflows/reinforcement_learning.md

This file was deleted.

15 changes: 15 additions & 0 deletions docs/source/user_guide/workflows/reinforcement_learning/index.md
@@ -0,0 +1,15 @@
# Reinforcement Learning (WIP)

We provide a number of different baselines that learn from rewards. For RL baselines that leverage demonstrations, see the [learning from demos section](../learning_from_demos/).

As part of these baselines, we establish a few reinforcement learning benchmarks that cover a wide range of difficulties (easy enough to solve for verification but not saturated) and a diversity of robotics tasks, including but not limited to classic control, dexterous manipulation, table-top manipulation, and mobile manipulation.
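
As a point of orientation, most of these baselines start from a vectorized ManiSkill environment created through gymnasium. The sketch below is an assumption-level example: the kwargs (`num_envs`, `obs_mode`, `control_mode`) mirror the CLI flags used elsewhere in these docs, but check each baseline script for the exact environment setup it uses.

```python
# Rough sketch of environment creation for RL training. The gym.make kwargs here are
# assumptions mirroring the CLI flags used in these docs; consult the baseline scripts
# for the exact setup.
import gymnasium as gym
import mani_skill.envs  # noqa: F401  (importing registers ManiSkill tasks with gymnasium)

env = gym.make(
    "PickCube-v1",
    num_envs=16,                 # parallel environments when using the GPU simulation
    obs_mode="state",
    control_mode="pd_joint_delta_pos",
)
obs, _ = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()  # replace with a trained policy
    obs, reward, terminated, truncated, info = env.step(action)
```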


**Online Reinforcement Learning Baselines**

| Baseline | Code | Results | Paper |
| ------------------------------------------------------------------- | ------------------------------------------------------------------------------ | ------- | ---------------------------------------- |
| Proximal Policy Optimization (PPO) | [Link](https://github.com/haosulab/ManiSkill/blob/main/examples/baselines/ppo) | WIP | [Link](http://arxiv.org/abs/1707.06347) |
| Soft Actor Critic (SAC) | [Link](https://github.com/haosulab/ManiSkill/blob/main/examples/baselines/sac) | WIP | [Link](https://arxiv.org/abs/1801.01290) |
| Temporal Difference Learning for Model Predictive Control (TD-MPC2) | WIP | WIP | [Link](https://arxiv.org/abs/2310.16828) |

25 changes: 24 additions & 1 deletion examples/baselines/ppo/README.md
@@ -144,4 +144,27 @@ This will use environment states to replay trajectories, turn on the ray-tracer
## Some Notes

- The code currently does not evaluate agents in the most accurate way: during GPU simulation, all assets are frozen per parallel environment (changing them slows training down). Thus, even though evaluation runs on multiple environments at once (8 by default), each environment always features the same set of geometry. This only affects tasks with geometry variation (e.g. PickClutterYCB, OpenCabinetDrawer). You can make evaluation more accurate by increasing the number of evaluation environments. Our team is still discussing the best way to evaluate trained agents properly without hindering performance.
- Many tasks support visual observations; however, we have not carefully verified whether the camera poses for these tasks are set up in a way that makes it possible to solve the task from visual observations.

## Citation

If you use this baseline, please cite the following:
```
@article{DBLP:journals/corr/SchulmanWDRK17,
author = {John Schulman and
Filip Wolski and
Prafulla Dhariwal and
Alec Radford and
Oleg Klimov},
title = {Proximal Policy Optimization Algorithms},
journal = {CoRR},
volume = {abs/1707.06347},
year = {2017},
url = {http://arxiv.org/abs/1707.06347},
eprinttype = {arXiv},
eprint = {1707.06347},
timestamp = {Mon, 13 Aug 2018 16:47:34 +0200},
biburl = {https://dblp.org/rec/journals/corr/SchulmanWDRK17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```
2 changes: 1 addition & 1 deletion examples/baselines/ppo/ppo.py
@@ -60,7 +60,7 @@ class Args:
num_eval_envs: int = 8
"""the number of parallel evaluation environments"""
partial_reset: bool = True
"""toggle if the environments should perform partial resets"""
"""whether to let parallel environments reset upon termination instead of truncation"""
num_steps: int = 50
"""the number of steps to run in each environment per policy rollout"""
num_eval_steps: int = 50
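
The `partial_reset` docstring above refers to a common pattern in vectorized rollouts: sub-environments that terminate (task success/failure) reset immediately mid-rollout, while truncated ones (time limit) keep their final observation so the value estimate can still be bootstrapped. A generic, non-ManiSkill-specific sketch of the idea, with a hypothetical `reset_subset` helper:

```python
# Generic sketch of partial resets in a vectorized rollout (illustrative; the reset_subset
# helper is hypothetical and not part of any specific library API).
import numpy as np

def rollout_step(envs, policy, obs):
    actions = policy(obs)
    next_obs, rewards, terminated, truncated, infos = envs.step(actions)
    # Only terminated environments restart right away; truncated ones are handled at the
    # end of the rollout so their returns can be bootstrapped from the final observation.
    done_idx = np.flatnonzero(terminated)
    if done_idx.size > 0:
        next_obs[done_idx] = envs.reset_subset(done_idx)  # hypothetical partial-reset helper
    return next_obs, rewards, terminated, truncated, infos
```
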
1 change: 1 addition & 0 deletions examples/baselines/sac-rfcl/.gitignore
@@ -0,0 +1 @@
runs
52 changes: 52 additions & 0 deletions examples/baselines/sac-rfcl/README.md
@@ -0,0 +1,52 @@
# Reverse Forward Curriculum Learning

Fast offline/online imitation learning in simulation based on "Reverse Forward Curriculum Learning for Extreme Sample and Demo Efficiency in Reinforcement Learning (ICLR 2024)". Code adapted from https://github.com/StoneT2000/rfcl/

Currently this code only works with environments that do not have geometry variations between parallel environments (e.g. PickCube).

The code has been tested and is working on the following environments: PickCube-v1

This implementation currently does not include the forward curriculum.
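
To give a feel for the reverse curriculum, below is a small conceptual sketch with hypothetical names (the actual logic in this codebase differs): each episode starts from a demonstration state near the end of a demo, and the start point moves earlier as the agent reliably succeeds from the current start point.

```python
# Conceptual sketch of a reverse curriculum over demonstration states (hypothetical, simplified).
class ReverseCurriculum:
    def __init__(self, demo_env_states, start_offset=1, success_threshold=0.9):
        self.demo_env_states = demo_env_states          # one list of recorded states per demo
        # Per-demo curriculum pointer: how many steps before the demo's end episodes start from.
        self.offsets = [start_offset] * len(demo_env_states)
        self.success_threshold = success_threshold

    def sample_initial_state(self, demo_idx):
        states = self.demo_env_states[demo_idx]
        t = max(len(states) - 1 - self.offsets[demo_idx], 0)
        return states[t]                                # reset the env to this recorded state

    def update(self, demo_idx, recent_success_rate):
        # Once the agent reliably succeeds from the current start point, move it earlier.
        if recent_success_rate >= self.success_threshold:
            self.offsets[demo_idx] = min(self.offsets[demo_idx] + 1,
                                         len(self.demo_env_states[demo_idx]) - 1)
```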

## Download and Process Dataset

Download demonstrations for a desired task e.g. PickCube-v1
```bash
python -m mani_skill.utils.download_demo "PickCube-v1"
```

Process the demonstrations in preparation for the imitation learning workflow
```bash
python -m mani_skill.trajectory.replay_trajectory \
--traj-path ~/.maniskill/demos/PickCube-v1/teleop/trajectory.h5 \
--use-first-env-state -b "gpu" \
-c pd_joint_delta_pos -o state \
--save-traj
```

## Train

```bash
python sac_rfcl.py --env_id="PickCube-v1" \
--num_envs=16 --training_freq=32 --utd=0.5 --buffer_size=1_000_000 \
--total_timesteps=1_000_000 --eval_freq=25_000 \
--dataset_path=~/.maniskill/demos/PickCube-v1/teleop/trajectory.state.pd_joint_delta_pos.h5 \
--num-demos=5 --seed=2 --save_train_video_freq=15
```


## Additional Notes about Implementation

For SAC with RFCL, we always bootstrap on truncated/done.
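
Concretely, this means the `(1 - done)` mask that usually zeroes the next-state value in the SAC critic target is dropped. Below is a rough PyTorch-style sketch of that target computation, assuming an actor that returns `(action, log_prob)`; it is illustrative only, not the exact code in this baseline.

```python
# SAC critic target that always bootstraps (illustrative sketch, not the exact baseline code).
import torch

@torch.no_grad()
def critic_target(reward, next_obs, actor, q1_target, q2_target, alpha, gamma=0.99):
    next_action, next_log_prob = actor(next_obs)              # sample a' ~ pi(.|s')
    next_q = torch.min(q1_target(next_obs, next_action),
                       q2_target(next_obs, next_action))
    next_value = next_q - alpha * next_log_prob               # soft value of s'
    # Standard SAC would compute: reward + gamma * (1 - done) * next_value.
    # Here we always bootstrap, i.e. the (1 - done) mask is omitted:
    return reward + gamma * next_value
```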

## Citation

If you use this baseline, please cite the following:
```
@inproceedings{tao2024rfcl,
title={Reverse Forward Curriculum Learning for Extreme Sample and Demonstration Efficiency in RL},
author={Tao, Stone and Shukla, Arth and Chan, Tse-kai and Su, Hao},
booktitle = {International Conference on Learning Representations (ICLR)},
year={2024}
}
```