
Commit

[Feature] Update PPO RGB baselines (#475)
* work

* bug fix

* Update ppo_rgb.py

* better hyperparameters and bug fixes

* work

* work

* docs

* make quadruped reach markov in RGB setting (for infinite horizon MDP)

* w
StoneT2000 authored Aug 14, 2024
1 parent 5da1a67 commit 5643515
Showing 7 changed files with 151 additions and 75 deletions.
6 changes: 3 additions & 3 deletions docs/source/user_guide/reinforcement_learning/baselines.md
@@ -8,7 +8,7 @@ As part of these baselines we establish standardized [reinforcement learning ben

## Online Reinforcement Learning Baselines

- List of already implemented and tested online reinforcement learning baselines. Note that there are also reinforcement learning (offline RL, online imitation learning) baselines that leverage demonstrations, see the [learning from demos page](../learning_from_demos/index.md) for more information.
+ List of already implemented and tested online reinforcement learning baselines. The results links take you to the respective wandb pages for each baseline. You can change filters/views in the wandb workspace to view results with other settings (e.g. state based or RGB based training). Note that there are also reinforcement learning (offline RL, online imitation learning) baselines that leverage demonstrations; see the [learning from demos page](../learning_from_demos/index.md) for more information.

| Baseline | Code | Results | Paper |
| ------------------------------------------------------------------- | ------------------------------------------------------------------------------ | ------- | ---------------------------------------- |
@@ -18,15 +18,15 @@ List of already implemented and tested online reinforcement learning baselines.

## Standard Benchmark

- The standard benchmark for RL in ManiSkill consists of two groups, a small set of 10 tasks, and a large set of 50 tasks, both with state based and visual based settings. All standard benchmark tasks come with normalized dense reward functions. A recommended small set is created so researchers without access to a lot of compute can still reasonably benchmark/compare their work. The large set is still being developed and tested.
+ The standard benchmark for RL in ManiSkill consists of two groups: a small set of 8 tasks and a large set of 50 tasks, both with state based and visual based settings. All standard benchmark tasks come with normalized dense reward functions. The recommended small set exists so that researchers without access to a lot of compute can still reasonably benchmark and compare their work. The large set is still being developed and tested.


These tasks span an extremely wide range of problems in robotics/reinforcement learning, namely: high dimensional observations/actions, large initial state distributions, articulated object manipulation, generalizable manipulation, mobile manipulation, locomotion etc.


**Small Set Environment IDs**:
<!-- PushCube-v1, PickCube-v1, StackCube-v1, PegInsertionSide-v1, PushT-v1, PickSingleYCB-v1, PlugCharger-v1, OpenCabinetDrawer-v1, HumanoidPlaceAppleInBowl-v1, AnymalC-Reach-v1 -->
- PushCube-v1, PickCube-v1, PegInsertionSide-v1, PushT-v1, HumanoidPlaceAppleInBowl-v1, AnymalC-Reach-v1
+ PushCube-v1, PickCube-v1, PegInsertionSide-v1, PushT-v1, HumanoidPlaceAppleInBowl-v1, AnymalC-Reach-v1, OpenCabinetDrawer-v1
<!-- TODO: add image of all tasks / gif of them -->

<!--
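Any of the small-set environment IDs above can be dropped into the baseline scripts this commit touches. A minimal sketch of a state-based run, assembled only from flags that appear elsewhere in this diff (the environment, seed, and values here are illustrative, not the tuned per-task settings):

```bash
# Sketch only: state-based PPO on a small-set task, reusing flags shown in this commit.
# Tuned settings for several tasks are the ones listed in baselines.sh below.
python ppo.py --env_id="PegInsertionSide-v1" --seed=9351 \
    --num_envs=1024 --update_epochs=8 --num_minibatches=32 --reward_scale=1 \
    --total_timesteps=50_000_000 --num_eval_envs=16
```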
4 changes: 3 additions & 1 deletion examples/baselines/ppo/README.md
@@ -60,7 +60,9 @@ python ppo_rgb.py --env_id="PickCube-v1" \

and it will save videos to the `path/to/test_videos`.

- The results of running the baseline scripts for RGB based PPO are a WIP.
+ The examples.sh file has a full list of tested commands for running RGB based PPO successfully on many tasks.
+
+ The results of running the baseline scripts for RGB based PPO are here: https://wandb.ai/stonet2000/ManiSkill/groups/PPO/workspace?nw=69soa2dqa9h

## Visual (RGB+Depth) Based RL

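Putting the README snippet above together with the flags used elsewhere in this commit, a full RGB-based PPO run of the kind examples.sh lists might look like the following (a sketch: the seed, experiment name, and wandb entity are placeholders):

```bash
# Sketch of an RGB-based PPO run built from flags that appear in this commit;
# seed, exp-name, and wandb entity are placeholders, not tested settings.
python ppo_rgb.py --env_id="PickCube-v1" --seed=9351 \
    --num_envs=256 --update_epochs=8 --num_minibatches=8 --reward_scale=1 \
    --total_timesteps=50_000_000 --num_eval_envs=16 \
    --exp-name="ppo-PickCube-v1-rgb-test" \
    --wandb_entity="your-entity" --track
```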
64 changes: 54 additions & 10 deletions examples/baselines/ppo/baselines.sh
@@ -4,32 +4,34 @@
# Furthermore, because of how these are evaluated, the hyperparameters here are tuned differently compared to runs with partial resets

seeds=(9351 4796 1788)

### State Based PPO Baselines ###
for seed in ${seeds[@]}
do
python ppo.py --env_id="PushCube-v1" --seed=${seed} \
- --num_envs=2048 --update_epochs=8 --num_minibatches=32 --reward_scale=0.1 \
- --total_timesteps=50_000_000 --eval_freq=10 --num-steps=20 \
- --no_partial_reset --reconfiguration_freq=1 \
+ --num_envs=1024 --update_epochs=8 --num_minibatches=32 --reward_scale=1 \
+ --total_timesteps=50_000_000 \
+ --no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PushCube-v1-state-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done

for seed in ${seeds[@]}
do
python ppo.py --env_id="PickCube-v1" --seed=${seed} \
- --num_envs=1024 --update_epochs=8 --num_minibatches=32 --reward_scale=0.1 \
+ --num_envs=1024 --update_epochs=8 --num_minibatches=32 --reward_scale=1 \
--total_timesteps=50_000_000 \
- --no_partial_reset --reconfiguration_freq=1 \
+ --no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PickCube-v1-state-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done

for seed in ${seeds[@]}
do
python ppo.py --env_id="PickSingleYCB-v1" --seed=${seed} \
- --num_envs=1024 --update_epochs=8 --num_minibatches=32 --reward_scale=0.1 \
+ --num_envs=1024 --update_epochs=8 --num_minibatches=32 --reward_scale=1 \
--total_timesteps=50_000_000 \
- --no_partial_reset --reconfiguration_freq=1 \
+ --no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PickSingleYCB-v1-state-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done
@@ -39,7 +41,7 @@ do
python ppo.py --env_id="PushT-v1" --seed=${seed} \
--num_envs=1024 --update_epochs=8 --num_minibatches=32 --gamma=0.99 --reward_scale=0.1 \
--total_timesteps=50_000_000 --num-steps=100 --num_eval_steps=100 \
- --no_partial_reset --reconfiguration_freq=1 \
+ --no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PushT-v1-state-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done
@@ -49,7 +51,49 @@ do
python ppo.py --env_id="AnymalC-Reach-v1" --seed=${seed} \
--num_envs=1024 --update_epochs=8 --num_minibatches=32 --gamma=0.99 --gae_lambda=0.95 --reward_scale=0.1 \
--total_timesteps=50_000_000 --num-steps=200 --num-eval-steps=200 \
- --no_partial_reset --reconfiguration_freq=1 \
+ --no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-AnymalC-Reach-v1-state-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done

### RGB Based PPO Baselines ###
for seed in ${seeds[@]}
do
python ppo_rgb.py --env_id="PushCube-v1" --seed=${seed} \
--num_envs=256 --update_epochs=8 --num_minibatches=8 --reward_scale=1 \
--total_timesteps=50_000_000 \
--no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PushCube-v1-rgb-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done

for seed in ${seeds[@]}
do
python ppo_rgb.py --env_id="PickCube-v1" --seed=${seed} \
--num_envs=256 --update_epochs=8 --num_minibatches=8 --reward_scale=1 \
--total_timesteps=50_000_000 \
--no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PickCube-v1-rgb-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done

for seed in ${seeds[@]}
do
python ppo_rgb.py --env_id="AnymalC-Reach-v1" --seed=${seed} \
--num_envs=256 --update_epochs=8 --num_minibatches=32 --reward_scale=0.1 \
--total_timesteps=50_000_000 --num-steps=200 --num-eval-steps=200 \
--gamma=0.99 --gae_lambda=0.95 \
--no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-AnymalC-Reach-v1-rgb-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done

for seed in ${seeds[@]}
do
python ppo_rgb.py --env_id="PushT-v1" --seed=${seed} \
--num_envs=256 --update_epochs=8 --num_minibatches=8 --reward_scale=0.1 \
--total_timesteps=50_000_000 --num-steps=100 --num_eval_steps=100 --gamma=0.99 \
--no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PushT-v1-rgb-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done
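The four RGB blocks above share one structure and differ mainly in env_id and a few per-task flags; where the flags are identical (PushCube-v1 and PickCube-v1), the same runs could be written more compactly. This is a sketch for illustration, not part of the commit:

```bash
# Sketch (not part of this commit): PushCube-v1 and PickCube-v1 use identical
# RGB PPO settings above, so their runs can share a single loop.
seeds=(9351 4796 1788)
for env_id in "PushCube-v1" "PickCube-v1"; do
  for seed in "${seeds[@]}"; do
    python ppo_rgb.py --env_id="${env_id}" --seed=${seed} \
      --num_envs=256 --update_epochs=8 --num_minibatches=8 --reward_scale=1 \
      --total_timesteps=50_000_000 \
      --no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
      --exp-name="ppo-${env_id}-rgb-${seed}-walltime_efficient" \
      --wandb_entity="stonet2000" --track
  done
done
```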
4 changes: 2 additions & 2 deletions examples/baselines/ppo/ppo.py
@@ -211,8 +211,8 @@ def close(self):
if args.track:
import wandb
config = vars(args)
config["env_cfg"] = dict(**env_kwargs, num_envs=args.num_envs, env_id=args.env_id, reward_mode="normalized_dense", env_horizon=max_episode_steps)
config["eval_env_cfg"] = dict(**env_kwargs, num_envs=args.num_eval_envs, env_id=args.env_id, reward_mode="normalized_dense", env_horizon=max_episode_steps)
config["env_cfg"] = dict(**env_kwargs, num_envs=args.num_envs, env_id=args.env_id, reward_mode="normalized_dense", env_horizon=max_episode_steps, partial_reset=args.partial_reset)
config["eval_env_cfg"] = dict(**env_kwargs, num_envs=args.num_eval_envs, env_id=args.env_id, reward_mode="normalized_dense", env_horizon=max_episode_steps, partial_reset=args.partial_reset)
wandb.init(
project=args.wandb_project_name,
entity=args.wandb_entity,
