
Commit

[Feature] Update PPO RGB baselines (#475)
* work

* bug fix

* Update ppo_rgb.py

* better hyperparameters and bug fixes

* work

* work

* docs

* make quadruped reach markov in RGB setting (for infinite horizon MDP)

* w
StoneT2000 authored Aug 14, 2024
1 parent 5da1a67 commit 5643515
Showing 7 changed files with 151 additions and 75 deletions.
6 changes: 3 additions & 3 deletions docs/source/user_guide/reinforcement_learning/baselines.md
@@ -8,7 +8,7 @@ As part of these baselines we establish standardized [reinforcement learning ben

## Online Reinforcement Learning Baselines

- List of already implemented and tested online reinforcement learning baselines. Note that there are also reinforcement learning (offline RL, online imitation learning) baselines that leverage demonstrations, see the [learning from demos page](../learning_from_demos/index.md) for more information.
+ List of already implemented and tested online reinforcement learning baselines. The results links take you to the respective wandb pages for each baseline. You can change filters/views in the wandb workspace to view results with other settings (e.g. state based or RGB based training). Note that there are also reinforcement learning (offline RL, online imitation learning) baselines that leverage demonstrations; see the [learning from demos page](../learning_from_demos/index.md) for more information.

| Baseline | Code | Results | Paper |
| ------------------------------------------------------------------- | ------------------------------------------------------------------------------ | ------- | ---------------------------------------- |
@@ -18,15 +18,15 @@ List of already implemented and tested online reinforcement learning baselines.

## Standard Benchmark

- The standard benchmark for RL in ManiSkill consists of two groups, a small set of 10 tasks, and a large set of 50 tasks, both with state based and visual based settings. All standard benchmark tasks come with normalized dense reward functions. A recommended small set is created so researchers without access to a lot of compute can still reasonably benchmark/compare their work. The large set is still being developed and tested.
+ The standard benchmark for RL in ManiSkill consists of two groups: a small set of 8 tasks and a large set of 50 tasks, both with state based and visual based settings. All standard benchmark tasks come with normalized dense reward functions. The recommended small set exists so that researchers without access to a lot of compute can still reasonably benchmark and compare their work. The large set is still being developed and tested.


These tasks span an extremely wide range of problems in robotics/reinforcement learning, namely: high dimensional observations/actions, large initial state distributions, articulated object manipulation, generalizable manipulation, mobile manipulation, locomotion etc.


**Small Set Environment IDs**:
<!-- PushCube-v1, PickCube-v1, StackCube-v1, PegInsertionSide-v1, PushT-v1, PickSingleYCB-v1, PlugCharger-v1, OpenCabinetDrawer-v1, HumanoidPlaceAppleInBowl-v1, AnymalC-Reach-v1 -->
- PushCube-v1, PickCube-v1, PegInsertionSide-v1, PushT-v1, HumanoidPlaceAppleInBowl-v1, AnymalC-Reach-v1
+ PushCube-v1, PickCube-v1, PegInsertionSide-v1, PushT-v1, HumanoidPlaceAppleInBowl-v1, AnymalC-Reach-v1, OpenCabinetDrawer-v1
<!-- TODO: add image of all tasks / gif of them -->

<!--
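Any of the small-set environment IDs above can be dropped into the baseline scripts this commit touches. A minimal sketch of a state-based run, assembled only from flags that appear elsewhere in this diff (the environment, seed, and values here are illustrative, not the tuned per-task settings):

```bash
# Sketch only: state-based PPO on a small-set task, reusing flags shown in this commit.
# Tuned settings for several tasks are the ones listed in baselines.sh below.
python ppo.py --env_id="PegInsertionSide-v1" --seed=9351 \
    --num_envs=1024 --update_epochs=8 --num_minibatches=32 --reward_scale=1 \
    --total_timesteps=50_000_000 --num_eval_envs=16
```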
4 changes: 3 additions & 1 deletion examples/baselines/ppo/README.md
@@ -60,7 +60,9 @@ python ppo_rgb.py --env_id="PickCube-v1" \

and it will save videos to the `path/to/test_videos`.

- The results of running the baseline scripts for RGB based PPO are a WIP.
+ The examples.sh file has a full list of tested commands for running RGB based PPO successfully on many tasks.
+
+ The results of running the baseline scripts for RGB based PPO are here: https://wandb.ai/stonet2000/ManiSkill/groups/PPO/workspace?nw=69soa2dqa9h

## Visual (RGB+Depth) Based RL

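Putting the README snippet above together with the flags used elsewhere in this commit, a full RGB-based PPO run of the kind examples.sh lists might look like the following (a sketch: the seed, experiment name, and wandb entity are placeholders):

```bash
# Sketch of an RGB-based PPO run built from flags that appear in this commit;
# seed, exp-name, and wandb entity are placeholders, not tested settings.
python ppo_rgb.py --env_id="PickCube-v1" --seed=9351 \
    --num_envs=256 --update_epochs=8 --num_minibatches=8 --reward_scale=1 \
    --total_timesteps=50_000_000 --num_eval_envs=16 \
    --exp-name="ppo-PickCube-v1-rgb-test" \
    --wandb_entity="your-entity" --track
```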
64 changes: 54 additions & 10 deletions examples/baselines/ppo/baselines.sh
@@ -4,32 +4,34 @@
# Furthermore, because of how these are evaluated, the hyperparameters here are tuned differently compared to runs with partial resets

seeds=(9351 4796 1788)

### State Based PPO Baselines ###
for seed in ${seeds[@]}
do
python ppo.py --env_id="PushCube-v1" --seed=${seed} \
- --num_envs=2048 --update_epochs=8 --num_minibatches=32 --reward_scale=0.1 \
- --total_timesteps=50_000_000 --eval_freq=10 --num-steps=20 \
- --no_partial_reset --reconfiguration_freq=1 \
+ --num_envs=1024 --update_epochs=8 --num_minibatches=32 --reward_scale=1 \
+ --total_timesteps=50_000_000 \
+ --no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PushCube-v1-state-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done

for seed in ${seeds[@]}
do
python ppo.py --env_id="PickCube-v1" --seed=${seed} \
- --num_envs=1024 --update_epochs=8 --num_minibatches=32 --reward_scale=0.1 \
+ --num_envs=1024 --update_epochs=8 --num_minibatches=32 --reward_scale=1 \
--total_timesteps=50_000_000 \
- --no_partial_reset --reconfiguration_freq=1 \
+ --no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PickCube-v1-state-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done

for seed in ${seeds[@]}
do
python ppo.py --env_id="PickSingleYCB-v1" --seed=${seed} \
- --num_envs=1024 --update_epochs=8 --num_minibatches=32 --reward_scale=0.1 \
+ --num_envs=1024 --update_epochs=8 --num_minibatches=32 --reward_scale=1 \
--total_timesteps=50_000_000 \
- --no_partial_reset --reconfiguration_freq=1 \
+ --no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PickSingleYCB-v1-state-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done
@@ -39,7 +41,7 @@ do
python ppo.py --env_id="PushT-v1" --seed=${seed} \
--num_envs=1024 --update_epochs=8 --num_minibatches=32 --gamma=0.99 --reward_scale=0.1 \
--total_timesteps=50_000_000 --num-steps=100 --num_eval_steps=100 \
- --no_partial_reset --reconfiguration_freq=1 \
+ --no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PushT-v1-state-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done
@@ -49,7 +51,49 @@ do
python ppo.py --env_id="AnymalC-Reach-v1" --seed=${seed} \
--num_envs=1024 --update_epochs=8 --num_minibatches=32 --gamma=0.99 --gae_lambda=0.95 --reward_scale=0.1 \
--total_timesteps=50_000_000 --num-steps=200 --num-eval-steps=200 \
- --no_partial_reset --reconfiguration_freq=1 \
+ --no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-AnymalC-Reach-v1-state-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done

### RGB Based PPO Baselines ###
for seed in ${seeds[@]}
do
python ppo_rgb.py --env_id="PushCube-v1" --seed=${seed} \
--num_envs=256 --update_epochs=8 --num_minibatches=8 --reward_scale=1 \
--total_timesteps=50_000_000 \
--no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PushCube-v1-rgb-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done

for seed in ${seeds[@]}
do
python ppo_rgb.py --env_id="PickCube-v1" --seed=${seed} \
--num_envs=256 --update_epochs=8 --num_minibatches=8 --reward_scale=1 \
--total_timesteps=50_000_000 \
--no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PickCube-v1-rgb-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done

for seed in ${seeds[@]}
do
python ppo_rgb.py --env_id="AnymalC-Reach-v1" --seed=${seed} \
--num_envs=256 --update_epochs=8 --num_minibatches=32 --reward_scale=0.1 \
--total_timesteps=50_000_000 --num-steps=200 --num-eval-steps=200 \
--gamma=0.99 --gae_lambda=0.95 \
--no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-AnymalC-Reach-v1-rgb-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done

for seed in ${seeds[@]}
do
python ppo_rgb.py --env_id="PushT-v1" --seed=${seed} \
--num_envs=256 --update_epochs=8 --num_minibatches=8 --reward_scale=0.1 \
--total_timesteps=50_000_000 --num-steps=100 --num_eval_steps=100 --gamma=0.99 \
--no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
--exp-name="ppo-PushT-v1-rgb-${seed}-walltime_efficient" \
--wandb_entity="stonet2000" --track
done
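The four RGB blocks above share one structure and differ mainly in env_id and a few per-task flags; where the flags are identical (PushCube-v1 and PickCube-v1), the same runs could be written more compactly. This is a sketch for illustration, not part of the commit:

```bash
# Sketch (not part of this commit): PushCube-v1 and PickCube-v1 use identical
# RGB PPO settings above, so their runs can share a single loop.
seeds=(9351 4796 1788)
for env_id in "PushCube-v1" "PickCube-v1"; do
  for seed in "${seeds[@]}"; do
    python ppo_rgb.py --env_id="${env_id}" --seed=${seed} \
      --num_envs=256 --update_epochs=8 --num_minibatches=8 --reward_scale=1 \
      --total_timesteps=50_000_000 \
      --no_partial_reset --reconfiguration_freq=1 --num_eval_envs=16 \
      --exp-name="ppo-${env_id}-rgb-${seed}-walltime_efficient" \
      --wandb_entity="stonet2000" --track
  done
done
```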
4 changes: 2 additions & 2 deletions examples/baselines/ppo/ppo.py
@@ -211,8 +211,8 @@ def close(self):
if args.track:
import wandb
config = vars(args)
config["env_cfg"] = dict(**env_kwargs, num_envs=args.num_envs, env_id=args.env_id, reward_mode="normalized_dense", env_horizon=max_episode_steps)
config["eval_env_cfg"] = dict(**env_kwargs, num_envs=args.num_eval_envs, env_id=args.env_id, reward_mode="normalized_dense", env_horizon=max_episode_steps)
config["env_cfg"] = dict(**env_kwargs, num_envs=args.num_envs, env_id=args.env_id, reward_mode="normalized_dense", env_horizon=max_episode_steps, partial_reset=args.partial_reset)
config["eval_env_cfg"] = dict(**env_kwargs, num_envs=args.num_eval_envs, env_id=args.env_id, reward_mode="normalized_dense", env_horizon=max_episode_steps, partial_reset=args.partial_reset)
wandb.init(
project=args.wandb_project_name,
entity=args.wandb_entity,
