DDPG implementation fails to learn well on at least five MuJoCo-v2 envs for all three noise types. I report steps to reproduce and learning curve plots [and show that PPO2 seems to work fine]. #938
Comments
PPO2 Results as a Sanity Check
To confirm that something related to DDPG is the issue (which might include DDPG-specific processing steps), I ran these commands for PPO2, using the same master branch commit as above. These commands:
Resulted in the following learning curves: Which were generated via a plotting script similar to the one I used in my post above. These look much better!
It therefore seems like something is wrong with DDPG. |
Hello, was the You can find hyperparams (and trained agents) for pybullet envs here (in the rl zoo) NOTE: it uses stable baselines version, but should be the same underlying algorithm. |
@araffin baselines/baselines/ddpg/ddpg.py Line 32 in ba2b017
|
A question about your experiments with OU noise only: But this is only a side question. I don't think that will solve your problem at all. But this clarifies that there seems to be no difference between the hyperparameters of the DeepMind paper and the code from Baselines when using OU noise only. |
@MoritzTaylor Just to be clear, the I agree that it will not solve DDPG's current performance issues. Note that tau by default is 0.01: baselines/baselines/ddpg/ddpg.py Line 42 in ba2b017
The above will override the 0.001 from the DDPG code here: baselines/baselines/ddpg/ddpg_learner.py Lines 66 to 71 in ba2b017
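For readers unfamiliar with what tau controls: in DDPG it is the coefficient of the soft (Polyak) target-network update. The following is a minimal sketch in plain Python, not the baselines code. `soft_update` and `remaining_gap` are hypothetical helper names, and the parameters are plain lists of floats rather than TensorFlow variables.

```python
def soft_update(target_params, source_params, tau):
    """Return target parameters moved a fraction tau toward the source parameters."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target_params, source_params)]


def remaining_gap(tau, num_updates):
    """Fraction of the initial target/source gap left after num_updates soft
    updates, assuming the source parameters stay fixed (a simplification)."""
    return (1.0 - tau) ** num_updates


if __name__ == "__main__":
    # Compare how quickly the target network catches up for the two values
    # of tau discussed in this thread.
    for tau in (0.01, 0.001):
        print(tau, remaining_gap(tau, 1000))
```

Under this simplification, tau = 0.01 makes the target network track the main network roughly ten times faster than tau = 0.001, which is why the override matters when comparing against the DeepMind setting.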
|
As a side note, in case you're interested in quickly trying something out before this issue gets resolved, I would highly recommend the TD3 author's official implementation (which is in PyTorch, though). It has a clean standalone interface, and I benchmarked it a month or so back and it matched the paper's results. Perhaps seeing some differences could help resolve this issue as well. |
@sritee Good point. :) I am actually benchmarking with rlkit right now https://github.com/vitchyr/rlkit which seems to be similar to the original author's implementation. |
Hello, I have been doing a quick sanity check using the rl zoo. To reproduce, add that to
Gaussian noise:
HalfCheetah-v2:
  n_timesteps: !!float 1e6
  policy: 'MlpPolicy'
  gamma: 0.99
  memory_limit: 1000000
  noise_type: 'normal'
  noise_std: 0.2
  batch_size: 64
  normalize_observations: True
  normalize_returns: False
Param noise:
HalfCheetah-v2:
  n_timesteps: !!float 1e6
  policy: 'LnMlpPolicy'
  gamma: 0.99
  memory_limit: 1000000
  noise_type: 'adaptive-param'
  noise_std: 0.2
  batch_size: 64
  normalize_observations: True
  normalize_returns: False
Command to run (with tensorboard support):
With one random seed and 100k steps on HalfCheetah-v2, I am getting these results (Gaussian noise is orange, param noise is blue):
So it seems something went wrong in the original baselines (stable baselines is based on OpenAI Baselines of last year).
EDIT: to match TD3 paper, you would need to change the network architecture too, using |
OK, unfortunately I did not see that tau is set to 0.01 in the learn function. Of course this overrides all other tau params. But since this tau applies regardless of the type of noise, shouldn't you always set tau to 0.001 manually? In neither the parameter noise setup nor the Gaussian noise setup did you set tau to 0.001 manually, or am I missing something again? But I am not sure if this really solves the problem, since the OU noise setup does not work properly either. |
@MoritzTaylor Right, if this were a research paper, I would always keep tau = 0.001, or always keep it at 0.01 to keep experimental conditions the same while varying just one thing (the noise setting, if that was what I was investigating, but really I was just trying to see what setting would lead to any improvements in performance). I only changed it here because I thought I might as well at least try to see what happens. @araffin Thanks for the results. It does seem like something happened between the commit corresponding to stable-baselines and the commit corresponding to the most recent one on master, that I used to test. Out of curiosity do you have the exact commit for baselines here that corresponds to what stable-baselines uses? |
There were several changes/bug fixes afterward, but we forked it (apparently) from: |
Thanks, maybe something happened between then that caused changes in the environment processing code? From reading OpenAI's DDPG code, I can't find any obvious errors yet. Hmm @araffin just wondering, you report the "exploration policy" right? From looking at the DDPG code, there is an exploration environment by default, and there is a separate evaluation environment which will step in the environment with a deterministic policy (which is what we want all along with DDPG). I wonder if the evaluation policy can still do a good job even if the exploration policy is very bad. My tests with other software packages that have DDPG show that the exploration policy does nearly as well as the evaluation policy so I am not entirely optimistic, but it is something to think about. |
Maybe. In the stable-baselines code, there is no preprocessing for DDPG.
Exactly. This is the behavioral policy, so the deterministic policy + noise in that case.
Yes, the eval env will only be used if you provide it. I did not provide any in my case.
I think it would only in the case of a high amount of noise. Otherwise, the training performance (of the exploration policy) is usually a good proxy for the real performance. |
@DanielTakeshi I think your intuition was good: Line 116 in ba2b017
It seems the normalization is applied twice (and reward normalization is also active by default). |
I commented out the line above. Results look much better, though they're not as good as some published results that I see (e.g., from the TD3 paper figure above). After avoiding the VecNormalize command, I ran the following:
For each of the seven environments, I ran one training run with each of the three noise settings. I then have the following plot, which overlays the three noise settings together for each game to make comparisons easier: Note that these are the exploration environment rewards. I don't have an evaluation environment here. (It's easy to enforce an evaluation environment, and I'm actually testing that now, but there's no single command we can add to the command line to get the evaluation environment set up.) Also, I plot these episode rewards from the 0.0.monitor.csv files that are stored in the log directory, and I smooth with a sliding window of size 20 across the list of episodes.
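The smoothing step described above can be sketched as follows. This is not the plotting script used for the figures. It assumes the monitor file starts with a `#`-prefixed metadata line followed by a CSV header containing an `r` (episode reward) column, and it uses a trailing window, which may differ from the window alignment used in the plots. `load_episode_rewards` and `sliding_mean` are hypothetical names.

```python
import csv


def load_episode_rewards(monitor_path):
    """Read per-episode rewards from a monitor CSV file.

    Assumption: lines starting with '#' are metadata, and the remaining
    lines form a CSV with an 'r' column holding the episode reward.
    """
    with open(monitor_path, newline="") as f:
        rows = csv.DictReader(line for line in f if not line.startswith("#"))
        return [float(row["r"]) for row in rows]


def sliding_mean(values, window=20):
    """Trailing moving average; the first window-1 points average fewer values."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

A typical use would be `sliding_mean(load_episode_rewards(path), window=20)`, with the result passed to whatever plotting library is at hand.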
It looks like parameter space noise might have a slight advantage, but I am not sure I can tell from these limited results. Gaussian noise on the actions seems easier conceptually. Overall these results are giving me much more confidence in the code base. :) And I think with evaluation rewards instead of exploration rewards, the above would be closer to published results for DDPG. |
@DanielTakeshi Did you run any of these benchmarks on vision-based tasks, or know of any results? |
@sritee No, I'm not aware of any standard vision-based DDPG tasks. You'd actually need to change the network design in the DDPG class as well |
@GameHoo |
@DanielTakeshi do you get the same issues if you downgrade mujoco as in openai/gym#1541 ? |
@christopherhesse Apologies for not responding; I never had time to follow up on this. |
Dear @pzhokhov @matthiasplappert @christopherhesse et al.,
Thank you for providing an implementation of DDPG. However, I have been unable to get it to learn well on the standard MuJoCo environments by running the provided command in the README (and with related commands). Here are the steps to reproduce. I apologize for the length of the post, but I want to show what I tried to reduce ambiguity and to counter the potential argument that it might be due to bad hyperparameters.
First, here's the machine I am using with relevant versions of software:
pip install commands. I'm using TensorFlow 1.13, gym 0.12.1, and mujoco-py 2.0.2.2. All appear to be installed correctly and show no signs of error.
Next, here is the set of commands to run. I'm splitting these into three groups based on the three types of noise we can inject into our policy.
Group 1: Parameter Noise
I first decided to take the default command provided in the README because I assumed that hyperparameters here have been tuned to save users the time and compute needed for expensive hyperparameter sweeps.
I use my plotting code to get plots. Here it is:
To use this code, just run python [script].py --path [PATH] --title [TITLE]. Feed in the path to the 0.0.monitor.csv file (i.e., that's the /tmp/openai-[DATE] directory) and some title. I did this for all five environments above and got these results:
None of these curves appear to be getting better than random performance. Maaaaaybe Ant-v2 is getting better than random performance, but it seems to be stuck at 0 and many, many papers report values far above 0. Perhaps it has something to do with the number of environments? I briefly tried increasing the number of parallel environments to 8 but that did not seem to work:
Incidentally, it seems like having N environments means that the actual number of steps increases by a factor of N. This is different behavior from PPO2 where increasing N does not change the number of actual time steps total at the end; increasing N for PPO2 means each individual environment can execute fewer steps.
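The step accounting described above can be written out as a small arithmetic sketch. It encodes only the behavior reported in this post (it has not been checked against the baselines source), and both function names are hypothetical.

```python
def ddpg_total_env_steps(num_timesteps, num_env):
    """Reported DDPG behavior: every one of the num_env environments runs
    num_timesteps steps, so the total grows with num_env."""
    return num_timesteps * num_env


def ppo2_steps_per_env(num_timesteps, num_env):
    """Reported PPO2 behavior: num_timesteps is the overall budget, so each
    environment executes only its share of the steps."""
    return num_timesteps // num_env


if __name__ == "__main__":
    print(ddpg_total_env_steps(1_000_000, 8))  # total steps across all envs
    print(ppo2_steps_per_env(1_000_000, 8))    # steps taken by each env
```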
PS: for some of the above plots, I did not run to exactly 1M steps, i.e., I terminated it near the end if it was clear that the algorithm was not learning well.
Group 2: Gaussian Noise
All right, next I decided to avoid parameter space noise. In the TD3 paper which used DDPG, the authors used Gaussian noise with standard deviation 0.1. I decided to try that, keeping all other settings fixed:
Here are the results:
Once again, it seems like there is no learning happening. The performance appears to be similar to the parameter space noise case.
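For reference, Gaussian action noise of the kind used in this group can be sketched as below. This is an illustration in plain Python, not the baselines implementation. `add_gaussian_noise` is a hypothetical name, and the clipping range of [-1, 1] is an assumption about the action bounds.

```python
import random


def add_gaussian_noise(action, std=0.1, low=-1.0, high=1.0, rng=random):
    """Perturb each action dimension with independent N(0, std^2) noise and
    clip the result to [low, high] (assumed action bounds)."""
    noisy = [a + rng.gauss(0.0, std) for a in action]
    return [min(high, max(low, a)) for a in noisy]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same noise
    print(add_gaussian_noise([0.5, -0.9, 0.0], std=0.1, rng=rng))
```

The noise is drawn independently at every step, which is the main conceptual difference from the temporally correlated OU noise in Group 3.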
Group 3: OU Noise (along with tau=0.001)
I decided to run one last batch of commands, this time with the original OU noise. After carefully checking the TD3 paper, and the DDPG directory from the July 27, 2017 commit when DDPG was first released, I saw that the tau parameter back then was set at 0.001. Now for some reason it is 0.01. DeepMind used 0.001 so I decided to try OU noise with tau 0.001. This appears to be the only hyperparameter difference that I can see between this code base and the values used by DeepMind.
Results:
(The swimmer curve looks like it's going up, but the reward is lower as compared to the other two plots.)
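As a reminder of what OU noise does, here is a minimal sketch of a discretized Ornstein-Uhlenbeck process in plain Python. It is not the baselines class. The name `OUNoise` is hypothetical, and the defaults (theta = 0.15, sigma = 0.2, dt = 0.01) are assumed illustrative values rather than settings confirmed by this thread.

```python
import math
import random


class OUNoise:
    """Temporally correlated noise:
    x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, size, mu=0.0, sigma=0.2, theta=0.15, dt=1e-2, seed=None):
        self.size = size
        self.mu = mu
        self.sigma = sigma
        self.theta = theta
        self.dt = dt
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean (e.g. at the start of an episode)."""
        self.state = [self.mu] * self.size

    def sample(self):
        """Advance the process one step and return the new noise vector."""
        self.state = [
            x
            + self.theta * (self.mu - x) * self.dt
            + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)
```

Note that the tau discussed in this group is the target-network update coefficient and is unrelated to the OU process parameters.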
The results I am getting seem to differ from the blog post here which shows HalfCheetah rewards of at least +1500, and much larger depending on the parameter noise setting, and for 2M steps. It might be a hyperparameter issue, but I'm not sure. In particular, notice that the hyperparameters here (for the most part) match those from the DDPG or TD3 papers.
The TD3 paper reports these results:
The TD3 paper says it used DDPG (presumably from OpenAI baselines as of late 2017?) and then "Our DDPG" above is when the author tuned hyperparameters. Both get final rewards that are far higher than what I am seeing, and we are all using 1M training steps here. Unfortunately, from reading the TD3 code base, it is not clear which commit from baselines was used for the results.
The paper above does not report results for Swimmer, so I looked at the "Benchmarking DeepRL" paper, which says DDPG on Swimmer should get 85 +/- 1.8, and this is far higher than the Swimmer results I am getting above.
I suspect that there must have been some change to the code that caused it to either stop working well or become exorbitantly sensitive to hyperparameters. For example, maybe the process of removing MPI caused some unexpected results? Or it could be due to the change from MuJoCo environments v1 to v2, since the TD3 paper used MuJoCo v1 environments, but as this report suggests, RL performance should be similar. Notice that all the reward curves there for PPO show increasing reward, whereas I'm just seeing stagnation and noise for DDPG.
This is perhaps relevant to the following issue reports:
all of which have noticed issues with DDPG. If the fix is found, then the above can probably all be closed.
Hopefully in the spirit of my previous report on DQN here, we can resolve this issue together. Does anyone have any general suggestions or ideas about the potential causes? At this point I am unable to confidently use the DDPG code because it does not pass standard benchmarks. My previous issue report about DQN suggests that it could be an environment processing issue. Is the code processing the MuJoCo environments in a similar way as in July 2017? Do the PPO2 results appear to be fine, but the DDPG results off? Is there a difference in how the two algorithms process observations and normalize data?
I'm happy to help investigate this if you have ideas on what might be the root cause. I only report this issue because having highly tuned algorithms and hyper-parameters ready to go "off-the-shelf" greatly helps the entire research community by accelerating research cycles and reducing the need to write our own error-prone implementations of algorithms.
Thanks!