It seems there is a useless for loop in the main training loop of A2C that is not in the original algorithm. After removing the loop, the performance of the A2C agent matches stable-baselines' A2C. @riccardodv and I think the problem is that this loop makes the agent repeat gradient steps in the same direction for k_epochs iterations on the same rollout.
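For illustration, here is a minimal sketch (not the actual rlberry code; `policy`, `value_fn`, `optimizer` and the tensor arguments are placeholders) of what an A2C update wrapped in a k_epochs loop looks like. The original A2C algorithm corresponds to `k_epochs=1`, i.e. a single gradient step per collected rollout; the extra loop just re-applies (roughly) the same gradient direction on the same batch.

```python
import torch
import torch.nn.functional as F

def a2c_update(policy, value_fn, optimizer, states, actions, returns, k_epochs=1):
    # Original A2C: k_epochs == 1, one gradient step per rollout.
    # The loop below is the one the issue suggests removing.
    for _ in range(k_epochs):
        dist = policy(states)                    # action distribution for the rollout states
        log_probs = dist.log_prob(actions)
        values = value_fn(states).squeeze(-1)
        advantages = returns - values.detach()   # advantage estimates

        policy_loss = -(log_probs * advantages).mean()
        value_loss = F.mse_loss(values, returns)
        loss = policy_loss + 0.5 * value_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```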
Yes, I agree with @mmcenta: PPO does not have the same objective function as A2C, and I think the poor performance of A2C was due to the repeated gradient steps. A separate point is that the rlberry A2C default hyperparameters are not the same as SB3's, but that is not an issue.
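To make the objective difference concrete, here is an illustrative sketch (not library code): PPO's clipped surrogate uses an importance ratio with respect to the behaviour policy, which is what makes several epochs of updates on the same rollout well-behaved; the plain A2C policy-gradient loss has no such mechanism, so repeating the step just pushes in the same direction again.

```python
import torch

def a2c_policy_loss(log_probs, advantages):
    # A2C: vanilla policy-gradient loss, intended for a single update per rollout.
    return -(log_probs * advantages).mean()

def ppo_policy_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    # PPO: clipped surrogate, designed to be optimized for several epochs
    # over the same rollout; the ratio and clipping bound the policy change.
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```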
rlberry/rlberry/agents/torch/a2c/a2c.py, lines 246 to 273 at commit 8168dfc