
A2C fix #160

Closed
KohlerHECTOR opened this issue Apr 8, 2022 · 3 comments

@KohlerHECTOR (Collaborator):

# optimize policy for K epochs
for _ in range(self.k_epochs):
    # evaluate old actions and values
    action_dist = self.cat_policy(old_states)
    logprobs = action_dist.log_prob(old_actions)
    state_values = torch.squeeze(self.value_net(old_states))
    dist_entropy = action_dist.entropy()
    # normalize the advantages
    advantages = rewards - state_values.detach()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # find pg loss
    pg_loss = -logprobs * advantages
    loss = (
        pg_loss
        + 0.5 * self.MseLoss(state_values, rewards)
        - self.entr_coef * dist_entropy
    )
    # take gradient step
    self.policy_optimizer.zero_grad()
    self.value_optimizer.zero_grad()
    loss.mean().backward()
    self.policy_optimizer.step()
    self.value_optimizer.step()

It seems there is a useless for loop in the main training loop of A2C that is not in the original algorithm. After removing the loop, the performance of the A2C agent matches stable-baselines' A2C. @riccardodv and I think the problem is that this loop makes the agent take repeated gradient steps in the same direction for k_epochs iterations.
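
For reference, here is a minimal sketch of what the update could look like with the loop removed, i.e. a single gradient step per collected batch as in the original A2C. It reuses the attribute names from the snippet above; the helper name and signature are hypothetical, not rlberry's actual API.

import torch

# Minimal sketch (not rlberry's actual code) of the update with the
# k_epochs loop removed: one gradient step per collected batch, as in
# the original A2C. Attribute names are reused from the snippet above;
# the method name and argument list are hypothetical.
def _update_on_batch(self, old_states, old_actions, rewards):
    # evaluate actions and values once on the collected batch
    action_dist = self.cat_policy(old_states)
    logprobs = action_dist.log_prob(old_actions)
    state_values = torch.squeeze(self.value_net(old_states))
    dist_entropy = action_dist.entropy()
    # normalize the advantages
    advantages = rewards - state_values.detach()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # policy-gradient loss + value loss - entropy bonus
    pg_loss = -logprobs * advantages
    loss = (
        pg_loss
        + 0.5 * self.MseLoss(state_values, rewards)
        - self.entr_coef * dist_entropy
    )
    # single gradient step on this batch
    self.policy_optimizer.zero_grad()
    self.value_optimizer.zero_grad()
    loss.mean().backward()
    self.policy_optimizer.step()
    self.value_optimizer.step()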

@mmcenta added the bug and question labels on Apr 8, 2022
@yfletberliac (Member):

Good catch!
It's also interesting to see that A2C does better without multiple gradient steps, whereas PPO does better with them.

@mmcenta (Collaborator) commented Apr 12, 2022:

> Good catch! It's also interesting to see that A2C does better without multiple gradient steps, whereas PPO does better with them.

It may be because PPO adds the constraint that the policy must be close to the one used for data collection.
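
For illustration, here is a rough sketch of PPO's clipped surrogate loss (placeholder names, not rlberry's implementation): clipping the probability ratio between the current policy and the data-collecting policy means that repeated epochs on the same batch cannot push the policy far from the one that collected the data, which is why PPO tolerates multiple gradient steps better than A2C.

import torch

# Hypothetical sketch of PPO's clipped surrogate loss, to illustrate the
# point above. Tensor names are placeholders, not rlberry's API.
def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # probability ratio between the current policy and the one used
    # to collect the data
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # taking the element-wise minimum removes any incentive to move the
    # ratio outside [1 - clip_eps, 1 + clip_eps]
    return -torch.min(unclipped, clipped).mean()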

@KohlerHECTOR Any updates on this? I've seen on Slack that it might have been a hyperparameter issue?

@KohlerHECTOR (Collaborator, Author):

Yes, I agree with @mmcenta: PPO does not have the same objective function as A2C. I think the poor performance of A2C was due to the repeated gradient steps. Another point is that rlberry's A2C default hyperparameters are not the same as SB3's, but that is not an issue.
