
A2C fix #160

Closed
KohlerHECTOR opened this issue Apr 8, 2022 · 3 comments

@KohlerHECTOR (Collaborator):

# optimize policy for K epochs
for _ in range(self.k_epochs):
    # evaluate old actions and values
    action_dist = self.cat_policy(old_states)
    logprobs = action_dist.log_prob(old_actions)
    state_values = torch.squeeze(self.value_net(old_states))
    dist_entropy = action_dist.entropy()
    # normalize the advantages
    advantages = rewards - state_values.detach()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # find pg loss
    pg_loss = -logprobs * advantages
    loss = (
        pg_loss
        + 0.5 * self.MseLoss(state_values, rewards)
        - self.entr_coef * dist_entropy
    )
    # take gradient step
    self.policy_optimizer.zero_grad()
    self.value_optimizer.zero_grad()
    loss.mean().backward()
    self.policy_optimizer.step()
    self.value_optimizer.step()

It seems there is a useless for loop in the main training loop of A2C that is not in the original algorithm. After removing the loop, the performance of the A2C agent matches stable-baselines' A2C. @riccardodv and I think the problem is that this loop makes the agent take repeated gradient steps in the same direction for k_epochs iterations.
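
For reference, here is a minimal sketch of what the update could look like with the loop removed, i.e. a single gradient step per collected batch as in the original A2C. It reuses the attribute names from the snippet above; the helper name and signature are hypothetical, not rlberry's actual API.

import torch

# Minimal sketch (not rlberry's actual code) of the update with the
# k_epochs loop removed: one gradient step per collected batch, as in
# the original A2C. Attribute names are reused from the snippet above;
# the method name and argument list are hypothetical.
def _update_on_batch(self, old_states, old_actions, rewards):
    # evaluate actions and values once on the collected batch
    action_dist = self.cat_policy(old_states)
    logprobs = action_dist.log_prob(old_actions)
    state_values = torch.squeeze(self.value_net(old_states))
    dist_entropy = action_dist.entropy()
    # normalize the advantages
    advantages = rewards - state_values.detach()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    # policy-gradient loss + value loss - entropy bonus
    pg_loss = -logprobs * advantages
    loss = (
        pg_loss
        + 0.5 * self.MseLoss(state_values, rewards)
        - self.entr_coef * dist_entropy
    )
    # single gradient step on this batch
    self.policy_optimizer.zero_grad()
    self.value_optimizer.zero_grad()
    loss.mean().backward()
    self.policy_optimizer.step()
    self.value_optimizer.step()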

@mmcenta added the bug and question labels on Apr 8, 2022
@yfletberliac (Member):

Good catch!
It's also interesting to see that A2C does better without multiple gradient steps, whereas PPO does better with them.

@mmcenta (Collaborator) commented Apr 12, 2022:

> Good catch! It's also interesting to see that A2C does better without multiple gradient steps, whereas PPO does better with them.

It may be because PPO adds the constraint that the policy must be close to the one used for data collection.
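
For illustration, here is a rough sketch of PPO's clipped surrogate loss (placeholder names, not rlberry's implementation): clipping the probability ratio between the current policy and the data-collecting policy means that repeated epochs on the same batch cannot push the policy far from the one that collected the data, which is why PPO tolerates multiple gradient steps better than A2C.

import torch

# Hypothetical sketch of PPO's clipped surrogate loss, to illustrate the
# point above. Tensor names are placeholders, not rlberry's API.
def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # probability ratio between the current policy and the one used
    # to collect the data
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # taking the element-wise minimum removes any incentive to move the
    # ratio outside [1 - clip_eps, 1 + clip_eps]
    return -torch.min(unclipped, clipped).mean()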

@KohlerHECTOR Any updates on this? I've seen on Slack that it might have been a hyperparameter issue?

@KohlerHECTOR (Collaborator, Author):

Yes, I agree with @mmcenta: PPO does not have the same objective function as A2C. I think the poor performance of A2C was due to the repeated gradient steps. Another point is that rlberry's A2C default hyperparameters are not the same as SB3's, but that is not an issue.
