
About the Implementation of GRPO #2681

Open
macheng6 opened this issue Jan 29, 2025 · 1 comment
Labels: 🏋 GRPO (Related to GRPO), ❓ question (Seeking clarification or more information)

Comments

@macheng6

Is the result of `x - x.detach()` all zero?

@qgallouedec (Member)

So it's more of a torch question, but I guess you are referring to

```python
# x - x.detach() allows for preserving gradients from x
per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
```

Yes, the result is zero, but you can't discard it: you need the gradient. See #2565 (comment).
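
For illustration, here is a minimal standalone sketch (not the TRL code; the tensor names and values are made up) showing the point: `exp(logp - logp.detach())` evaluates to 1 in the forward pass, so the loss is numerically just the advantages, but the expression still carries a gradient with respect to `logp`:

```python
import torch

# Hypothetical per-token log-probs and advantages, just for demonstration
logp = torch.tensor([-1.0, -2.0], requires_grad=True)
advantages = torch.tensor([0.5, 1.5])

# Forward pass: exp(logp - logp.detach()) == exp(0) == 1 everywhere,
# so the loss value is simply -sum(advantages)
ratio = torch.exp(logp - logp.detach())
loss = -(ratio * advantages).sum()
loss.backward()

# Backward pass: d/dlogp exp(logp - logp.detach()) = exp(0) * 1 = 1,
# so each entry of logp receives -advantage as its gradient
print(ratio)      # tensor([1., 1.], grad_fn=<ExpBackward0>)
print(logp.grad)  # tensor([-0.5000, -1.5000])
```

In other words, `x - x.detach()` is zero as a value but acts as the identity in the backward pass, which is exactly why the term can't be dropped from the loss.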
