👨‍👨‍👧‍👧 GRPO #2565
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
From the paper:

$$\mathcal{J}_{GRPO}(\theta)=\mathbb{E}_{q\sim P(Q),\,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{old}}(O|q)}\;\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left\{\min\!\left[\frac{\pi_\theta(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}\hat{A}_{i,t},\;\operatorname{clip}\!\left(\frac{\pi_\theta(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})},1-\epsilon,1+\epsilon\right)\hat{A}_{i,t}\right]-\beta\,\mathbb{D}_{KL}\!\left[\pi_\theta\,\|\,\pi_{ref}\right]\right\}$$

where:

$$\mathbb{D}_{KL}\!\left[\pi_\theta\,\|\,\pi_{ref}\right]=\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_\theta(o_{i,t}|q,o_{i,<t})}-\log\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_\theta(o_{i,t}|q,o_{i,<t})}-1$$

In Section 4.2 we can read that the policy model only has a single update following each exploration stage. It implies that $\pi_{\theta_{old}}=\pi_\theta$, so the probability ratio always equals 1 and clipping never applies. From 4.1.2, we know that $\hat{A}_{i,t}=\frac{r_i-\operatorname{mean}(\mathbf{r})}{\operatorname{std}(\mathbf{r})}$ does not depend on $\theta$. As a result, the GRPO objective just minimizes the KL divergence between the policy model and the reference policy.
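To make the concern concrete, here is a minimal PyTorch sketch (hypothetical tensor names and shapes, not the TRL API): if the ratio is literally simplified to the constant 1, the advantage term carries no gradient and only the KL term can update the policy.

```python
# Minimal sketch of the concern above (hypothetical shapes and names, not the TRL API).
import torch

G, T = 4, 8                                               # 4 completions, 8 tokens each
per_token_logps = torch.randn(G, T, requires_grad=True)   # log pi_theta(o_{i,t} | q, o_{i,<t})
ref_per_token_logps = torch.randn(G, T)                   # log pi_ref(o_{i,t} | q, o_{i,<t})
advantages = torch.randn(G, 1)                            # group-normalized rewards, independent of theta

# If pi_old == pi_theta and the ratio is simplified to the constant 1,
# the advantage term no longer depends on theta at all:
policy_term = 1.0 * advantages                            # zero gradient w.r.t. theta

# Per-token KL estimator from the paper: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1
per_token_kl = torch.exp(ref_per_token_logps - per_token_logps) \
    - (ref_per_token_logps - per_token_logps) - 1

beta = 0.04                                               # example KL coefficient
loss = -(policy_term - beta * per_token_kl).mean()
loss.backward()
# per_token_logps.grad is driven entirely by the KL term, which is the point of the question.
```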
@ZhihongShao if you can help by any chance
I have the answer to the question above. The math is correct: if you look at the loss, its value is indeed equal to the KL term. However, in terms of differentiation, we cannot remove the ratio $\frac{\pi_\theta(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q,o_{i,<t})}$ by justifying that it equals 1. Let's denote the stop-gradient operator as $[\cdot]_{sg}$. We must retain the term in the equation, rewriting it as

$$\frac{\pi_\theta(o_{i,t}|q,o_{i,<t})}{\left[\pi_\theta(o_{i,t}|q,o_{i,<t})\right]_{sg}}$$

Finally, the objective is written as

$$\mathcal{J}_{GRPO}(\theta)=\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left\{\frac{\pi_\theta(o_{i,t}|q,o_{i,<t})}{\left[\pi_\theta(o_{i,t}|q,o_{i,<t})\right]_{sg}}\hat{A}_{i,t}-\beta\,\mathbb{D}_{KL}\!\left[\pi_\theta\,\|\,\pi_{ref}\right]\right\}$$

In the end, the value remains equal to the KL divergence as initially stated. However, when implemented in this way, the gradient can propagate through the equation, allowing the policy to update effectively. Thanks @edbeeching for helping me with this!!!
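As a rough illustration of this stop-gradient rewrite, here is a minimal PyTorch sketch (hypothetical tensor names and shapes; it mirrors the idea above rather than the exact code in this PR):

```python
# Sketch of the stop-gradient rewrite described above (hypothetical names/shapes).
import torch

G, T = 4, 8
per_token_logps = torch.randn(G, T, requires_grad=True)   # log pi_theta
ref_per_token_logps = torch.randn(G, T)                    # log pi_ref
advantages = torch.randn(G, 1)                             # \hat{A}_{i,t}, constant w.r.t. theta

# pi_theta / sg[pi_theta], computed in log space: its value is exactly 1,
# but its gradient w.r.t. theta is d log pi_theta / d theta.
ratio = torch.exp(per_token_logps - per_token_logps.detach())

per_token_kl = torch.exp(ref_per_token_logps - per_token_logps) \
    - (ref_per_token_logps - per_token_logps) - 1

beta = 0.04                                                # example KL coefficient
per_token_loss = -(ratio * advantages - beta * per_token_kl)
per_token_loss.mean().backward()
# The loss value is identical to the "ratio = 1" version, but the advantage term
# now contributes advantages * d log pi_theta, so the policy actually updates.
```

The design point is that `exp(x - x.detach())` evaluates to 1 in the forward pass while still exposing the gradient of `log pi_theta` in the backward pass.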
What does this PR do?
Fixes # (issue)
Before submitting
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.