[Feature Request] Implement GRPO, PPO and potentially other policy gradient methods to finetune LM Agents #1528

apokryphosx · 2025-01-30T16:03:24Z

Required prerequisites

I have searched the Issue Tracker and Discussions and confirmed that this hasn't already been reported. (+1 or comment there if it has.)
Consider asking first in a Discussion.

Motivation

After the recent success of DeepSeek-R1, leveraging synthetic data with RL is vital on the path to more capable models. Particularly interesting is the idea of rule-based reward signals. A pipeline to finetune an LM in CAMEL with RL would be a great addition.

Solution

No response

Alternatives

No response

Additional context

I would like to work on this issue. If it's possible, please assign it to me.

GitHoobar · 2025-01-30T19:45:40Z

This looks good! Happy to help.

lightaime · 2025-01-31T16:07:59Z

There are some projects we can look into for implementing this:

veRL: https://github.com/volcengine/verl
OpenRLHF: https://github.com/OpenRLHF/OpenRLHF
TinyZero (they use veRL): https://github.com/Jiayi-Pan/TinyZero
simpleRL-reason (they use OpenRLHF): https://github.com/hkust-nlp/simpleRL-reason
Open-R1: https://github.com/huggingface/open-r1

apokryphosx added the enhancement label Jan 30, 2025

apokryphosx linked a pull request Feb 6, 2025 that will close this issue

docs: Created a cookbook that walks you through finetuning a Model with GRPO #1559

Open

8 tasks

Wendong-Fan added the New Feature label and removed the enhancement label Feb 6, 2025

Wendong-Fan assigned apokryphosx Feb 6, 2025

Wendong-Fan linked a pull request Feb 6, 2025 that will close this issue

docs: Created a cookbook that walks you through finetuning a Model with GRPO #1559

Open

8 tasks

Wendong-Fan added this to Project Camel Feb 6, 2025

Wendong-Fan added this to the Sprint 22 milestone Feb 6, 2025
