GRPO for RL on agent trajectories #2715

Open
korbinian-hoermann opened this issue Jan 31, 2025 · 1 comment
Labels
🏋 GRPO Related to GRPO 🏋 Reward Related to Reward modelling

Comments


korbinian-hoermann commented Jan 31, 2025

Hi,

I am new to the field of RL for LLMs, and I was wondering whether GRPO can be used for RL on LLM-based agent trajectories where the reward comes from an ORM (outcome reward model) applied to the full trajectory, so the reward is sparse. In this case, the generated data for a task would not be a single answer that can be evaluated, but a multi-turn conversation in which each new observation arrives in the form of the next prompt.

It would look like this instead:

{
    "trajectory_id": "trajectory_1",
    "steps": [
      {
        "prompt": "task: search for cat images \naction_history: none \ncurrent_screenshot: <image_step_1.png>",
        "agent_response": "navigate to google.com"
      },
      {
        "prompt": "task: search for cat images \naction_history: navigate to google.com \ncurrent_screenshot: <image_step_2.png>",
        "agent_response": "click on the search bar"
      },
      // ... more steps needed for task completion
    ],
    "reward": 1.0  // ORM-based reward
  },
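To make the sparse-reward setup concrete, here is a minimal sketch of how a single trajectory-level reward could be turned into a group-relative advantage and shared by every step of that trajectory. It assumes the trajectory layout shown above; the function names, the `eps` constant, and the use of the population standard deviation are my own assumptions, not taken from any library.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_steps(trajectories, advantages):
    """Attach each trajectory's advantage to every one of its steps."""
    samples = []
    for traj, adv in zip(trajectories, advantages):
        for step in traj["steps"]:
            samples.append({
                "prompt": step["prompt"],
                "completion": step["agent_response"],
                "advantage": adv,  # same value for all steps of one trajectory
            })
    return samples

# Two hypothetical rollouts for the same task, scored once at the end.
group = [
    {"trajectory_id": "trajectory_1", "reward": 1.0,
     "steps": [{"prompt": "p1", "agent_response": "navigate to google.com"},
               {"prompt": "p2", "agent_response": "click on the search bar"}]},
    {"trajectory_id": "trajectory_2", "reward": 0.0,
     "steps": [{"prompt": "p1", "agent_response": "navigate to example.com"}]},
]
advantages = group_relative_advantages([t["reward"] for t in group])
step_samples = broadcast_to_steps(group, advantages)
```

The successful rollout ends up with a positive advantage on both of its steps and the failed one with a negative advantage, so no per-step reward is needed in this sketch.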

A naive approach I can think of would be:

  1. For each task, sample k trajectories
  2. Keep only the (shortest) successful trajectories
  3. Build a dataset for GRPO by turning the prompt at each timestep into a separate question
  4. Run GRPO on this dataset.
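Steps 2 and 3 of the list above could be sketched roughly as follows. This is only an illustration: the `task_id` field, the `success_threshold` parameter, and both function names are hypothetical and do not appear in the trajectory format shown earlier.

```python
def shortest_successful(trajectories, success_threshold=1.0):
    """Step 2: per task, keep the shortest trajectory whose reward meets the threshold."""
    best = {}
    for traj in trajectories:
        if traj["reward"] < success_threshold:
            continue  # drop failed rollouts
        task = traj["task_id"]  # hypothetical field identifying the task
        if task not in best or len(traj["steps"]) < len(best[task]["steps"]):
            best[task] = traj
    return list(best.values())

def to_stepwise_prompts(trajectories):
    """Step 3: turn every timestep prompt into its own dataset row."""
    return [
        {"prompt": step["prompt"],
         "trajectory_id": traj["trajectory_id"],
         "step_index": i}
        for traj in trajectories
        for i, step in enumerate(traj["steps"])
    ]

# Three hypothetical rollouts (step 1) for a single task.
rollouts = [
    {"trajectory_id": "t1", "task_id": "cats", "reward": 1.0,
     "steps": [{"prompt": "a1"}, {"prompt": "a2"}, {"prompt": "a3"}]},
    {"trajectory_id": "t2", "task_id": "cats", "reward": 1.0,
     "steps": [{"prompt": "b1"}, {"prompt": "b2"}]},
    {"trajectory_id": "t3", "task_id": "cats", "reward": 0.0,
     "steps": [{"prompt": "c1"}]},
]
kept = shortest_successful(rollouts)
dataset = to_stepwise_prompts(kept)
```

Each row of `dataset` would then serve as one question for step 4. Note that this sketch discards the original agent responses, since GRPO would sample fresh completions for every prompt.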

Since this would again involve sampling n completions for each question, on top of the k trajectories already sampled per task, it seems very resource-intensive.
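A rough back-of-the-envelope count of model calls, as a sketch: it assumes one kept trajectory per task and a fixed number of steps per trajectory, and all numbers in the example are made up for illustration.

```python
def generation_budget(num_tasks, k, steps_per_trajectory, n):
    """Count generation calls for the two sampling phases of the naive approach."""
    # Phase 1: k rollouts per task, one generation per step.
    rollout_calls = num_tasks * k * steps_per_trajectory
    # Phase 2: one kept trajectory per task (assumption), n samples per step prompt.
    grpo_calls = num_tasks * steps_per_trajectory * n
    return {
        "rollout_calls": rollout_calls,
        "grpo_calls": grpo_calls,
        "total": rollout_calls + grpo_calls,
    }

# Illustrative numbers only: 100 tasks, k = 8 rollouts, 10 steps each, n = 8 samples.
budget = generation_budget(num_tasks=100, k=8, steps_per_trajectory=10, n=8)
```

With these numbers the second sampling phase costs as many generations as the initial rollouts, which is what makes the approach feel expensive.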

I would appreciate any input on that! :)

@github-actions github-actions bot added 🏋 GRPO Related to GRPO 🏋 Reward Related to Reward modelling labels Jan 31, 2025
@August-murr
Collaborator

August-murr commented Jan 31, 2025

#2723
