I am new to the field of RL for LLMs, and I was wondering whether it is possible to use GRPO to perform RL on LLM-based agent trajectories where the reward function is an ORM over the full trajectory (i.e., sparse)? In this case, the generated data for a task would not just be a single answer that can be evaluated, but rather a multi-turn conversation, where each new observation arrives in the form of the next prompt.
It would look like this instead:
```json
{
  "trajectory_id": "trajectory_1",
  "steps": [
    {
      "prompt": "task: search for cat images \naction_history: none \ncurrent_screenshot: <image_step_1.png>",
      "agent_response": "navigate to google.com"
    },
    {
      "prompt": "task: search for cat images \naction_history: navigate to google.com \ncurrent_screenshot: <image_step_2.png>",
      "agent_response": "click on the search bar"
    }
    // ... more steps needed for task completion
  ],
  "reward": 1.0  // ORM-based reward
}
```
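If I understand GRPO correctly, the advantage would then be group-relative over the k trajectories sampled for the same task, and since the ORM reward only exists at trajectory level, every step of a trajectory would inherit the same advantage. A minimal sketch of what I mean (structure and names are my own assumptions):

```python
import statistics

# Sketch (my own names): GRPO-style group-relative advantages over k
# trajectories sampled for the same task. The sparse ORM reward lives at
# trajectory level, so each step of a trajectory inherits its advantage.
def trajectory_advantages(trajectories):
    rewards = [t["reward"] for t in trajectories]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]
```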
A naive approach I can think of would be:
1. For each task, sample k completions.
2. Filter for the (shortest) successful trajectories.
3. Generate a dataset for GRPO by converting the prompt at each timestep into a separate question (see the sketch below).
4. Perform GRPO with this dataset.
Since this would, again, involve sampling n times for each question, it seems very resource-intensive.
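To make step 3 concrete, here is a minimal sketch of the conversion I have in mind (the function name and success threshold are my own assumptions; the field names match the JSON above):

```python
# Sketch of step 3 (hypothetical names): flatten successful trajectories
# into per-step (prompt, response) examples, so that each timestep becomes
# an independent question for GRPO.
def trajectories_to_grpo_dataset(trajectories, success_threshold=1.0):
    dataset = []
    for traj in trajectories:
        if traj["reward"] < success_threshold:  # keep only successful runs
            continue
        for step in traj["steps"]:
            # The prompt already encodes the task, action history, and
            # current observation, so each step is self-contained.
            dataset.append({
                "prompt": step["prompt"],
                "response": step["agent_response"],
            })
    return dataset
```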
I would appreciate any input on that! :)