I am new to the field of RL for LLMs, and I was wondering whether it is possible to use GRPO to perform RL on LLM-based agent trajectories where the reward function is an ORM over the full trajectory (i.e., sparse)? In this case, the generated data for a task would not just be a single answer that can be evaluated, but rather a multi-turn conversation, where each new observation arrives in the form of the next prompt.
It would look like this instead:
```json
{
  "trajectory_id": "trajectory_1",
  "steps": [
    {
      "prompt": "task: search for cat images \naction_history: none \ncurrent_screenshot: <image_step_1.png>",
      "agent_response": "navigate to google.com"
    },
    {
      "prompt": "task: search for cat images \naction_history: navigate to google.com \ncurrent_screenshot: <image_step_2.png>",
      "agent_response": "click on the search bar"
    }
    // ... more steps needed for task completion
  ],
  "reward": 1.0  // ORM-based reward
}
```
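If I understand GRPO correctly, the advantage would then be group-relative over the k trajectories sampled for the same task, and since the ORM reward only exists at trajectory level, every step of a trajectory would inherit the same advantage. A minimal sketch of what I mean (structure and names are my own assumptions):

```python
import statistics

# Sketch (my own names): GRPO-style group-relative advantages over k
# trajectories sampled for the same task. The sparse ORM reward lives at
# trajectory level, so each step of a trajectory inherits its advantage.
def trajectory_advantages(trajectories):
    rewards = [t["reward"] for t in trajectories]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]
```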
A naive approach I can think of would be:
1. For each task, sample k completions.
2. Filter for the (shortest) successful trajectories.
3. Generate a dataset for GRPO by converting the prompt at each timestep into a separate question (see the sketch below).
4. Perform GRPO with this dataset.
Since this would, again, involve sampling n times for each question, it seems very resource-intensive.
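To make step 3 concrete, here is a minimal sketch of the conversion I have in mind (the function name and success threshold are my own assumptions; the field names match the JSON above):

```python
# Sketch of step 3 (hypothetical names): flatten successful trajectories
# into per-step (prompt, response) examples, so that each timestep becomes
# an independent question for GRPO.
def trajectories_to_grpo_dataset(trajectories, success_threshold=1.0):
    dataset = []
    for traj in trajectories:
        if traj["reward"] < success_threshold:  # keep only successful runs
            continue
        for step in traj["steps"]:
            # The prompt already encodes the task, action history, and
            # current observation, so each step is self-contained.
            dataset.append({
                "prompt": step["prompt"],
                "response": step["agent_response"],
            })
    return dataset
```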
I would appreciate any input on that! :)