
Clarification on Reward Usage in DPO Training #33

Open
vincezh2000 opened this issue Sep 16, 2024 · 1 comment

Comments

@vincezh2000

In the RLHF Workflow paper, the Reward Model is used to annotate new data generated by the LLM during the iterative DPO process, producing scalar reward values. According to Algorithm 1, the traditional RM+RLHF process incorporates these scalar values directly into the loss function, so a reward r of 8 versus 80 changes the magnitude of the update accordingly.
[image: Algorithm 1 from the paper]
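To illustrate what I mean, here is a rough sketch (not the repo's code; the β value and log-probabilities are made up) of a KL-regularized reward as used in PPO-style RLHF, where the scalar magnitude matters:

```python
import torch

# Illustrative sketch: in PPO-style RLHF the scalar reward enters the
# optimization target directly, e.g. via a KL-regularized reward
#   r(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x)),
# so a reward of 8 versus 80 leads to a very different update signal.
def kl_regularized_reward(r_scalar, logp_policy, logp_ref, beta=0.1):
    return r_scalar - beta * (logp_policy - logp_ref)

logp_policy = torch.tensor(-42.0)  # log pi_theta(y|x), made-up value
logp_ref = torch.tensor(-45.0)     # log pi_ref(y|x), made-up value
print(kl_regularized_reward(8.0, logp_policy, logp_ref))   # tensor(7.7000)
print(kl_regularized_reward(80.0, logp_policy, logp_ref))  # tensor(79.7000)
```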

However, with the DPO method the training objective is the DPO loss, which never uses the reward scalar explicitly. It seems the only information used is the preference itself, i.e., that A is preferred over B. The paper does not give specific details on how this is handled.
[image: DPO loss objective from the paper]
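Whereas the DPO objective, as I understand it, only sees which response is chosen. A rough sketch (variable names and β are illustrative, not the repo's implementation):

```python
import torch.nn.functional as F

# Sketch of the standard DPO loss (Rafailov et al., 2023): the reward
# model's scalar never appears; the only preference information is
# which response is chosen and which is rejected.
def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```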

My question is: if we use the ArmoRM model with the iterative DPO method, will training still use only the information about which response scores higher, rather than the actual scalar reward values? And is labeling the preference pairs enough to fully utilize a multi-objective RM?

@WeiXiongUST
Collaborator

Yes, DPO does not leverage the absolute value of the reward, only the ranking information. This is also natural for semi-supervised learning, where we use only a hard binary signal (win vs. lose) to reduce noise.
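In other words, the labeling step in iterative DPO can be sketched roughly as below (`rm_score` is a placeholder for whatever scalar the reward model returns, not an actual ArmoRM API; only the comparison is kept):

```python
# Hypothetical labeling step: the reward model scores each candidate,
# but only the ranking (win vs. lose) survives into DPO training.
def label_preference_pair(rm_score, prompt, response_a, response_b):
    score_a = rm_score(prompt, response_a)
    score_b = rm_score(prompt, response_b)
    if score_a >= score_b:
        return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
    return {"prompt": prompt, "chosen": response_b, "rejected": response_a}
```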
