In the RLHF Workflow paper, the reward model is used to annotate new data generated by the LLM during the iterative DPO process, producing scalar reward values. According to Algorithm 1, the traditional RM+RLHF pipeline incorporates these scalars directly into the loss function, so a reward of r = 8 versus r = 80 leads to different outcomes.

However, with the DPO method, the training objective is the DPO loss, which never uses the reward scalar explicitly; the only information that enters is the preference that response A is preferred over response B. The paper does not give specific details on how this is handled.
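For reference, the standard per-pair DPO objective only sees which response is the chosen one ($y_w$) and which is the rejected one ($y_l$), not their reward scores:

$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)$$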
My question is: if we use the ArmoRM model for training with the iterative DPO method, does training still use only the information about which score is higher, rather than the actual scalar reward values? Is labeling the preference pairs with it sufficient to fully utilize the multi-objective RM?
Yes, DPO does not leverage the absolute value of the reward, only the ranking information. This is also natural for semi-supervised learning, where we use only a hard binary signal (win vs. lose) to reduce noise.
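To make that concrete, here is a minimal sketch (not the actual training code from this repo; the helper names are made up for illustration): the RM scores are used only to decide which response becomes "chosen" and which becomes "rejected", and the DPO loss itself sees only the log-probabilities of that pair.

```python
import torch
import torch.nn.functional as F

def label_preference(reward_a: float, reward_b: float):
    # Only the ordering matters: r = 8 vs r = 80 yields the same
    # (chosen, rejected) label as r = 8 vs r = 9.
    return ("a", "b") if reward_a >= reward_b else ("b", "a")

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Standard DPO objective: it depends only on log-prob ratios of the
    # chosen/rejected pair; the RM's scalar rewards never appear here.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```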