In the RLHF Workflow paper, the reward model is used to annotate new data generated by the LLM during the iterative DPO process, producing scalar reward values. According to Algorithm 1, the traditional RM+RLHF pipeline incorporates these scalars directly into the loss function, so a reward of r = 8 versus r = 80 leads to different outcomes.

However, with the DPO method, the training objective is the DPO loss, which never uses the reward scalar explicitly; the only information that enters is the preference that response A is preferred over response B. The paper does not give specific details on how this is handled.
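For reference, the standard per-pair DPO objective only sees which response is the chosen one ($y_w$) and which is the rejected one ($y_l$), not their reward scores:

$$\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)$$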
My question is: if we use the ArmoRM model for training with the iterative DPO method, does training still use only the information about which score is higher, rather than the actual scalar reward values? Is labeling the preference pairs with it sufficient to fully utilize the multi-objective RM?
Yes, DPO does not leverage the absolute value of the reward, only the ranking information. This is also natural for semi-supervised learning, where we use only a hard binary signal (win vs. lose) to reduce noise.
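To make that concrete, here is a minimal sketch (not the actual training code from this repo; the helper names are made up for illustration): the RM scores are used only to decide which response becomes "chosen" and which becomes "rejected", and the DPO loss itself sees only the log-probabilities of that pair.

```python
import torch
import torch.nn.functional as F

def label_preference(reward_a: float, reward_b: float):
    # Only the ordering matters: r = 8 vs r = 80 yields the same
    # (chosen, rejected) label as r = 8 vs r = 9.
    return ("a", "b") if reward_a >= reward_b else ("b", "a")

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Standard DPO objective: it depends only on log-prob ratios of the
    # chosen/rejected pair; the RM's scalar rewards never appear here.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```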