This repo contains the code for the paper *Post-hoc Reward Calibration: A Case Study on Length Bias*. We propose to use Locally Weighted Regression (LWR) to estimate the bias, which is then subtracted from the observed reward, thereby approximating the underlying true reward (a minimal sketch of the idea is given after the list below). Focusing on the prevalent length bias, we validate the proposed method in three different settings:
- Calibrated Reward for the RewardBench benchmark.
- Calibrated Reward for LLM evaluation.
- Calibrated Reward for LLM alignment.
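Before the setting-specific instructions, here is a minimal, hedged sketch of the calibration idea: fit a locally weighted regression of reward against response length, treat the fitted value as the estimated length bias, and subtract it. This is an illustrative approximation, not the exact implementation in `src/calibrate_rewardbench.py`; the `frac` bandwidth below is a placeholder hyperparameter.

```python
# Minimal sketch of LWR-based reward calibration (illustrative, not the repo's exact code).
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def calibrate_rewards(rewards, lengths, frac=0.3):
    """Remove the length-correlated component of reward-model scores.

    rewards: per-response scores from a reward model
    lengths: response lengths (e.g. token counts)
    frac:    LOWESS bandwidth; a placeholder value, not taken from the paper
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # Fitted reward as a smooth function of length = estimated length bias.
    bias = lowess(rewards, lengths, frac=frac, return_sorted=False)
    # Calibrated reward approximates the length-independent component.
    return rewards - bias
```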
You can use the RewardBench setting to understand our method quickly. We calibrate different reward models from the RewardBench leaderboard and re-evaluate them on RewardBench, observing an average gain of 3.11 points across 33 models. To reproduce:
- Download the official reward results on RewardBench from https://huggingface.co/datasets/allenai/reward-bench-results. You can use `git lfs clone https://huggingface.co/datasets/allenai/reward-bench-results`.
- Change the `DIR` and `HF_TOKEN` in `src/calibrate_rewardbench.py`.
- After running `sh calibrate_rewardbench.sh`, the calibrated results are saved in `results/calibrated_rewardbench`.
- To get the figure illustrating the calibration effect, use `notebooks/rewardbench_results.ipynb` (a toy accuracy comparison is sketched after these steps).
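For intuition about what the re-evaluation measures, the sketch below compares pairwise accuracy on chosen/rejected responses before and after calibration, reusing the `calibrate_rewards` helper sketched earlier. The input arrays are hypothetical; the actual files produced by `calibrate_rewardbench.sh` have their own format.

```python
# Hypothetical before/after accuracy comparison; adapt the data loading to the real result files.
import numpy as np

def pairwise_accuracy(chosen, rejected):
    """Fraction of pairs where the chosen response outscores the rejected one."""
    return float(np.mean(np.asarray(chosen) > np.asarray(rejected)))

def evaluate_calibration(chosen_rewards, rejected_rewards, chosen_lengths, rejected_lengths):
    raw_acc = pairwise_accuracy(chosen_rewards, rejected_rewards)
    # Calibrate chosen and rejected scores together so they share one length-bias estimate.
    rewards = np.concatenate([chosen_rewards, rejected_rewards])
    lengths = np.concatenate([chosen_lengths, rejected_lengths])
    calibrated = calibrate_rewards(rewards, lengths)  # helper from the earlier sketch
    n = len(chosen_rewards)
    cal_acc = pairwise_accuracy(calibrated[:n], calibrated[n:])
    return raw_acc, cal_acc
```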
In this paper, based on the AlpacaEval leaderboard, we demonstrate that calibrated open-source BT-based reward models not only improve on RewardBench but also have the potential to provide GPT-level evaluation for LLMs.
- Download the AlpacaEval results from https://github.com/tatsu-lab/alpaca_eval/tree/main/results
- Download the reward models you want to calibrate. We provide individual running scripts for the reward models calibrated in the paper; please see the `run_rm` directory.
- After running the reward models on the AlpacaEval benchmark and getting the rewarding results, use `calibrate_alpacaeval.sh` for calibration and `notebook/alpacaeval_results.ipynb` to visualise the results (a toy win-rate sketch is given after these steps).
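As a rough illustration of how calibrated rewards can serve as an automatic judge, the sketch below computes a length-calibrated win rate between two models' outputs on the same prompts, again reusing `calibrate_rewards` from above. The data layout is an assumption, not the format produced by `calibrate_alpacaeval.sh`.

```python
# Illustrative length-calibrated win rate of model A over model B on shared prompts.
import numpy as np

def calibrated_win_rate(rewards_a, lengths_a, rewards_b, lengths_b):
    # Pool both models' responses so they share one length-bias estimate.
    rewards = np.concatenate([rewards_a, rewards_b])
    lengths = np.concatenate([lengths_a, lengths_b])
    calibrated = calibrate_rewards(rewards, lengths)  # helper from the earlier sketch
    n = len(rewards_a)
    a, b = calibrated[:n], calibrated[n:]
    # Ties count as half a win, mirroring common win-rate conventions.
    return float(np.mean((a > b) + 0.5 * (a == b)))
```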
We directly use the code from https://github.com/huggingface/alignment-handbook/tree/main for DPO training. We provide the AlpacaEval results in our repo in `results`.
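The DPO recipes consume preference pairs; purely as a hedged sketch of how calibrated rewards could be used to label such pairs (the paper's actual data preparation may differ), one can mark the response with the higher calibrated reward as chosen:

```python
# Hypothetical construction of DPO preference pairs from calibrated rewards.
def build_dpo_pairs(prompts, responses_a, responses_b, cal_rewards_a, cal_rewards_b):
    """Label the higher-calibrated-reward response as 'chosen' for each prompt."""
    pairs = []
    for p, ra, rb, sa, sb in zip(prompts, responses_a, responses_b,
                                 cal_rewards_a, cal_rewards_b):
        chosen, rejected = (ra, rb) if sa >= sb else (rb, ra)
        pairs.append({"prompt": p, "chosen": chosen, "rejected": rejected})
    return pairs
```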
If you find this work useful or relevant to your research, please consider citing this paper:
```bibtex
@article{huang2024post,
  title={Post-hoc Reward Calibration: A Case Study on Length Bias},
  author={Huang, Zeyu and Qiu, Zihan and Wang, Zili and Ponti, Edoardo M and Titov, Ivan},
  journal={arXiv preprint arXiv:2409.17407},
  year={2024}
}
```