This repo contains the code for the paper *Post-hoc Reward Calibration: A Case Study on Length Bias*. We propose to use Locally Weighted Regression (LWR) to estimate the bias, which is then subtracted from the observed reward, thereby approximating the underlying true reward (a minimal sketch of the idea is given after the list below). Focusing on the prevalent length bias, we validate the proposed method in three different settings:
- Calibrated Reward for the RewardBench benchmark.
- Calibrated Reward for LLM evaluation.
- Calibrated Reward for LLM alignment.
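Before the setting-specific instructions, here is a minimal, hedged sketch of the calibration idea: fit a locally weighted regression of reward against response length, treat the fitted value as the estimated length bias, and subtract it. This is an illustrative approximation, not the exact implementation in `src/calibrate_rewardbench.py`; the `frac` bandwidth below is a placeholder hyperparameter.

```python
# Minimal sketch of LWR-based reward calibration (illustrative, not the repo's exact code).
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def calibrate_rewards(rewards, lengths, frac=0.3):
    """Remove the length-correlated component of reward-model scores.

    rewards: per-response scores from a reward model
    lengths: response lengths (e.g. token counts)
    frac:    LOWESS bandwidth; a placeholder value, not taken from the paper
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # Fitted reward as a smooth function of length = estimated length bias.
    bias = lowess(rewards, lengths, frac=frac, return_sorted=False)
    # Calibrated reward approximates the length-independent component.
    return rewards - bias
```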
You can use the RewardBench setting to understand our method quickly. We calibrate different reward models from the RewardBench leaderboard and re-evaluate them on RewardBench, observing an average gain of 3.11 points across 33 models. To reproduce:
- Download the official reward results on RewardBench from https://huggingface.co/datasets/allenai/reward-bench-results. You can use `git lfs clone https://huggingface.co/datasets/allenai/reward-bench-results`.
- Change the `DIR` and `HF_TOKEN` in `src/calibrate_rewardbench.py`.
- After running `sh calibrate_rewardbench.sh`, the calibrated results are saved in `results/calibrated_rewardbench`.
- To get the figure illustrating the calibration effect, use `notebooks/rewardbench_results.ipynb` (a toy accuracy comparison is sketched after these steps).
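For intuition about what the re-evaluation measures, the sketch below compares pairwise accuracy on chosen/rejected responses before and after calibration, reusing the `calibrate_rewards` helper sketched earlier. The input arrays are hypothetical; the actual files produced by `calibrate_rewardbench.sh` have their own format.

```python
# Hypothetical before/after accuracy comparison; adapt the data loading to the real result files.
import numpy as np

def pairwise_accuracy(chosen, rejected):
    """Fraction of pairs where the chosen response outscores the rejected one."""
    return float(np.mean(np.asarray(chosen) > np.asarray(rejected)))

def evaluate_calibration(chosen_rewards, rejected_rewards, chosen_lengths, rejected_lengths):
    raw_acc = pairwise_accuracy(chosen_rewards, rejected_rewards)
    # Calibrate chosen and rejected scores together so they share one length-bias estimate.
    rewards = np.concatenate([chosen_rewards, rejected_rewards])
    lengths = np.concatenate([chosen_lengths, rejected_lengths])
    calibrated = calibrate_rewards(rewards, lengths)  # helper from the earlier sketch
    n = len(chosen_rewards)
    cal_acc = pairwise_accuracy(calibrated[:n], calibrated[n:])
    return raw_acc, cal_acc
```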
In this paper, based on the AlpacaEval leaderboard, we demonstrate that calibrated open-source BT-based reward models not only improve on RewardBench but also have the potential to provide GPT-level evaluation for LLMs.
- Download the AlpacaEval results from https://github.com/tatsu-lab/alpaca_eval/tree/main/results
- Download the reward models you want to calibrate. We provide individual running scripts for the reward models calibrated in the paper; please see the `run_rm` directory.
- After running the reward models on the AlpacaEval benchmark and getting the rewarding results, use `calibrate_alpacaeval.sh` for calibration and `notebook/alpacaeval_results.ipynb` to visualise the results (a toy win-rate sketch is given after these steps).
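As a rough illustration of how calibrated rewards can serve as an automatic judge, the sketch below computes a length-calibrated win rate between two models' outputs on the same prompts, again reusing `calibrate_rewards` from above. The data layout is an assumption, not the format produced by `calibrate_alpacaeval.sh`.

```python
# Illustrative length-calibrated win rate of model A over model B on shared prompts.
import numpy as np

def calibrated_win_rate(rewards_a, lengths_a, rewards_b, lengths_b):
    # Pool both models' responses so they share one length-bias estimate.
    rewards = np.concatenate([rewards_a, rewards_b])
    lengths = np.concatenate([lengths_a, lengths_b])
    calibrated = calibrate_rewards(rewards, lengths)  # helper from the earlier sketch
    n = len(rewards_a)
    a, b = calibrated[:n], calibrated[n:]
    # Ties count as half a win, mirroring common win-rate conventions.
    return float(np.mean((a > b) + 0.5 * (a == b)))
```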
We directly use the code from https://github.com/huggingface/alignment-handbook/tree/main for DPO training. We provide the AlpacaEval results in our repo in `results`.
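The DPO recipes consume preference pairs; purely as a hedged sketch of how calibrated rewards could be used to label such pairs (the paper's actual data preparation may differ), one can mark the response with the higher calibrated reward as chosen:

```python
# Hypothetical construction of DPO preference pairs from calibrated rewards.
def build_dpo_pairs(prompts, responses_a, responses_b, cal_rewards_a, cal_rewards_b):
    """Label the higher-calibrated-reward response as 'chosen' for each prompt."""
    pairs = []
    for p, ra, rb, sa, sb in zip(prompts, responses_a, responses_b,
                                 cal_rewards_a, cal_rewards_b):
        chosen, rejected = (ra, rb) if sa >= sb else (rb, ra)
        pairs.append({"prompt": p, "chosen": chosen, "rejected": rejected})
    return pairs
```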
If you find this work useful or relevant to your research, please consider citing this paper:
```bibtex
@article{huang2024post,
  title={Post-hoc Reward Calibration: A Case Study on Length Bias},
  author={Huang, Zeyu and Qiu, Zihan and Wang, Zili and Ponti, Edoardo M and Titov, Ivan},
  journal={arXiv preprint arXiv:2409.17407},
  year={2024}
}
```