This codebase explores Reinforcement Learning from Human Feedback (RLHF) with unpaired preferences. Unlike the standard approach, where preferences are expressed by comparing two completions of the same prompt, we fine-tune the model on preferences expressed as thumbs-up or thumbs-down ratings of individual completions.
To do so, we implement a pointwise reward model and an unpaired RLOO trainer.
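To make the difference concrete, here is a minimal, illustrative sketch (not the repository's actual code) of the two reward-model objectives in PyTorch: the standard pairwise Bradley-Terry loss over (chosen, rejected) pairs, and a pointwise binary cross-entropy loss over individual thumbs-up/down labels.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: push the chosen completion's score above the rejected one's."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

def pointwise_rm_loss(scores: torch.Tensor, thumbs_up: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy: treat each completion's score as the logit of P(thumbs-up)."""
    return F.binary_cross_entropy_with_logits(scores, thumbs_up.float())

# Toy usage: scalar scores for 4 completions, with binary feedback for the pointwise case.
scores = torch.randn(4)
labels = torch.tensor([1, 0, 1, 1])
print(pointwise_rm_loss(scores, labels))
```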
- Install the requirements from `requirements.txt`.
- All scripts can run on a single A100-80GB GPU and each have two variants: full training and QLoRA.
- Supervised fine-tuning (SFT):

```bash
# Full SFT training of Qwen2.5-1.5B
bash jobs_local/train_sft_full.sh

# QLoRA SFT training of Zephyr-7b
bash jobs_local/train_sft_qlora.sh
```
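The SFT scripts presumably wrap a standard supervised fine-tuning loop; below is a minimal sketch using TRL's `SFTTrainer`. The dataset name and output directory are placeholders, and the actual scripts may configure things differently.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset; the scripts may use a different one.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B",               # full-training variant; QLoRA would add a PEFT config
    args=SFTConfig(output_dir="sft_output"),  # hypothetical output directory
    train_dataset=dataset,
)
trainer.train()
```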
- Pairwise reward model (RM) training:

```bash
# Full pairwise RM training of Qwen2.5-1.5B
bash jobs_local/train_reward_pairwise_full.sh

# QLoRA pairwise RM training of Zephyr-7b
bash jobs_local/train_reward_pairwise_qlora.sh
```
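A rough sketch of what pairwise RM training might look like with TRL's `RewardTrainer`; the dataset, model id, and hyperparameters are placeholders, and the scripts' internals may differ.

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_id = "Qwen/Qwen2.5-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# A reward model is a sequence classifier with a single scalar output.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# Placeholder preference dataset with "chosen"/"rejected" columns.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="rm_pairwise_output"),
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```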
- RLOO training with paired preferences:

```bash
# Full training of standard RLOO of Qwen2.5-1.5B
bash jobs_local/train_rloo_paired_full.sh

# QLoRA training of standard RLOO of Zephyr-7b
bash jobs_local/train_rloo_paired_qlora.sh
```
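At the heart of RLOO (REINFORCE Leave-One-Out) is the baseline: with k completions sampled per prompt, each completion's reward is baselined against the mean reward of the other k - 1 samples. A minimal, illustrative sketch of that computation:

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_prompts, k) rewards for k sampled completions per prompt.

    Each completion is baselined by the mean reward of the other k - 1 samples.
    """
    k = rewards.shape[1]
    # Leave-one-out mean: (sum of all k rewards - own reward) / (k - 1)
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline

# Toy example: 2 prompts, 4 completions each.
rewards = torch.tensor([[1.0, 0.0, 0.5, 0.5],
                        [0.2, 0.8, 0.4, 0.6]])
print(rloo_advantages(rewards))
```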
- For the unpaired setting, use the same SFT scripts as above.
- Pointwise reward model (RM) training:

```bash
# Full training of pointwise RM of Qwen2.5-1.5B
bash jobs_local/train_reward_pointwise_full.sh

# QLoRA training of pointwise RM of Zephyr-7b
bash jobs_local/train_reward_pointwise_qlora.sh
```
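A sketch of what a pointwise RM training step might look like: a sequence classifier producing one logit per completion, trained with binary cross-entropy against the thumbs-up/down label. The model id, texts, and shapes are illustrative only.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B"  # placeholder; the scripts choose the model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

texts = ["Q: 2+2? A: 4", "Q: 2+2? A: 5"]   # prompt + completion, concatenated
labels = torch.tensor([1.0, 0.0])           # thumbs-up / thumbs-down

batch = tokenizer(texts, padding=True, return_tensors="pt")
logits = model(**batch).logits.squeeze(-1)  # one scalar score per completion
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
```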
- RLOO training with unpaired preferences:

```bash
# Full training of unpaired RLOO of Qwen2.5-1.5B
bash jobs_local/train_rloo_unpaired_full.sh

# QLoRA training of unpaired RLOO of Zephyr-7b
bash jobs_local/train_rloo_unpaired_qlora.sh
```
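In the unpaired variant, the rewards fed into the same leave-one-out computation come from the pointwise RM rather than a pairwise one. A toy illustration; whether raw scores or thumbs-up probabilities serve as the reward is an assumption here, not something the scripts confirm.

```python
import torch

# Pointwise RM logits for k = 4 sampled completions of one prompt.
logits = torch.tensor([[2.1, -0.3, 0.7, -1.2]])

# One option: use P(thumbs-up) as the reward; the repo may use raw scores instead.
rewards = torch.sigmoid(logits)

k = rewards.shape[1]
baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
advantages = rewards - baseline   # identical to the paired-RLOO computation
print(advantages)
```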
- KTO:

```bash
# Full training of KTO of Qwen2.5-1.5B
bash jobs_local/train_kto_full.sh

# QLoRA training of KTO of Zephyr-7b
bash jobs_local/train_kto_qlora.sh
```
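KTO also trains directly on unpaired thumbs-style feedback. A rough sketch with TRL's `KTOTrainer`; the dataset is a placeholder with `prompt`/`completion`/`label` columns, and the scripts may configure things differently.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_id = "Qwen/Qwen2.5-1.5B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder unpaired-preference dataset: "prompt", "completion", "label" (bool).
dataset = load_dataset("trl-lib/kto-mix-14k", split="train")

trainer = KTOTrainer(
    model=model,
    args=KTOConfig(output_dir="kto_output"),
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```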