
Commit 5096bd9: release the code for training the reward model
refrain-wbh committed Jan 15, 2024 (1 parent: cf898ef)
Showing 12 changed files with 855 additions and 28 deletions.
82 changes: 58 additions & 24 deletions README.md
[![Data License](https://img.shields.io/badge/Data%20License-CC%20BY--NC%204.0-blue.svg)](./DATA_LICENSE)
[![Model License](https://img.shields.io/badge/Model%20License-GNU%20AGPL%203.0-red.svg)](./MODEL_LICENSE)

### *MOSS-RLHF <br>👉 <a href="https://openlmlab.github.io/MOSS-RLHF/" target="_blank">[Home page]</a>*

### *"Secrets of RLHF in Large Language Models Part I: PPO" <br>👉 <a href="https://arxiv.org/abs/2307.04964" target="_blank">[Technical report I]</a>*

### *"Secrets of RLHF in Large Language Models Part II: Reward Modeling" <br>👉 <a href="https://arxiv.org/abs/TBD" target="_blank">[Technical report II]</a>*
### *"Secrets of RLHF in Large Language Models Part I: PPO" `<br>`👉 `<a href="https://arxiv.org/abs/2307.04964" target="_blank">`[Technical report I]`</a>`*

### *"Secrets of RLHF in Large Language Models Part II: Reward Modeling" `<br>`👉 `<a href="https://arxiv.org/abs/2401.06080" target="_blank">`[Technical report II]`</a>`*

## 🌟🌟🌟 Breaking News
👉 Mon, 15. January 2024. We have released the code for training the reward model and the annotated hh-rlhf dataset ([hh-rlhf-strength-cleaned](https://huggingface.co/datasets/fnlp/hh-rlhf-strength-cleaned))!

👉 Fri, 12. January 2024. We have released the second paper **"Secrets of RLHF in Large Language Models Part II: Reward Modeling"**!

## 🌟 News
👉 Wed, 12. July 2023. We have released a Chinese reward model based on OpenChineseLlama-7B!
[moss-rlhf-reward-model-7B-zh](https://huggingface.co/Ablustrund/moss-rlhf-reward-model-7B-zh/tree/main)
<br>

👉 Thu, 13. July 2023. We have released an English reward model and SFT model based on Llama-7B!
[moss-rlhf-reward-model-7B-en](https://huggingface.co/fnlp/moss-rlhf-reward-model-7B-en)

[moss-rlhf-sft-model-7B-en](https://huggingface.co/fnlp/moss-rlhf-sft-model-7B-en)
<br>

👉 Wait a minute! Thu, 14. July 2023. We have released the English policy model after alignment with RLHF!
[moss-rlhf-policy-model-7B-en](https://huggingface.co/fnlp/moss-rlhf-policy-model-7B-en)
<br>

## 🧾 Open-source List

### RL related

- [X] Open source code for RL training in large language models.
- [X] A 7B Chinese reward model based on openChineseLlama.
- [X] A 7B English reward model based on Llama-7B.
- [X] SFT model for English.
- [X] Policy model for English after RLHF.

### RM related

- [X] Open source code for reward model training in large language models.
- [X] HH-RLHF dataset with preference strength annotation.
- [X] HH-RLHF validation set cleaned by GPT-4.

- ...

## 🌠 Introduction

The challenges of reward design, environment interaction, and agent training, coupled with the huge trial-and-error cost of large language models, pose a significant barrier to research on the alignment and safe deployment of LLMs. Stable RLHF training remains a puzzle.
In this technical report, we aim to help researchers train their models stably with human feedback.

Contributions are summarized as follows:

1) We release competitive Chinese and English reward models, respectively, which have good cross-model generalization ability, alleviating the cost of relabeling human preference data;
2) We conduct an in-depth analysis of the inner workings of the PPO algorithm and propose the PPO-max algorithm to ensure stable model training (a sketch of the underlying clipped PPO objective follows this list);
3) We release the complete PPO-max code so that LLMs in the current SFT stage can be better aligned with humans.
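
For orientation, PPO-max builds on PPO's standard clipped surrogate objective and adds the stabilization techniques analyzed in the report. The snippet below is only a minimal sketch of that vanilla clipped loss in PyTorch, not the repository's PPO-max implementation:

```python
# Minimal sketch of the vanilla clipped PPO policy loss (illustrative only;
# PPO-max layers further stabilizations on top of this, see the technical report).
import torch


def ppo_policy_loss(logprobs: torch.Tensor,
                    old_logprobs: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logprobs - old_logprobs)                     # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # negated: we minimize
```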

Expand All @@ -65,25 +76,28 @@ Contributions are summarized as follows:
<img style="width: 80%; min-width: 500px; display: block; margin: auto; margin-bottom: 20px" alt="MOSS-RLHF" src="./assets/img/img2.jpg">
</div>


## 🔩 Requirements & Setup

This repository works on Python 3.8 and PyTorch 1.13.1.

We recommend using the **conda** virtual environment to run the code.

#### Step 1: Create a new Python virtual environment

```bash
conda update conda -n base -c defaults
conda create -n rlhf python=3.8
conda activate rlhf
```

#### Step 2: Install PyTorch and TensorBoard

```bash
conda install pytorch==1.13.1 pytorch-cuda=11.7 tensorboard -c pytorch -c nvidia
```

#### Step 3: Install the remaining dependencies

```bash
conda install datasets accelerate safetensors chardet cchardet -c huggingface -c conda-forge
pip3 install transformers sentencepiece einops triton==1.0.0 rouge jionlp==1.4.14 nltk sacrebleu cpm_kernels
DS_BUILD_OPS=1 pip install deepspeed
```
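
Optionally, you can run a quick sanity check (not part of the repository) to confirm that the expected PyTorch build is installed and that CUDA is visible:

```python
# Optional sanity check (not part of the repo): verify the PyTorch install.
import torch

print("torch version:", torch.__version__)          # expected: 1.13.1
print("CUDA available:", torch.cuda.is_available())
```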

## ✨ Start training your own model!

### Training PPO model

Run the code in a few steps.

#### Step 1: Recover Reward model weights

We cannot directly release the full weights of the reward model because of license restrictions.
You can merge the diff weights with the original Llama-7B weights to recover the reward model we used.

We have uploaded the diff models (thanks to tatsu-lab); you can recover the reward model by following these steps:

```bash
1) Download the weight diff into your local machine. The weight diff is located at:
# For English:
Expand All @@ -124,20 +143,35 @@ python merge_weight_en.py recover --path_raw decapoda-research/llama-7b-hf --pat
# For Chinese:
python merge_weight_zh.py recover --path_raw decapoda-research/llama-7b-hf --path_diff ./models/moss-rlhf-reward-model-7B-zh/diff --path_tuned ./models/moss-rlhf-reward-model-7B-zh/recover
```
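
Conceptually, recovery adds the released diff tensors back onto the base Llama-7B weights. The sketch below only illustrates that idea; the model class and local paths are assumptions, and the actual `merge_weight_en.py` / `merge_weight_zh.py` scripts also handle tokenizer files and integrity checks, so use them for real recovery.

```python
# Illustrative sketch of weight-diff recovery (assumption: diff = tuned - raw).
# Use merge_weight_en.py / merge_weight_zh.py for the real procedure.
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
diff = AutoModelForCausalLM.from_pretrained("./models/moss-rlhf-reward-model-7B-en/diff")

state = base.state_dict()
for name, delta in diff.state_dict().items():
    if name in state:                      # skip any extra params absent from the base model
        state[name] = state[name] + delta  # recovered = raw + diff

base.load_state_dict(state)
base.save_pretrained("./models/moss-rlhf-reward-model-7B-en/recover")
```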

#### Step 2: Select your own SFT model.

Because of some limitations, we cannot currently release the **Chinese** SFT model.
You can use your own SFT model, or a strong base model, instead of our SFT model.

#### Step 3: Start training

Run the command below.

```bash
# For Chinese:
# You need to use your own SFT model currently.
bash train_ppo_zh.sh
# For English:
# We have uploaded the SFT model and reward model to Hugging Face.
bash train_ppo_en.sh
```

### Training reward model

To train the reward model, specify the initial model via `--hf_model_name_or_path` (e.g., meta-llama/Llama-2-7b-hf) and a preference dataset via `--data_path` (such as hh-rlhf, or our provided [annotated hh-rlhf](https://huggingface.co/datasets/fnlp/hh-rlhf-strength-cleaned), whose format matches the training code), then run the command below.

```bash
# annotated dataset: https://huggingface.co/datasets/fnlp/hh-rlhf-strength-cleaned
# Assuming you have specified the --hf_model_name_or_path and --data_path parameters.
bash train_rm.sh
```
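
For reference, reward models of this kind are typically trained with a pairwise ranking loss over (chosen, rejected) response pairs, optionally mixed with an auxiliary language-modeling loss (cf. `--reward_lm_loss_factor` in `config_rm.py`). The snippet below is a minimal sketch of that objective under these assumptions, not the repository's actual training loop:

```python
# Minimal sketch of a pairwise reward-modeling loss (illustrative only;
# the real training code is invoked via train_rm.sh).
from typing import Optional

import torch
import torch.nn.functional as F


def reward_loss(chosen_rewards: torch.Tensor,
                rejected_rewards: torch.Tensor,
                lm_loss: Optional[torch.Tensor] = None,
                lm_loss_factor: float = 0.0) -> torch.Tensor:
    # Bradley-Terry style ranking loss: push r(chosen) above r(rejected).
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    if lm_loss is not None and lm_loss_factor > 0.0:
        # Optional auxiliary LM term, analogous to --reward_lm_loss_factor.
        loss = loss + lm_loss_factor * lm_loss
    return loss


# Dummy scalar rewards for a batch of 4 preference pairs:
chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
rejected = torch.tensor([0.5, 0.1, 1.0, 0.7])
print(reward_loss(chosen, rejected))
```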

## Citation
File renamed without changes.
41 changes: 41 additions & 0 deletions config_rm.py
import argparse


def parse_args(*args):
    parser = argparse.ArgumentParser(description='MOSS-RLHF Reward Model @Fudan NLP Group')
    # training settings
    parser.add_argument('--seed', type=int, default=42, help='seed')
    parser.add_argument('--lr', type=float, default=5e-6, help='learning rate of reward model')
    parser.add_argument('--batch_size', type=int, default=8, help='training batch size for single GPU')
    parser.add_argument('--gradient_checkpoint', action='store_true', help='enable gradient checkpointing to reduce memory usage')
    parser.add_argument('--reward_lm_loss_factor', type=float, default=0., help='weight of the auxiliary LM loss on the reward model (0 disables it)')
    parser.add_argument('--warmup_steps', type=int, default=500, help='warmup steps')
    parser.add_argument('--train_steps', type=int, default=10000, help='train steps')
    parser.add_argument('--fp32_loss', action='store_true', help='use fp32 to compute the cross-entropy loss; enable when numerical stability problems occur')
    parser.add_argument('--save_per_step', type=int, default=200, help='save a checkpoint and write validation TensorBoard logs every N steps')
    parser.add_argument('--print_interval', type=int, default=5, help='print training state and write training TensorBoard logs every N steps')
    parser.add_argument('--validation_metric', type=str, default='loss', help='metric to select the best model')

    # Optimizer, scheduler and dataloader
    parser.add_argument('--beta1', type=float, default=0.9, help='Adam beta1')
    parser.add_argument('--beta2', type=float, default=0.95, help='Adam beta2')
    parser.add_argument('--eps', type=float, default=1e-6, help='Adam epsilon')
    parser.add_argument('--num_prefetch', type=int, default=32, help='number of batches prefetched by the dataloader')
    parser.add_argument('--num_workers', type=int, default=1, help='number of dataloader worker processes')
    parser.add_argument('--weight_decay', type=float, default=0., help='l2 weight decay')

    # Path
    parser.add_argument('--data_path', type=str, default='./data', help='dataset for training and validation')
    parser.add_argument('--init_checkpoint_model', type=str, default=None, help='checkpoint used to initialize the model, used for fine-tuning')
    parser.add_argument('--logdir', type=str, default=None, help='path to save tensorboard logs')
    parser.add_argument('--model_save_path', type=str, default='./outputs/', help='checkpoint path, used to save the model during training')
    parser.add_argument('--hf_model_name_or_path', type=str, default='meta-llama/Llama-2-7b-hf', help='Hugging Face model name used to load the tokenizer, config and pretrained weights')

    # LLM settings
    parser.add_argument('--context_truncate', type=int, default=2048, help='max length for history')
    parser.add_argument('--delimiter', type=str, default='\n', help='delimiter to separate dialog history')

    args = parser.parse_args()
    return args


Empty file added rm/__init__.py
Empty file.
