
Commit 5096bd9: release the code for training the reward model
refrain-wbh committed Jan 15, 2024 (1 parent: cf898ef)
Showing 12 changed files with 855 additions and 28 deletions.
82 changes: 58 additions & 24 deletions README.md
[![Data License](https://img.shields.io/badge/Data%20License-CC%20BY--NC%204.0-blue.svg)](./DATA_LICENSE)
[![Model License](https://img.shields.io/badge/Model%20License-GNU%20AGPL%203.0-red.svg)](./MODEL_LICENSE)

### *MOSS-RLHF <br>👉 <a href="https://openlmlab.github.io/MOSS-RLHF/" target="_blank">[Home page]</a>*

### *"Secrets of RLHF in Large Language Models Part I: PPO" <br>👉 <a href="https://arxiv.org/abs/2307.04964" target="_blank">[Technical report I]</a>*

### *"Secrets of RLHF in Large Language Models Part II: Reward Modeling" <br>👉 <a href="https://arxiv.org/abs/TBD" target="_blank">[Technical report II]</a>*
### *"Secrets of RLHF in Large Language Models Part I: PPO" `<br>`👉 `<a href="https://arxiv.org/abs/2307.04964" target="_blank">`[Technical report I]`</a>`*

### *"Secrets of RLHF in Large Language Models Part II: Reward Modeling" `<br>`👉 `<a href="https://arxiv.org/abs/2401.06080" target="_blank">`[Technical report II]`</a>`*

## 🌟🌟🌟 Breaking News
👉 Mon, 15. January 2024. We have released the code for training the reward model and the annotated hh-rlhf dataset ([hh-rlhf-strength-cleaned](https://huggingface.co/datasets/fnlp/hh-rlhf-strength-cleaned))!

👉 Fri, 12. January 2024. We have released the second paper **"Secrets of RLHF in Large Language Models Part II: Reward Modeling"**!

## 🌟 News
👉 Wed, 12. July 2023. We have released a Chinese reward model based on OpenChineseLlama-7B!
[moss-rlhf-reward-model-7B-zh](https://huggingface.co/Ablustrund/moss-rlhf-reward-model-7B-zh/tree/main)
<br>

👉 Thu, 13. July 2023. We have released an English reward model and SFT model based on Llama-7B!
[moss-rlhf-reward-model-7B-en](https://huggingface.co/fnlp/moss-rlhf-reward-model-7B-en)

[moss-rlhf-sft-model-7B-en](https://huggingface.co/fnlp/moss-rlhf-sft-model-7B-en)
<br>

👉 Wait a minute! Thu, 14. July 2023. We have released the English policy model after alignment with RLHF!
[moss-rlhf-policy-model-7B-en](https://huggingface.co/fnlp/moss-rlhf-policy-model-7B-en)
<br>

## 🧾 Open-source List

### RL related

- [X] Open source code for RL training in large language models.
- [X] A 7B Chinese reward model based on openChineseLlama.
- [X] A 7B English reward model based on Llama-7B.
- [X] SFT model for English.
- [X] Policy model for English after RLHF.

### RM related

- [X] Open source code for reward model training in large language models.
- [X] HH-RLHF dataset with preference strength annotation.
- [X] HH-RLHF validation set cleaned by GPT-4.

- ...

## 🌠 Introduction

The challenges of reward design, environment interaction, and agent training, coupled with the huge trial-and-error cost of large language models, pose a significant barrier to research on the alignment and safe deployment of LLMs. Stable RLHF training remains a puzzle.
In this technical report, we aim to help researchers train their models stably with human feedback.

Contributions are summarized as follows:

1) We release competitive Chinese and English reward models, respectively, which have good cross-model generalization ability, alleviating the cost of relabeling human preference data;
2) We conduct an in-depth analysis of the inner workings of the PPO algorithm and propose the PPO-max algorithm to ensure stable model training (a sketch of the underlying clipped PPO objective follows this list);
3) We release the complete PPO-max code so that LLMs in the current SFT stage can be better aligned with humans.
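
For orientation, PPO-max builds on PPO's standard clipped surrogate objective and adds the stabilization techniques analyzed in the report. The snippet below is only a minimal sketch of that vanilla clipped loss in PyTorch, not the repository's PPO-max implementation:

```python
# Minimal sketch of the vanilla clipped PPO policy loss (illustrative only;
# PPO-max layers further stabilizations on top of this, see the technical report).
import torch


def ppo_policy_loss(logprobs: torch.Tensor,
                    old_logprobs: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logprobs - old_logprobs)                     # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # negated: we minimize
```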

Expand All @@ -65,25 +76,28 @@ Contributions are summarized as follows:
<img style="width: 80%; min-width: 500px; display: block; margin: auto; margin-bottom: 20px" alt="MOSS-RLHF" src="./assets/img/img2.jpg">
</div>


## 🔩 Requirements & Setup

This repository works on Python 3.8 and PyTorch 1.13.1.

We recommend using the **conda** virtual environment to run the code.

#### Step 1: Create a new Python virtual environment

```bash
conda update conda -n base -c defaults
conda create -n rlhf python=3.8
conda activate rlhf
```

#### Step 2: Install PyTorch and TensorBoard

```bash
conda install pytorch==1.13.1 pytorch-cuda=11.7 tensorboard -c pytorch -c nvidia
```

#### Step 3: Install the remaining dependencies

```bash
conda install datasets accelerate safetensors chardet cchardet -c huggingface -c conda-forge
pip3 install transformers sentencepiece einops triton==1.0.0 rouge jionlp==1.4.14 nltk sacrebleu cpm_kernels
DS_BUILD_OPS=1 pip install deepspeed
```
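
Optionally, you can run a quick sanity check (not part of the repository) to confirm that the expected PyTorch build is installed and that CUDA is visible:

```python
# Optional sanity check (not part of the repo): verify the PyTorch install.
import torch

print("torch version:", torch.__version__)          # expected: 1.13.1
print("CUDA available:", torch.cuda.is_available())
```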

## ✨ Start training your own model!

### Training PPO model

Run the code in a few steps.

#### Step 1: Recover Reward model weights

We cannot directly release the full weights of the reward model because of license restrictions.
You can merge the diff weights with the original Llama-7B weights to recover the reward model we used.

We have uploaded the diff models (thanks to tatsu-lab); you can recover the reward model by following these steps:

```bash
1) Download the weight diff into your local machine. The weight diff is located at:
# For English:
Expand All @@ -124,20 +143,35 @@ python merge_weight_en.py recover --path_raw decapoda-research/llama-7b-hf --pat
# For Chinese:
python merge_weight_zh.py recover --path_raw decapoda-research/llama-7b-hf --path_diff ./models/moss-rlhf-reward-model-7B-zh/diff --path_tuned ./models/moss-rlhf-reward-model-7B-zh/recover
```
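
Conceptually, recovery adds the released diff tensors back onto the base Llama-7B weights. The sketch below only illustrates that idea; the model class and local paths are assumptions, and the actual `merge_weight_en.py` / `merge_weight_zh.py` scripts also handle tokenizer files and integrity checks, so use them for real recovery.

```python
# Illustrative sketch of weight-diff recovery (assumption: diff = tuned - raw).
# Use merge_weight_en.py / merge_weight_zh.py for the real procedure.
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
diff = AutoModelForCausalLM.from_pretrained("./models/moss-rlhf-reward-model-7B-en/diff")

state = base.state_dict()
for name, delta in diff.state_dict().items():
    if name in state:                      # skip any extra params absent from the base model
        state[name] = state[name] + delta  # recovered = raw + diff

base.load_state_dict(state)
base.save_pretrained("./models/moss-rlhf-reward-model-7B-en/recover")
```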

#### Step 2: Select your own SFT model.

Because of some limitations, we cannot currently release the **Chinese** SFT model.
You can use your own SFT model, or a strong base model, instead of our SFT model.

#### Step 3: Start training

Run the command below.

```bash
# For Chinese:
# You need to use your own SFT model currently.
bash train_ppo_zh.sh
# For English:
# We have uploaded the SFT model and reward model to Hugging Face.
bash train_ppo_en.sh
```

### Training reward model

To train the reward model, specify the initial model via `--hf_model_name_or_path` (e.g., meta-llama/Llama-2-7b-hf) and a preference dataset via `--data_path` (such as hh-rlhf, or our provided [annotated hh-rlhf](https://huggingface.co/datasets/fnlp/hh-rlhf-strength-cleaned), whose format matches the training code), then run the command below.

```bash
# annotated dataset: https://huggingface.co/datasets/fnlp/hh-rlhf-strength-cleaned
# Assuming you have specified the --hf_model_name_or_path and --data_path parameters.
bash train_rm.sh
```
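
For reference, reward models of this kind are typically trained with a pairwise ranking loss over (chosen, rejected) response pairs, optionally mixed with an auxiliary language-modeling loss (cf. `--reward_lm_loss_factor` in `config_rm.py`). The snippet below is a minimal sketch of that objective under these assumptions, not the repository's actual training loop:

```python
# Minimal sketch of a pairwise reward-modeling loss (illustrative only;
# the real training code is invoked via train_rm.sh).
from typing import Optional

import torch
import torch.nn.functional as F


def reward_loss(chosen_rewards: torch.Tensor,
                rejected_rewards: torch.Tensor,
                lm_loss: Optional[torch.Tensor] = None,
                lm_loss_factor: float = 0.0) -> torch.Tensor:
    # Bradley-Terry style ranking loss: push r(chosen) above r(rejected).
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    if lm_loss is not None and lm_loss_factor > 0.0:
        # Optional auxiliary LM term, analogous to --reward_lm_loss_factor.
        loss = loss + lm_loss_factor * lm_loss
    return loss


# Dummy scalar rewards for a batch of 4 preference pairs:
chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
rejected = torch.tensor([0.5, 0.1, 1.0, 0.7])
print(reward_loss(chosen, rejected))
```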

## Citation
File renamed without changes.
41 changes: 41 additions & 0 deletions config_rm.py
import argparse


def parse_args(*args):
    parser = argparse.ArgumentParser(description='MOSS-RLHF Reward Model @Fudan NLP Group')
    # training settings
    parser.add_argument('--seed', type=int, default=42, help='seed')
    parser.add_argument('--lr', type=float, default=5e-6, help='learning rate of reward model')
    parser.add_argument('--batch_size', type=int, default=8, help='training batch size for single GPU')
    parser.add_argument('--gradient_checkpoint', action='store_true', help='enable gradient checkpointing to reduce memory usage')
    parser.add_argument('--reward_lm_loss_factor', type=float, default=0., help='weight of the auxiliary LM loss on the reward model (0 disables it)')
    parser.add_argument('--warmup_steps', type=int, default=500, help='warmup steps')
    parser.add_argument('--train_steps', type=int, default=10000, help='train steps')
    parser.add_argument('--fp32_loss', action='store_true', help='use fp32 to compute the cross-entropy loss; enable when numerical stability problems occur')
    parser.add_argument('--save_per_step', type=int, default=200, help='save a checkpoint and write validation TensorBoard logs every N steps')
    parser.add_argument('--print_interval', type=int, default=5, help='print training state and write training TensorBoard logs every N steps')
    parser.add_argument('--validation_metric', type=str, default='loss', help='metric to select the best model')

    # Optimizer, scheduler and dataloader
    parser.add_argument('--beta1', type=float, default=0.9, help='Adam beta1')
    parser.add_argument('--beta2', type=float, default=0.95, help='Adam beta2')
    parser.add_argument('--eps', type=float, default=1e-6, help='Adam epsilon')
    parser.add_argument('--num_prefetch', type=int, default=32, help='number of batches prefetched by the dataloader')
    parser.add_argument('--num_workers', type=int, default=1, help='number of dataloader worker processes')
    parser.add_argument('--weight_decay', type=float, default=0., help='l2 weight decay')

    # Path
    parser.add_argument('--data_path', type=str, default='./data', help='dataset for training and validation')
    parser.add_argument('--init_checkpoint_model', type=str, default=None, help='checkpoint used to initialize the model, used for fine-tuning')
    parser.add_argument('--logdir', type=str, default=None, help='path to save tensorboard logs')
    parser.add_argument('--model_save_path', type=str, default='./outputs/', help='checkpoint path, used to save the model during training')
    parser.add_argument('--hf_model_name_or_path', type=str, default='meta-llama/Llama-2-7b-hf', help='Hugging Face model name used to load the tokenizer, config and pretrained weights')

    # LLM settings
    parser.add_argument('--context_truncate', type=int, default=2048, help='max length for history')
    parser.add_argument('--delimiter', type=str, default='\n', help='delimiter to separate dialog history')

    args = parser.parse_args()
    return args


Empty file added rm/__init__.py
Empty file.
