This is the repository for running iterative DPO with rule-based rewards. In every iteration, we sample responses from the model and label their rewards with a rule-based method. We then construct preference pairs from the reward scores for DPO training. In our code, we perform iterative DPO starting from Qwen2.5-MATH-7B with prompts from Numina-Math. After DPO training, our model achieves 26.7% on AIME24, 76.8% on MATH500, 62.5% on AMC, 30.5% on Minerva-Math, and 37.9% on OlympiadBench, surpassing Llama-3.1-70B-Instruct and nearly on par with Eurus-2-7B-PRIME, which adopts SFT and PPO training.

Illustration of the iterative DPO pipeline. Here, exploration is implemented via best-of-n vs. worst-of-n sampling: we sample n responses and use the responses with the highest and lowest rewards as a preference pair. For RAFT training, the pipeline is similar, except that we only use the positive data for fine-tuning.
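For concreteness, the best-of-n vs. worst-of-n pair construction can be sketched as follows. This is a minimal illustration, not the repository's exact implementation: the `extract_final_answer` helper and the exact-match reward below are simplified stand-ins for the repo's rule-based verifier (which does math-aware comparison, e.g. via sympy/latex2sympy2).

```python
import re
from typing import Dict, List, Optional

def extract_final_answer(response: str) -> Optional[str]:
    # Simplified: take the last \boxed{...} occurrence as the final answer.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def rule_based_reward(response: str, gold_answer: str) -> float:
    # 1.0 if the extracted answer matches the gold answer, else 0.0.
    # Exact string match here is a simplification of the real math verifier.
    pred = extract_final_answer(response)
    return 1.0 if pred is not None and pred == gold_answer.strip() else 0.0

def build_preference_pair(prompt: str, responses: List[str], gold_answer: str) -> Optional[Dict[str, str]]:
    # Best-of-n vs. worst-of-n: the highest-reward sample is "chosen", the lowest is "rejected".
    rewards = [rule_based_reward(r, gold_answer) for r in responses]
    best = max(range(len(responses)), key=lambda i: rewards[i])
    worst = min(range(len(responses)), key=lambda i: rewards[i])
    if rewards[best] == rewards[worst]:
        return None  # all samples tie (all correct or all wrong): no informative pair
    return {"prompt": prompt, "chosen": responses[best], "rejected": responses[worst]}
```

For RAFT, one would instead keep only the positive (reward-1) responses and fine-tune on them with the standard SFT loss.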
We provide the model checkpoints on Hugging Face (a minimal loading example follows the list):
- Qwen Warm-Up SFT: RLHFlow/Qwen2.5-7B-SFT
- Qwen-DPO-R1-Zero: RLHFlow/Qwen2.5-7B-DPO-Zero
- Qwen-DPO-R1: RLHFlow/Qwen2.5-7B-DPO
- Qwen-RAFT-R1-Zero: RLHFlow/Qwen2.5-7B-RAFT-Zero
- Qwen-PPO-R1-Zero: RLHFlow/Qwen2.5-7B-PPO-Zero
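These checkpoints can be loaded with the standard `transformers` API. The snippet below is a sketch: the prompt and generation settings are illustrative, and it assumes the checkpoint ships a Qwen-style chat template.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLHFlow/Qwen2.5-7B-DPO"  # any checkpoint from the list above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Build a chat-formatted prompt and generate a response.
messages = [{"role": "user", "content": "Find the sum of all positive divisors of 36."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```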
Inspired by the success of DeepSeek-R1-Zero and several replications of PPO training, which achieve superior performance on mathematical reasoning and demonstrate the "Aha moment" during RL training, we are curious about alternative RL algorithms in this scenario. In this project, we implement rule-based RL from Qwen2.5-MATH-7B-base using iterative DPO and rejection sampling (RAFT), which are efficient and easy to implement. We train the models on prompts from the MATH training set and Numina-Math, and evaluate them on AIME24, AMC23, MATH500, Minerva Math, and OlympiadBench. After several iterations, our models achieve an overall accuracy of 50.0% for DPO after SFT warm-up, 47.0% for DPO starting from the base model, and 44.4% for RAFT, compared to 33.9% for the base model. We list the results as follows:
| Model | AIME24 | MATH500 | AMC | Minerva Math | OlympiadBench | Average |
|---|---|---|---|---|---|---|
| Base | 23.3 | 65.4 | 47.5 | 9.9 | 23.4 | 33.9 |
| Qwen-Base + SFT Warm-Up | 20.0 | 73.2 | 62.5 | 30.5 | 35.6 | 44.4 |
| Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7 |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen-DPO-NLL-R1-Zero | 30.0 | 74.4 | 62.5 | 26.1 | 37.9 | 46.2 |
| Qwen-DPO-R1-Zero | 26.7 | 76.8 | 62.5 | 30.9 | 37.9 | 47.0 |
| Qwen-DPO-R1-MATH7500-Zero | 26.7 | 72.2 | 57.5 | 26.8 | 37.2 | 44.1 |
| Qwen-RAFT-R1-Zero | 20.0 | 77.6 | 55.0 | 30.5 | 38.7 | 44.4 |
| Qwen-DPO-R1 | 30.0 | 84.4 | 62.5 | 33.5 | 48.4 | 51.8 |
| Qwen-PPO-R1-MATH7500-Zero | 33.3 | 77.2 | 67.5 | 33.8 | 40.7 | 50.5 |
| Qwen-PPO-R1-Zero | 43.3 | 79.4 | 62.5 | 33.1 | 40.7 | 51.8 |
Our key findings:
- DPO and RAFT significantly improve model performance while remaining efficient and easy to implement.
- Iterative DPO does NOT benefit from the additional negative log-likelihood (NLL) loss (a loss sketch follows this list).
- DPO with SFT warm-up contributes to the training and improves performance.
- Compared to the PPO algorithm (51.8%), DPO/RAFT achieve inferior performance, showing that PPO remains one of the most effective RL algorithms in this context.
- SFT warm-up before DPO improves model performance (51.8%) and makes it competitive with Qwen-PPO-R1-Zero.
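For reference, the DPO objective, together with the optional NLL term mentioned above, can be written in a few lines of PyTorch. This is a generic sketch of the standard DPO loss rather than the exact implementation in this repository; the `beta` and `nll_weight` values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta: float = 0.1, nll_weight: float = 0.0):
    # Inputs are sequence-level log-probabilities (summed over response tokens)
    # of the chosen/rejected responses under the policy and the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    # Optional NLL term on the chosen response (the variant that, per our findings,
    # does not help iterative DPO); nll_weight = 0 recovers vanilla DPO.
    loss = loss + nll_weight * (-policy_chosen_logps)
    return loss.mean()

# Example with dummy log-probabilities for a batch of two preference pairs:
lp_c = torch.tensor([-12.3, -9.8]); lp_r = torch.tensor([-15.1, -14.2])
ref_c = torch.tensor([-13.0, -10.5]); ref_r = torch.tensor([-14.0, -13.9])
print(dpo_loss(lp_c, lp_r, ref_c, ref_r, beta=0.1))
```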
We use two separate conda environments for running the iterative DPO pipeline: `vllm` for response sampling (inference) and `rlhflow` for training.
conda create -n vllm python=3.10.9
conda activate vllm
pip install datasets
# The following setup is tested for CUDA 12.0-12.2 and CUDA 12.6
# To work with Llama-3, Mistral, Gemma-1/1.1/2, or DeepSeek models, you can consider the following vLLM version
pip install vllm==0.5.4
pip install accelerate==0.33.0
pip install deepspeed==0.14.5
pip install transformers==4.48.1
pip install numpy==1.26.4 # numpy must be <2.0; numpy 2.0 causes unexpected issues
pip install antlr4-python3-runtime==4.7.2
pip install sympy==1.12
pip install latex2sympy2==1.9.1
pip install word2number==1.1
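Within the `vllm` environment, the response-sampling step of the pipeline looks roughly like the sketch below. The model path, prompt, and sampling hyperparameters (e.g. n=8, temperature 1.0) are illustrative assumptions, not the exact settings used by the repository's scripts.

```python
from vllm import LLM, SamplingParams

# Sample n candidate responses per prompt; temperature > 0 gives diverse candidates
# for best-of-n vs. worst-of-n pair construction.
sampling_params = SamplingParams(n=8, temperature=1.0, top_p=1.0, max_tokens=4096)
llm = LLM(model="RLHFlow/Qwen2.5-7B-SFT")  # or the checkpoint from the previous iteration

prompts = ["Solve: If 3x + 7 = 22, what is x? Put your final answer in \\boxed{}."]
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    candidates = [o.text for o in out.outputs]  # n completions for this prompt
    print(len(candidates), "candidates sampled")
```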
conda create -n rlhflow python=3.10.9
conda activate rlhflow
git clone https://github.com/huggingface/alignment-handbook.git
cd ./alignment-handbook/
git checkout 27f7dbf00663dab66ad7334afb7a1311fa251f41
pip3 install torch==2.1.2 torchvision torchaudio
python -m pip install .
pip install flash-attn==2.6.3
pip install accelerate==0.33.0
pip install huggingface-hub==0.24.7
pip install transformers==4.42.2
pip install peft==0.7.1 # We do not use peft, but some versions cause errors
pip install deepspeed==0.15.4
pip install trl==0.9.6
pip install wandb
bash run_iter_dpo.sh
We provide evaluation scripts for all the benchmarks we use, including AIME24, AMC23, MATH500, OlympiadBench, and Minerva Math. Please go to the `eval_math` folder for detailed instructions.
The authors would like to thank the open-source community, including the developers of vLLM, VeRL, OpenRLHF, Qwen, and Axolotl, for sharing their models, code, and training recipes. We also thank the developers of DeepSeek-R1 for open-sourcing their state-of-the-art models and innovative training methodologies.
If you find this blog or our codebase useful, please consider citing our work:
@misc{zhang2025dpor1,
  title={Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead},
  author={Hanning Zhang and Jiarui Yao and Chenlu Ye and Wei Xiong and Tong Zhang},
  year={2025},
  howpublished={\url{https://efficient-unicorn-451.notion.site/Online-DPO-R1-Unlocking-Effective-Reasoning-Without-the-PPO-Overhead-1908b9a70e7b80c3bc83f4cf04b2f175?pvs=4}},
  note={Notion Blog}
}