- [29/11/2024] A demo page has been added to ModelScope. Thanks @wangxingjun778!
- [24/10/2024] OpenR now supports MCTS reasoning (#24)! 🌲
- [15/10/2024] Our report is now available on arXiv!
- [12/10/2024] OpenR has been released! 🚀
- ✅ Data generation with process supervision
- ✅ Online policy training
- ✅ Training of generative and discriminative process reward models
- ✅ Multiple search strategies
- ✅ Test-time computation and scaling law
Feature | Contents |
---|---|
✅ Data generation with process supervision | OmegaPRM: Improve Mathematical Reasoning in Language Models by Automated Process Supervision |
✅ Online policy training | Reinforcement learning training: online RL training with a PRM |
✅ PRM training | PRM training: supervised training for PRMs <br> Generative reward model training: direct GenRM |
✅ Multiple search strategies | Greedy Search <br> Best-of-N <br> Beam Search <br> MCTS <br> rStar: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers <br> Critic-MCTS |
✅ Test-time computation and scaling law | See the benchmark |
Feature | TODO (high priority; contributors welcome!) |
---|---|
👨‍💻 Data | Reproduce Journey Learning |
👨‍💻 RL training | Distributed training <br> Reinforcement Fine-Tuning (RFT) #80 |
👨‍💻 PRM | Larger-scale training <br> Training implementation of GenRM-CoT <br> Soft-label training #57 |
👨‍💻 Inference | Refactor the code structure #53 <br> Add more reasoning tasks (AIME, etc.) #53 <br> Multimodal reasoning #82 <br> Reasoning for code generation #68 <br> Dots #75 <br> Reasoning accuracy checks <br> Benchmarking |
See the Benchmark for details!
MATH-APS (our released dataset)
MATH-psa (our released process reward model)
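Both artifacts are hosted on the Hugging Face Hub. A minimal fetch sketch with `huggingface-cli` is below; the repo ids under the `openreasoner` organization and the local target directories are assumptions, so please check the Hub pages for the exact names.

```bash
# Assumed repo ids and target paths; verify them on the Hugging Face Hub first.
huggingface-cli download openreasoner/MATH-APS --repo-type dataset --local-dir data/MATH-APS
huggingface-cli download openreasoner/MATH-psa --local-dir models/MATH-psa
```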
conda create -n open_reasoner python=3.10
conda activate open_reasoner
pip install -r requirements.txt
pip3 install "fschat[model_worker,webui]"
pip install -U pydantic
cd envs/MATH/latex2sympy
pip install -e .
cd -
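As an optional sanity check (assuming the `open_reasoner` environment is still active), you can confirm that FastChat was installed correctly before launching any services:

```bash
# Print the installed FastChat version; an ImportError here means the
# `fschat` installation above did not succeed.
python -c "import fastchat; print(fastchat.__version__)"
```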
Before running the project, please make sure all required base models have been downloaded. The models used in this project include:

- Qwen2.5-Math-1.5B-Instruct
- Qwen2.5-Math-7B-Instruct
- peiyi9979/mistral-7b-sft
- peiyi9979/math-shepherd-mistral-7b-prm

For download instructions, see the Hugging Face download tutorial.
Before proceeding, please make sure all models are saved in their respective directories according to the project setup.
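For reference, here is a hedged sketch of downloading one of the models with `huggingface-cli`; `$MODEL_BASE` is the model directory you will configure in the next step, and the local sub-directory name is only an example.

```bash
# Download a policy model into the directory the service scripts will read from.
huggingface-cli download Qwen/Qwen2.5-Math-1.5B-Instruct \
    --local-dir $MODEL_BASE/Qwen2.5-Math-1.5B-Instruct
```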
Before running inference, please modify the following variables in the scripts under `reason/llm_service/` so that they point to the base models you want to use (an example is sketched after this list):

- `$MODEL_BASE`: the directory where your models are stored.
- `$POLICY_MODEL_NAME`: the name of the policy model you wish to use.
- `$VALUE_MODEL_NAME`: the name of the value model you wish to use.
- `$NUM_LM_WORKER`: the number of language-model (LM) workers to launch.
- `$NUM_RM_WORKER`: the number of reward-model (RM) workers to launch.
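A minimal sketch of what these assignments might look like inside a service script; the path and model names below are placeholders for your own setup.

```bash
# Hypothetical example values; adjust to your environment.
MODEL_BASE=/path/to/your/models
POLICY_MODEL_NAME=Qwen2.5-Math-1.5B-Instruct
VALUE_MODEL_NAME=math-shepherd-mistral-7b-prm
NUM_LM_WORKER=1
NUM_RM_WORKER=1
```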
Next, we run inference with different techniques.
For example, to start the LM and RM services for the Math Shepherd model, run:
sh reason/llm_service/create_service_math_shepherd.sh
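The script launches the services inside a tmux session, so a quick way to confirm that they came up is to list the active sessions:

```bash
# The session name defaults to `FastChat` (see the shutdown command below).
tmux ls
```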
To shut down the service processes, you can use:
tmux kill-session -t {Your Session Name} # default is `FastChat`
Note: make sure the model names passed to the evaluation scripts (`--LM`, `--RM`) are consistent with the variables set for the running service processes (`$POLICY_MODEL_NAME`, `$VALUE_MODEL_NAME`)!
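As an illustration only (the entry point `reason/evaluation/evaluate.py` and its exact flags are assumptions; check the scripts under `scripts/eval/` for the real invocation), the matching might look like this:

```bash
# The names below must equal the $POLICY_MODEL_NAME and $VALUE_MODEL_NAME
# used when the LM/RM services were started.
python reason/evaluation/evaluate.py \
    --LM Qwen2.5-Math-1.5B-Instruct \
    --RM math-shepherd-mistral-7b-prm
```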
export PYTHONPATH=$(pwd)
sh scripts/eval/cot_greedy.sh
# Method: cot. Average result: ({'majority_vote': 0.734, 'total_completion_tokens': 559.13},)
sh scripts/eval/cot_rerank.sh
# Method: best_of_n. Average result: ({'majority_vote': 0.782,
# 'prm_min_max': 0.772,
# 'prm_min_vote': 0.792,
# 'prm_last_max': 0.776,
# 'prm_last_vote': 0.792,
# 'total_completion_tokens': 4431.268},)
sh scripts/eval/beam_search.sh
# Method: beam_search. Average result: ({'majority_vote': 0.74, 'total_completion_tokens': 2350.492},)
sh scripts/eval/vanila_mcts.sh
Before training, modify the `$dataset_path`, `$model_name_or_path`, and `$prm_name_or_path` entries in `train/mat/scripts/train_llm.sh`, for example as sketched below.
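A hedged sketch of those entries; all three values are placeholders for your own dataset and model checkpoints.

```bash
# Hypothetical values inside train/mat/scripts/train_llm.sh; adjust to your setup.
dataset_path=/path/to/your/training/data
model_name_or_path=/path/to/models/Qwen2.5-Math-1.5B-Instruct
prm_name_or_path=/path/to/models/math-shepherd-mistral-7b-prm
```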
cd train/mat/scripts
bash train_llm.sh
cd prm/code
# single GPU
python finetune_qwen_single_gpu.py --model_path $YOUR_MODEL_PATH \
    --train_data_path $TRAIN_DATA_PATH \
    --test_data_path $TEST_DATA_PATH

# multi GPU
torchrun --nproc_per_node=2 finetune_qwen.py --model_path $YOUR_MODEL_PATH \
    --data_path $YOUR_DATA_FOLDER_PATH \
    --datasets both
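Before launching either command, set the placeholders it reads; they are shown here as exported environment variables, though you can equally substitute the paths directly into the command line.

```bash
# Placeholder paths; point these at your PRM base model and data splits.
export YOUR_MODEL_PATH=/path/to/models/Qwen2.5-Math-7B-Instruct
export TRAIN_DATA_PATH=/path/to/prm/train_data.jsonl
export TEST_DATA_PATH=/path/to/prm/test_data.jsonl
export YOUR_DATA_FOLDER_PATH=/path/to/prm/data
```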
Every contribution you make is valuable to the community.
Thank you for your interest in OpenR! 🥰 We are committed to growing an open-source community and warmly welcome contributions of all kinds. No matter how big or small, your efforts help us grow and improve. Contributions are not limited to code: answering questions, helping others, improving our documentation, and sharing the project are just as impactful.
Please check out the Contribution Guide!
- More comprehensive experiments on reinforcement learning training and search methods
- Larger-scale Prover-Verifier models
- Support for self-improvement training
The OpenR community is maintained by:
- OpenReasoner Team ([email protected])
OpenR is released under the MIT License.
If you find our resources helpful, please cite our paper:
@article{wang2024openr,
title={OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models},
author={Wang, Jun and Fang, Meng and Wan, Ziyu and Wen, Muning and Zhu, Jiachen and Liu, Anjie and Gong, Ziqin and Song, Yan and Chen, Lei and Ni, Lionel M and others},
journal={arXiv preprint arXiv:2410.09671},
year={2024}
}
Many thanks!
WeChat group:
[1] AlphaZero-like tree-search can guide large language model decoding and training.
[2] Reasoning with language model is planning with world model.
[3] Scaling LLM test-time compute optimally can be more effective than scaling model parameters.
[4] Think before you speak: Training language models with pause tokens.
[1] Training verifiers to solve math word problems.
[2] Solving math word problems with process- and outcome-based feedback.
[4] Making large language models better reasoners with step-aware verifier.
[5] OVM: Outcome-supervised value models for planning in mathematical reasoning.
[6] Generative verifiers: Reward modeling as next-token prediction.
[1] STaR: Bootstrapping reasoning with reasoning.
[2] Quiet-STaR: Language models can teach themselves to think before speaking.
[3] Improve mathematical reasoning in language models by automated process supervision.
[4] Shepherd: A critic for language model generation.
[5] Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations.