Thinking Model and RLHF Research Notes

This repository serves as a collection of research notes and resources on training large language models (LLMs) and Reinforcement Learning from Human Feedback (RLHF). It focuses on the latest research, methodologies, and techniques for fine-tuning language models.

Repository Contents

Reinforcement Learning and RLHF Overview

A curated list of materials providing an introduction to RL and RLHF:

  • Research papers and books covering key concepts in reinforcement learning.
  • Video lectures explaining the fundamentals of RLHF.

Methods for LLM Training

An extensive collection of state-of-the-art approaches for optimizing preferences and model alignment:

  • Key techniques such as PPO, DPO, KTO, ORPO, and more (a PPO loss sketch follows this list).
  • The latest ArXiv publications and publicly available implementations.
  • Analysis of effectiveness across different optimization strategies.
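As a point of reference for these objectives, below is a minimal sketch of the PPO clipped surrogate loss in PyTorch. The function name, tensor layout (per-token log-probabilities and advantages), and the default clip range are illustrative assumptions, not code taken from any of the implementations collected here.

```python
# Minimal sketch of the PPO clipped surrogate objective (illustrative only).
# Tensor names and the clip range are assumptions, not taken from any linked repo.
import torch

def ppo_clip_loss(logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_range: float = 0.2) -> torch.Tensor:
    """Clipped policy-gradient loss: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)]."""
    # Probability ratio between the current policy and the old (behavior) policy.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Take the pessimistic (minimum) surrogate and negate it for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```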

Purpose of this Repository

This repository is designed as a reference for researchers and engineers working on reinforcement learning and large language models. If you're interested in model alignment, experiments with DPO and its variants, or alternative RL-based methods, you will find valuable resources here.

RL overview

Methods for LLM training

Minimal implementation

  • DPO (a minimal loss sketch is shown below)
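For orientation, a minimal sketch of the DPO objective is given below. The interface (summed per-sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model) and the `beta` value are assumptions for illustration, not the linked implementation itself.

```python
# Minimal sketch of the DPO loss (Rafailov et al., 2023), illustrative only.
# The interface (per-sequence log-probs for chosen/rejected responses) is an assumption.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: -log sigma(beta * [policy log-ratio - reference log-ratio])."""
    # Log-ratios of chosen vs. rejected responses under the policy and the reference.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Encourage the policy to prefer chosen over rejected responses
    # more strongly than the frozen reference model does.
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
```

The log-ratio against the reference model acts as an implicit reward, which is what lets DPO optimize preferences without training a separate reward model or running an RL loop.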

Tutorials

  • Notes for learning RL: Value Iteration -> Q-Learning -> DQN -> REINFORCE -> Policy Gradient Theorem -> TRPO -> PPO (a value-iteration sketch follows this list)
  • RLHF training techniques explained
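As an entry point to that learning path, here is a minimal sketch of tabular value iteration. The MDP representation (dense transition and reward arrays `P` and `R`) and the discount factor are hypothetical choices for illustration.

```python
# Minimal sketch of tabular value iteration (illustrative; the MDP arrays are hypothetical).
import numpy as np

def value_iteration(P: np.ndarray,   # transition probabilities, shape (S, A, S)
                    R: np.ndarray,   # expected rewards, shape (S, A)
                    gamma: float = 0.99,
                    tol: float = 1e-8) -> np.ndarray:
    """Bellman optimality backup: V(s) <- max_a [R(s,a) + gamma * sum_s' P(s,a,s') V(s')]."""
    V = np.zeros(P.shape[0])
    while True:
        # Q(s, a) = R(s, a) + gamma * E_{s'}[V(s')]
        Q = R + gamma * (P @ V)       # shape (S, A)
        V_new = Q.max(axis=1)         # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```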

Training frameworks

RLHF method implementations (only those with detailed explanations)

Articles

Thinking process

  • Repos
  • Articles
  • Papers
  • Open-source project to reproduce DeepSeek R1
  • Datasets for thinking models
  • Evaluation and benchmarks