This repository serves as a collection of research notes and resources on training large language models (LLMs) and Reinforcement Learning from Human Feedback (RLHF). It focuses on the latest research, methodologies, and techniques for fine-tuning language models.
A curated list of materials providing an introduction to RL and RLHF:
- Research papers and books covering key concepts in reinforcement learning.
- Video lectures explaining the fundamentals of RLHF.
An extensive collection of state-of-the-art approaches for optimizing preferences and model alignment:
- Key techniques such as PPO, DPO, KTO, ORPO, and more (see the DPO loss sketch after this list).
- The latest ArXiv publications and publicly available implementations.
- Analysis of effectiveness across different optimization strategies.
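For readers new to these methods, here is a minimal, illustrative sketch of the DPO loss. It assumes per-sequence log-probabilities under the policy and a frozen reference model have already been computed; the function name, arguments, and numbers are placeholders, not tied to any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss sketch: push the policy's implicit reward margin
    (chosen minus rejected, measured against a frozen reference model)
    to be positive."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log sigmoid of the reward margin, averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up per-sequence log-probabilities (batch of 2)
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```

Here `beta` scales the implicit reward and controls how strongly the policy is pulled away from the reference model.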
This repository is designed as a reference for researchers and engineers working on reinforcement learning and large language models. If you're interested in model alignment, experiments with DPO and its variants, or alternative RL-based methods, you will find valuable resources here.
- Reinforcement Learning: An Overview
- A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More
- Book-Mathematical-Foundation-of-Reinforcement-Learning
- The FASTEST introduction to Reinforcement Learning on the internet
- rlhf-book
- Notes on reinforcement learning
- PPO - Proximal Policy Optimization Algorithm - OpenAI
- DPO - Direct Preference Optimization: Your Language Model is Secretly a Reward Model - Stanford
- online DPO
- KTO - KTO: Model Alignment as Prospect Theoretic Optimization
- SimPO - Simple Preference Optimization with a Reference-Free Reward - Princeton
- ORPO - Monolithic Preference Optimization without Reference Model - KAIST AI
- Sample Efficient Reinforcement Learning with REINFORCE
- REINFORCE++
- RPO Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment
- RLOO - Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
- GRPO - Group Relative Policy Optimization (see the group-relative advantage sketch after this list)
- ReMax - Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
- DPOP - Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
- BCO - Binary Classifier Optimization for Large Language Model Alignment
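Several of the entries above (REINFORCE++, RLOO, GRPO, ReMax) replace PPO's learned value function with baselines computed from sampled completions. Below is a minimal sketch of a GRPO-style group-relative advantage, assuming one scalar reward per sampled completion; the function name and shapes are illustrative only.

```python
import torch

def group_relative_advantages(rewards, group_size, eps=1e-6):
    """GRPO-style advantage sketch: sample several completions per prompt,
    then standardize each completion's reward by the mean and standard
    deviation of its own group instead of using a learned value function."""
    grouped = rewards.view(-1, group_size)          # (num_prompts, group_size)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    return ((grouped - mean) / (std + eps)).view(-1)

# Toy usage: 2 prompts, 4 sampled completions each, binary-ish rewards
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0, 0.0, 0.0, 1.0, 0.5])
print(group_relative_advantages(rewards, group_size=4))
```

Normalizing within each prompt's group means no separate critic network has to be trained, which is the main practical motivation behind these REINFORCE-style methods.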
Notes for learning RL: Value Iteration -> Q Learning -> DQN -> REINFORCE -> Policy Gradient Theorem -> TRPO -> PPO
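Since the path above ends at TRPO and PPO, it may help to keep PPO's clipped surrogate objective in view. This is the standard formulation from the PPO paper, with epsilon denoting the clip range:

```latex
% PPO's clipped surrogate objective (standard form from the PPO paper),
% where r_t(\theta) is the probability ratio between new and old policies
% and \hat{A}_t is an advantage estimate:
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[
    \min\!\left( r_t(\theta)\,\hat{A}_t,\;
                 \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right)
\right]
```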
- CS234: Reinforcement Learning Winter 2025
- CS285 Deep Reinforcement Learning
- Welcome to Spinning Up in Deep RL
- deep-rl-course from Huggingface
- RL Course by David Silver
- Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.
- Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math
- GRPO vs PPO
- Unraveling RLHF and Its Variants: Progress and Practical Engineering Insights
- DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL
- On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
- s1: Simple test-time scaling and s1.1
- The 37 Implementation Details of Proximal Policy Optimization
- Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead and github
- How to align open LLMs in 2025 with DPO & synthetic data
- DeepSeek-R1 -> The Illustrated DeepSeek-R1, DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs, DeepSeek R1 and R1-Zero Explained
2025.02.22
- Small Models Struggle to Learn from Strong Reasoners
- Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
- LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
- Open Reasoner Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
- ✨ LLM Reasoning: Curated Insights
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models
- SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
- ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
- A Minimalist Approach to Offline Reinforcement Learning
- Training Language Models to Reason Efficiently
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
- [R1 - distill] OpenR1-Math-220k
- [R1 - distill] s1K-1.1
- [R1 - distill] OpenThoughts-114k
- [R1 - distill] LIMO
- [R1 - distill] NuminaMath-CoT
- [Llama-70B - distill] natural_reasoning - license for non-commercial use
- Open Reasoning Data
- Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models