(WIP) A Little Bit of Reinforcement Learning from Human Feedback

A short introduction to RLHF and post-training focused on language models.

Nathan Lambert

Introduction

Reinforcement Learning from Human Feedback (RLHF) is a technique used to incorporate human information into AI systems. RLHF emerged primarily as a method to solve hard-to-specify problems. Its early applications were often in control problems and other traditional domains for reinforcement learning (RL). RLHF became most known through the release of ChatGPT and the subsequent rapid development of large language models (LLMs) and other foundation models.

The basic pipeline for RLHF involves three steps. First, a language model that can follow user questions must be trained (see Chapter 9). Second, human preference data must be collected to train a reward model of human preferences (see Chapter 7). Finally, the language model can be optimized with an RL optimizer of choice, by sampling generations and rating them with respect to the reward model (see Chapters 3 and 11). This book details key decisions and basic implementation examples for each step in this process.
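
To make the shape of this pipeline concrete, here is a minimal, self-contained sketch in Python. The policy and reward model are toy placeholders (a canned list of responses and a length-based score), not real models or any specific library's API; a real implementation would swap in a language model, a learned reward model, and an RL optimizer such as PPO.

```python
import random

# Step 1: start from a model that can follow user questions (Chapter 9).
# Here the "policy" is a toy that samples from canned responses.
CANNED_RESPONSES = [
    "Paris.",
    "The capital of France is Paris.",
    "I'm not sure, sorry!",
]

def policy_sample(prompt: str) -> str:
    """Toy stand-in for sampling a generation from the language model."""
    return random.choice(CANNED_RESPONSES)

# Step 2: a reward model trained on human preference data (Chapter 7).
# This placeholder scores responses by length as a fake preference signal.
def reward_model(prompt: str, response: str) -> float:
    return float(len(response))

# Step 3: the RL loop (Chapters 3 and 11): sample generations, score them with
# the reward model, and hand the rewards to an RL optimizer (e.g. PPO) that
# updates the policy. The policy update itself is omitted in this sketch.
prompt = "What is the capital of France?"
for step in range(3):
    generation = policy_sample(prompt)
    reward = reward_model(prompt, generation)
    print(f"step={step} reward={reward:5.1f} generation={generation!r}")
```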

RLHF has been applied to many domains successfully, with complexity increasing as the techniques have matured. Early breakthrough experiments with RLHF were applied to deep reinforcement learning [1], summarization [2], following instructions [3], parsing web information for question answering [4], and “alignment” [5].

In modern language model training, RLHF is one component of post-training. Post-training is a more complete set of techniques and best practices to make language models more useful for downstream tasks [6]. Post-training can be summarized as using three optimization methods, illustrated with a small sketch after this list:

  1. Instruction / Supervised Finetuning (IFT/SFT), where we teach formatting and the foundation of instruction-following abilities. This is largely about learning features in language.
  2. Preference Finetuning (PreFT), where we align the model to human preferences (and get a small bump in capabilities at the same time). This is largely about the style of language and subtle human preferences that are hard to quantify.
  3. Reinforcement Finetuning (RFT), the newest type of post-training, which boosts performance on verifiable domains.
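
As a rough illustration of how these stages are typically ordered and what data each consumes, here is a hypothetical recipe written as plain Python data. The stage names and field values are illustrative only, not a specific library's schema.

```python
# A hypothetical post-training recipe, sketched as plain data to show the
# typical ordering of stages and the kind of data each one consumes.
post_training_recipe = [
    {"stage": "SFT", "data": "instruction-response pairs",
     "goal": "formatting and instruction following"},
    {"stage": "PreFT", "data": "chosen/rejected response pairs",
     "goal": "style and subtle human preferences (e.g. via RLHF or DPO)"},
    {"stage": "RFT", "data": "prompts with verifiable answers",
     "goal": "performance on verifiable domains"},
]

for stage in post_training_recipe:
    print(f"{stage['stage']}: train on {stage['data']} -> {stage['goal']}")
```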

This book focuses on the second area, preference finetuning, which has more complexity than instruction tuning and is far more established than Reinforcement Finetuning. That being said, RLHF colloquially is what led to modern post-training: soon after the release of ChatGPT, RLHF encompassed nearly all of post-training. The foundations of RLHF involve far more than preferences alone, and this book provides introductions to all the related topics.

How We Got Here

Why does this book make sense now? How much still will change?

Post-training, the craft of eliciting powerful behaviors from a raw pretrained language model, has gone through many seasons and moods since the release of ChatGPT sparked the renewed interest in RLHF. In the era of Alpaca [7], Vicuna [8], Koala [9], and Dolly [10], a limited number of human datapoints, extended with synthetic data in the style of Self-Instruct, were used to fine-tune the original LLaMA to get behavior similar to ChatGPT. The benchmark for these early models was fully vibes (and human evaluation), as we were all so captivated by the fact that such small models could have such impressive behaviors across domains. It was justified excitement.

Open post-training was moving faster, releasing more models, and making more noise than its closed counterparts. Companies were scrambling to respond, e.g. DeepMind merging with Google or new labs being started, and took time to follow up. There have been phases of open recipes surging ahead and then lagging behind.

The era following Alpaca et al., the first lag in open recipes, was one defined by skepticism and doubt about reinforcement learning from human feedback (RLHF), the technique OpenAI highlighted as crucial to the success of the first ChatGPT. Many companies doubted that they needed to do RLHF. A common phrase – “instruction tuning is enough for alignment” – was so popular then that it still holds heavy weight today despite clear evidence against it.

This doubt about RLHF lasted, especially in the open, where groups cannot afford data budgets on the order of $100K to $1M. The companies that embraced it early ended up winning out. Anthropic published extensive research on RLHF through 2022 and is now argued to have the best post-training [11] [5] [12]. The delta between open groups, struggling to reproduce or even learn of basic closed techniques, and the closed labs is a common theme.

The first shift in open alignment methods and post-training was the story of Direct Preference Optimization (DPO) [13]. The DPO paper, posted in May of 2023, did not yield any clearly impactful models trained with it until the fall of 2023. This changed with the releases of a few breakthrough DPO models – all contingent on finding a better, lower learning rate. Zephyr-Beta [14], Tülu 2 [15], and many other models showed that the DPO era of post-training had begun. Chris Manning literally thanked me for “saving DPO.” This is how fine the margins are on evolutions of best practices while the leading labs stay locked down. Open post-training was cruising again.

Since late 2023, preference-tuning has been table stakes for releasing a good model. The DPO era continued through 2024, in the form of never-ending variants on the algorithm, but we were very far into another slump in open recipes. Open post-training recipes had saturated the extent of knowledge and resources available.
A year after Zephyr and Tülu 2, the same breakout dataset, UltraFeedback, is arguably still state-of-the-art for preference tuning in open recipes [16].

At the same time, the Llama 3.1 [17] and Nemotron 4 340B [18] reports gave us substantive hints that large-scale post-training is much more complex and impactful. The closed labs are doing full post-training – a large multi-stage process of instruction tuning, RLHF, prompt design, etc. – while academic papers are just scratching the surface. Tülu 3 represented a comprehensive, open effort to build the foundation of future academic post-training research [6].

Today, post-training is a complex process involving the aforementioned training objectives applied in various orders to target specific capabilities. This book is designed to give a platform for understanding all of these techniques, and in the coming years the best practices for how to interleave them will emerge.

The primary areas of innovation in post-training are now in reinforcement finetuning, reasoning training, and related ideas. These newer methods build extensively on the infrastructure and ideas of RLHF, but are evolving far faster. This book is written to capture the first stable literature for RLHF after its initial period of rapid change.

Scope of This Book

This book hopes to touch on each of the core steps of doing canonical RLHF implementations. It will not cover the full history of the components nor recent research methods, just the techniques, problems, and trade-offs that have proven to occur again and again.

Chapter Summaries

While the book is a work in progress, see the navigation for up-to-date chapters.

This book has the following chapters after this Introduction:

Introductions:

  1. Introduction
  2. What are preferences?: The philosophy and social sciences behind RLHF.
  3. Optimization and RL: The problem formulation of RLHF.
  4. Seminal (Recent) Works: The core works leading to and following ChatGPT.

Problem Setup:

  1. Definitions: Mathematical reference.
  2. Preference Data: Gathering human data of preferences.
  3. Reward Modeling: Modeling human preferences for environment signal.
  4. Regularization: Numerical tricks to stabilize and guide optimization.

Optimization:

  1. Instruction Tuning: Fine-tuning models to follow instructions.
  2. Rejection Sampling: Basic method for using a reward model to filter data.
  3. Policy Gradients: Core RL methods used to perform RLHF.
  4. Direct Alignment Algorithms: New PreFT algorithms that do not need RL.

Advanced (TBD):

  1. Constitutional AI
  2. Synthetic Data
  3. Evaluation
  4. Reasoning and Reinforcement Finetuning

Open Questions (TBD):

  1. Over-optimization
  2. Style

Target Audience

This book is intended for audiences with entry-level experience with language modeling, reinforcement learning, and general machine learning. It will not have exhaustive documentation of all the techniques, just those crucial to understanding RLHF.

How to Use This Book

This book was largely created because there were no canonical references for important topics in the RLHF workflow. The contributions of this book are supposed to give you the minimum knowledge needed to try a toy implementation or dive into the literature. This is not a comprehensive textbook, but rather a quick book for reminders and getting started. Additionally, given the web-first nature of this book, it is expected that there are minor typos and somewhat random progressions – please contribute by fixing bugs or suggesting important content on GitHub.

About the Author

Dr. Nathan Lambert is an RLHF researcher contributing to the open science of language model fine-tuning. He has released many models trained with RLHF, their subsequent datasets, and training codebases in his time at the Allen Institute for AI (Ai2) and HuggingFace. Examples include Zephyr-Beta, Tülu 2, OLMo, TRL, Open Instruct, and many more. He has written extensively on RLHF, including many blog posts and academic papers.

Future of RLHF

With the investment in language modeling, many variations on the traditional RLHF methods emerged. RLHF colloquially has become synonymous with multiple overlapping approaches. RLHF is a subset of preference fine-tuning (PreFT) techniques, including Direct Alignment Algorithms (See Chapter 12). RLHF is the tool most associated with rapid progress in “post-training” of language models, which encompasses all training after the large-scale autoregressive training on primarily web data. This textbook is a broad overview of RLHF and its directly neighboring methods, such as instruction tuning and other implementation details needed to set up a model for RLHF training.

As more successes of fine-tuning language models with RL emerge, such as OpenAI’s o1 reasoning models, RLHF will be seen as the bridge that enabled further investment in RL methods for fine-tuning large base models.

Bibliography

[1]
P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Advances in neural information processing systems, vol. 30, 2017.
[2]
N. Stiennon et al., “Learning to summarize with human feedback,” Advances in Neural Information Processing Systems, vol. 33, pp. 3008–3021, 2020.
[3]
L. Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022.
[4]
R. Nakano et al., “Webgpt: Browser-assisted question-answering with human feedback,” arXiv preprint arXiv:2112.09332, 2021.
[5]
Y. Bai et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,” arXiv preprint arXiv:2204.05862, 2022.
[6]
N. Lambert et al., “Tülu 3: Pushing frontiers in open language model post-training,” arXiv preprint arXiv:2411.15124, 2024.
[7]
R. Taori et al., “Stanford alpaca: An instruction-following LLaMA model,” GitHub repository. https://github.com/tatsu-lab/stanford_alpaca; GitHub, 2023.
[8]
W.-L. Chiang et al., “Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.” 2023. Available: https://lmsys.org/blog/2023-03-30-vicuna/
[9]
X. Geng et al., “Koala: A dialogue model for academic research.” Blog post, 2023. Accessed: Apr. 03, 2023. [Online]. Available: https://bair.berkeley.edu/blog/2023/04/03/koala/
[10]
M. Conover et al., “Hello dolly: Democratizing the magic of ChatGPT with open models.” Accessed: Jun. 30, 2023. [Online]. Available: https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html
[11]
A. Askell et al., “A general language assistant as a laboratory for alignment,” arXiv preprint arXiv:2112.00861, 2021.
[12]
Y. Bai et al., “Constitutional ai: Harmlessness from ai feedback,” arXiv preprint arXiv:2212.08073, 2022.
[13]
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[14]
L. Tunstall et al., “Zephyr: Direct distillation of lm alignment,” arXiv preprint arXiv:2310.16944, 2023.
[15]
H. Ivison et al., “Camels in a changing climate: Enhancing lm adaptation with tulu 2,” arXiv preprint arXiv:2311.10702, 2023.
[16]
G. Cui et al., “Ultrafeedback: Boosting language models with high-quality feedback,” 2023.
[17]
A. Dubey et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
[18]
B. Adler et al., “Nemotron-4 340B technical report,” arXiv preprint arXiv:2406.11704, 2024.