A Little Bit of Reinforcement Learning from Human Feedback

A short introduction to RLHF and post-training focused on language models.

Nathan Lambert


Training Overview

Problem Formulation

The optimization of reinforcement learning from human feedback (RLHF) builds on top of the standard RL setup. In RL, an agent takes actions, \(a\), sampled from a policy, \(\pi\), with respect to the state of the environment, \(s\), to maximize reward, \(r\) [1]. Traditionally, the environment evolves with respect to a transition or dynamics function \(p(s_{t+1}|s_t,a_t)\). Hence, across an episode, the goal of an RL agent is to solve the following optimization:

\[J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right],\qquad{(1)}\]

where \(\gamma\) is a discount factor between 0 and 1 that balances the desirability of near-term versus future rewards. Multiple methods for optimizing this expression are discussed in Chapter 11.
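To make eq. 1 concrete, the discounted return of a single sampled trajectory is just the reward sequence weighted by powers of \(\gamma\). Below is a minimal Python sketch with made-up rewards and a made-up discount factor; \(J(\pi)\) is the expectation of this quantity over trajectories sampled from the policy.

```python
# Discounted return of one trajectory, as in eq. 1.
# The rewards and discount factor are illustrative placeholders.
rewards = [0.0, 0.0, 1.0, 0.5]  # r(s_t, a_t) collected along one episode
gamma = 0.99                    # discount factor between 0 and 1

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)  # 0.0 + 0.0 + 0.99**2 * 1.0 + 0.99**3 * 0.5
```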

Figure 1: Standard RL loop

A standard illustration of the RL loop is shown in fig. 1; fig. 2 shows how the RLHF setup compares.

Manipulating the Standard RL Setup

There are multiple core changes from the standard RL setup to that of RLHF:

  1. Switching from a reward function to a reward model. In RLHF, a learned model of human preferences, \(r_\theta(s_t, a_t)\) (or any other classification model) is used instead of an environmental reward function. This gives the designer a substantial increase in the flexibility of the approach and control over the final results.
  2. No state transitions exist. In RLHF, the initial states for the domain are prompts sampled from a training dataset and the “action” is the completion to said prompt. In standard practice, this action does not impact the next state and is only scored by the reward model.
  3. Response-level rewards. Often framed as a bandit problem, RLHF attributes reward to an entire sequence of actions, composed of multiple generated tokens, rather than in a fine-grained, per-token manner.

Given the single-turn nature of the problem, the optimization can be re-written without the time horizon and discount factor, and with the reward model substituted in: \[J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[r_\theta(s_t, a_t) \right].\qquad{(2)}\]

In many ways, the result is that while RLHF is heavily inspired by RL optimizers and problem formulations, the implementation in practice is very distinct from traditional RL.
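To illustrate how different this is from a standard RL loop, a single RLHF “rollout” reduces to sampling a completion for a prompt and scoring it once with the reward model. The sketch below is a minimal Python outline under assumed interfaces: `policy.generate` and `reward_model.score` are hypothetical stand-ins, not any particular library's API.

```python
# Minimal sketch of one RLHF "step": no state transition, one response-level reward.
# `policy` and `reward_model` are hypothetical objects with assumed interfaces.

def rlhf_rollout(policy, reward_model, prompt: str):
    # The prompt plays the role of the initial (and only) state s.
    completion = policy.generate(prompt)             # the "action": a whole token sequence
    reward = reward_model.score(prompt, completion)  # one scalar for the entire response
    return completion, reward

def estimate_objective(policy, reward_model, prompts):
    # Monte Carlo estimate of eq. 2, J(pi) = E[r_theta(s, a)], over a batch of prompts.
    rewards = [rlhf_rollout(policy, reward_model, p)[1] for p in prompts]
    return sum(rewards) / len(rewards)
```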

Figure 2: Standard RLHF loop

Optimization Tools

In this book, we detail many popular techniques for solving this optimization problem. The popular tools of post-training include:

  1. Instruction finetuning (supervised finetuning) on example responses;
  2. Rejection sampling, where completions are filtered with a reward model before further finetuning;
  3. Policy-gradient reinforcement learning algorithms that optimize the model against a reward model;
  4. Direct alignment algorithms that optimize the policy directly from pairwise preference data.

Modern RLHF-trained models always utilize instruction finetuning followed by a mixture of the other optimization options.

RLHF Recipe Example

The canonical RLHF recipe circa the release of ChatGPT followed a standard three-step post-training recipe where RLHF was the centerpiece [2] [3] [4]. The three steps taken on top of a “base” language model (the next-token prediction model trained on large-scale web text) were:

  1. Instruction tuning on ~10K examples: This teaches the model to follow the question-answer format and imparts some basic skills, primarily from human-written data.
  2. Training a reward model on ~100K pairwise comparisons: This model is trained from the instruction-tuned checkpoint and captures the diverse values one wishes to express in the final model. The reward model is the optimization target for RLHF.
  3. Training the instruction-tuned model with RLHF on another ~100K prompts: The model generates completions to these prompts and is optimized against the scores the reward model assigns to them.

Once RLHF was done, the model was ready to be deployed to users. This recipe is the foundation of modern RLHF, but recipes have evolved substantially to include more stages and more data.
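For readers who prefer a structural view, the recipe above can be summarized as a simple configuration outline. This is only a sketch of the stages described in this section; the stage names and field names below are illustrative, not drawn from any specific codebase.

```python
# Sketch of the canonical three-stage post-training recipe as a configuration outline.
# Stage names, field names, and dataset sizes mirror the description above and are
# illustrative only.
canonical_rlhf_recipe = [
    {"stage": "instruction_tuning",
     "init_from": "base_model",
     "data": "~10K human-written prompt-response examples",
     "goal": "teach the question-answer format and basic skills"},
    {"stage": "reward_model_training",
     "init_from": "instruction_tuned_model",
     "data": "~100K pairwise preference comparisons",
     "goal": "learn a scalar proxy for human preferences"},
    {"stage": "rlhf_optimization",
     "init_from": "instruction_tuned_model",
     "data": "~100K prompts, with completions scored by the reward model",
     "goal": "optimize the policy against the reward model"},
]
```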

Finetuning and Regularization

RLHF is implemented from a strong base model, which creates a need to keep the optimization from straying too far from the initial policy. In order to succeed in a finetuning regime, RLHF techniques employ multiple types of regularization to control the optimization. The most common change to the optimization function is to add a distance penalty on the difference between the current RLHF policy and the starting point of the optimization:

\[J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[r_\theta(s_t, a_t)\right] - \beta \mathcal{D}_{KL}(\pi^{\text{RL}}(\cdot|s_t) \| \pi^{\text{ref}}(\cdot|s_t)).\qquad{(3)}\]
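In practice, the KL term in eq. 3 is often estimated from the sampled completion itself, using the log-probabilities that the current policy and the frozen reference model assign to its tokens, and folded directly into the reward. The following is a minimal PyTorch sketch under those assumptions; the tensor shapes, the single-sample KL estimator, and the default \(\beta\) are illustrative choices, not a prescribed implementation.

```python
import torch

def kl_penalized_reward(reward: torch.Tensor,
                        policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Sketch of eq. 3: subtract a KL penalty from the reward-model score.

    reward:          (batch,) reward-model scores, one per completion
    policy_logprobs: (batch, seq_len) log pi_RL of the sampled tokens
    ref_logprobs:    (batch, seq_len) log pi_ref of the same tokens
    beta:            regularization strength (illustrative default)
    """
    # Summing the per-token log-ratio over the sampled completion gives a
    # single-sample Monte Carlo estimate of KL(pi_RL || pi_ref).
    kl_estimate = (policy_logprobs - ref_logprobs).sum(dim=-1)
    return reward - beta * kl_estimate
```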

Within this formulation, much of the study of RLHF training goes into understanding how to spend a given “KL budget,” measured as a distance from the initial model. For more details, see Chapter 8 on Regularization.

Bibliography

[1]
R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, 2nd ed. MIT Press, 2018.
[2]
N. Lambert, L. Castricato, L. von Werra, and A. Havrilla, “Illustrating reinforcement learning from human feedback (RLHF),” Hugging Face Blog, 2022.
[3]
L. Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[4]
Y. Bai et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,” arXiv preprint arXiv:2204.05862, 2022.