A Little Bit of Reinforcement Learning from Human Feedback

A short introduction to RLHF and post-training focused on language models.

Nathan Lambert

Direct Alignment Algorithms

Direct Alignment Algorithms (DAAs) allow one to update models to solve the same RLHF objective without ever training an intermediate reward model or using reinforcement learning optimizers. The most prominent DAA, and the one that catalyzed an entire academic movement of aligning language models, is Direct Preference Optimization (DPO) [1]. At its core, DPO uses gradient ascent to solve the same constrained RLHF objective. Since its release in May of 2023, after a brief delay while the community figured out the right data and hyperparameters to use with DPO (most notably, surprisingly low learning rates), many popular models have used DPO or its variants, from Zephyr-\(\beta\) kickstarting the trend in October of 2023 [2], to Llama 3 Instruct [3], Tülu 2 [4] and 3 [5], Nemotron 4 340B [6], and others. Technically, Sequence Likelihood Calibration (SLiC-HF) was released first [7], but it did not catch on, owing to a combination of luck and effectiveness.

The most impactful part of DPO and DAAs is lowering the barrier of entry to experimenting with language model post-training.

Direct Preference Optimization (DPO)

Here we explain the intuitions for how DPO works and fully re-derive its core equations.

How DPO Works

DPO at a surface level is directly optimizing a policy to solve the RLHF objective. The loss function for this, which we will revisit below in the derivations, is a pairwise relationship of log-probabilities, where \(y_c\) denotes the chosen (preferred) completion and \(y_r\) the rejected completion for a prompt \(x\). The loss function derived from a Bradley-Terry reward model follows:

\[ \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_c, y_r) \sim \mathcal{D}}\left[ \log \sigma\left( \beta \log \frac{\pi_{\theta}(y_c \mid x)}{\pi_{\text{ref}}(y_c \mid x)} - \beta \log \frac{\pi_{\theta}(y_r \mid x)}{\pi_{\text{ref}}(y_r \mid x)} \right) \right] \qquad{(1)}\]

This relies on the implicit reward for DPO training that replaces using an external reward model, which is a log-ratio of probabilities:

\[r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\qquad{(2)}\]

This comes from deriving the Bradley-Terry reward with respect to an optimal policy (shown below in eq. 16), following the Bradley-Terry model covered in Chapter 7 on Reward Modeling. Essentially, the implicit reward model expresses “the probability of human preference data in terms of the optimal policy rather than the reward model.”

Let us consider the loss shown in eq. 1. The learning process decreases the loss. The loss is lower when the log-ratio of the chosen response is bigger than the log-ratio of the rejected response (each normalized by the reference model). In practice, each of these log-probabilities is a sum over the tokens of the completion in the data presented. Hence, DPO increases the delta in probabilities between the chosen and rejected responses.
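Concretely, each policy log-probability in eq. 1 is the sum of per-token log-probabilities across the completion:

\[ \log \pi_\theta(y \mid x) = \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}) \]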

With the reward in eq. 2, we can write the gradient of the loss to further interpret what is going on:

\[\nabla_{\theta}\mathcal{L}_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) = -\beta \mathbb{E}_{(x, y_c, y_r)\sim \mathcal{D}}\left[ \sigma\left(r_{\theta}(x, y_r) - r_{\theta}(x, y_c)\right) \left(\nabla_{\theta}\log \pi(y_c \mid x) - \nabla_{\theta}\log \pi(y_r \mid x)\right) \right] \qquad{(3)}\]

Here, the gradient solves the above objective by doing the following:

  1. Increasing the (log-)probability of the chosen completion \(y_c\),
  2. Decreasing the (log-)probability of the rejected completion \(y_r\),
  3. Scaling each update by \(\sigma\left(r_{\theta}(x, y_r) - r_{\theta}(x, y_c)\right)\), i.e. by how incorrectly the implicit reward model orders the pair: pairs the model already ranks correctly contribute small updates, while pairs it ranks incorrectly contribute large ones.

The core intuition is that DPO is “fitting an implicit reward model whose corresponding optimal policy can be extracted in a closed form” (thanks to gradient ascent and our ML tools). What is often misunderstood is that DPO is, at its core, learning a reward model, hence the subtitle of the paper Your Language Model is Secretly a Reward Model. It is easy to mistake the DPO objective for training a policy directly, so studying the derivations below is valuable for a complete understanding.
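To make the scaling term in eq. 3 concrete, here is a small numeric illustration; the implicit reward values are arbitrary placeholders, not from any real model.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical implicit rewards r_theta(x, y) for a single preference pair
r_chosen, r_rejected = 2.0, -1.0           # model already orders the pair correctly
print(sigmoid(r_rejected - r_chosen))      # ~0.05: small gradient scaling

r_chosen, r_rejected = -1.0, 2.0           # model orders the pair incorrectly
print(sigmoid(r_rejected - r_chosen))      # ~0.95: large gradient scaling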

Through the implicit reward model, DPO generates the optimal solution to the RLHF objective given the data in the dataset and the specific KL constraint set by \(\beta\). Here, DPO solves for the exact policy at a given KL distance because the generations are not sampled online as in policy gradient algorithms – a core difference from the RL methods for preference tuning. In many ways, this makes the \(\beta\) value easier to tune with DPO relative to online RL methods, but crucially and intuitively the optimal value still depends on the model being trained and the data training it.

At each batch of preference data, composed of many pairs of completions \(y_{chosen} \succ y_{rejected}\), DPO takes gradient steps directly towards the optimal solution. It is far simpler than policy gradient methods.

Figure 1: DPO simplicity meme, credit Tom Goldstein.

DPO Derivation

The DPO derivation takes two primary parts. First, the authors show the form of the policy that optimally solves the RLHF objective used throughout this book. Next, they show how to arrive at that solution from pairwise preference data (i.e., a Bradley-Terry model).

1. Deriving the Optimal RLHF Solution

To start, we should consider the RLHF optimization objective once again, here indicating we wish to maximize this quantity:

\[ \max_{\pi} \mathbb{E}_{\tau \sim \pi} \left[r_\theta(s_t, a_t)\right] - \beta \mathcal{D}_{KL}(\pi^{\text{RL}}(\cdot|s_t) \| \pi^{\text{ref}}(\cdot|s_t)).\qquad{(4)}\]

First, recall the definition of the KL divergence, \(\mathcal{D}_{KL}(P \| Q) = \mathbb{E}_{z \sim P}\left[\log \frac{P(z)}{Q(z)}\right]\). Expanding it, and switching to the prompt-completion notation (prompts \(x\), completions \(y\)) used in the rest of the derivation, gives:

\[\max_{\pi} \mathbb{E}_{x \sim \mathcal{D}}\mathbb{E}_{y \sim \pi(y|x)}\left[r(x,y)-\beta\log\frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}\right] \qquad{(5)}\]

Next, split the expectation into two terms and pull \(\beta\) (along with the negative sign) out of the second:

\[ = \max_{\pi}\left(\mathbb{E}_{x \sim \mathcal{D}}\mathbb{E}_{y \sim \pi(y|x)}[r(x,y)] - \beta\,\mathbb{E}_{x \sim \mathcal{D}}\mathbb{E}_{y \sim \pi(y|x)}\left[\log\frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}\right]\right) \qquad{(6)}\]

Then, multiply by \(-1\), which converts the maximization into a minimization:

\[ = \min_{\pi}\left(-\mathbb{E}_{x \sim \mathcal{D}}\mathbb{E}_{y \sim \pi(y|x)}[r(x,y)] + \beta\,\mathbb{E}_{x \sim \mathcal{D}}\mathbb{E}_{y \sim \pi(y|x)}\left[\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}\right]\right) \qquad{(7)}\]

Divide by \(\beta\) and recombine:

\[ = \min_{\pi}\left(\mathbb{E}_{x \sim \mathcal{D}}\mathbb{E}_{y \sim \pi(y|x)}\left[ \log\frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} - \frac{1}{\beta}r(x,y) \right]\right) \qquad{(8)}\]

Next, we must introduce a partition function, \(Z(x)\):

\[ Z(x) = \sum_y \pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right) \qquad{(9)}\]

The partition function acts as a normalization factor over the reference policy, summing over all possible responses \(y\) to a prompt \(x\). With this substituted in, we obtain our intermediate transformation:

\[ \min_{\pi}\mathbb{E}_{x\sim\mathcal{D}}\mathbb{E}_{y\sim\pi(y|x)}\left[\log\frac{\pi(y|x)}{\frac{1}{Z(x)}\pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)} - \log Z(x)\right] \qquad{(10)}\]

To see how this is obtained, consider the internal part of the optimization in brackets of eq. 8:

\[ \log\frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} - \frac{1}{\beta}r(x,y) \qquad{(11)}\]

Then, add \(\log Z(x) - \log Z(x) = 0\) to the expression:

\[ = \log\frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} - \frac{1}{\beta}r(x,y) + \log Z(x) - \log Z(x) \qquad{(12)}\]

Then, we group the terms:

\[ = \left( \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} + \log Z(x) \right) - \log Z(x) - \frac{1}{\beta}r(x,y) \qquad{(13)}\]

With \(\log(x) + \log(y) = \log(x\cdot y)\) (and moving \(Z\) to the denominator), we get:

\[ = \log \frac{\pi(y|x)}{\frac{1}{Z(x)}\pi_{\text{ref}}(y|x)}- \log Z(x) - \frac{1}{\beta}r(x,y) \qquad{(14)}\]

Next, we expand \(\frac{1}{\beta}r(x,y)\) to \(\log \exp\left(\frac{1}{\beta}r(x,y)\right)\) and fold it into the denominator of the logarithm in the same way, which yields eq. 10. With the optimization in this form, we can now solve for the optimal policy \(\pi^*\). To do so, consider the above optimization as a KL distance:

\[ \min_{\pi}\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{D}_\text{KL} \left(\pi(y|x)||\frac{1}{Z(x)}\pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right) \right) - \log Z(x)\right] \qquad{(15)}\]

Since the partition function \(Z(x)\) does not depend on the policy \(\pi\), the \(-\log Z(x)\) term can be ignored in the minimization. What remains is the KL distance between the policy we are learning and a distribution defined by the reference policy, the reward, and \(\beta\); note that \(Z(x)\) is exactly the factor that normalizes \(\pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right)\) into a valid probability distribution. Gibbs' inequality tells us that this KL distance is minimized, at a value of 0, only when the two distributions are equal! Hence, we get an optimal policy:

\[ \pi^*(y|x) = \pi(y|x) = \frac{1}{Z(x)}\pi_{\text{ref}}(y|x)\exp\left(\frac{1}{\beta}r(x,y)\right) \qquad{(16)}\]

2. Deriving DPO Objective for Bradley Terry Models

To start, recall from Chapter 7 on Reward Modeling and Chapter 6 on Preference Data that a Bradley-Terry model of human preferences is formed as:

\[p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(r^*(x, y_1)\right)}{\exp\left(r^*(x, y_1)\right) + \exp\left(r^*(x, y_2)\right)} \qquad{(17)}\]

Taking the logarithm of both sides of eq. 16 and rearranging with some algebra, one can express the reward in terms of the optimal policy, which is the DPO implicit reward:

\[r^*(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)\qquad{(18)}\]
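For completeness, the algebra is short. Taking the logarithm of both sides of eq. 16 and solving for the reward:

\[\begin{aligned}
\log \pi^*(y \mid x) &= \log \pi_{\text{ref}}(y \mid x) - \log Z(x) + \frac{1}{\beta} r^*(x, y) \\
\frac{1}{\beta} r^*(x, y) &= \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \log Z(x) \\
r^*(x, y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
\end{aligned}\]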

We then can substitute the reward into the Bradley-Terry equation shown in eq. 17 to obtain:

\[p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} + \beta \log Z(x)\right)} {\exp\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} + \beta \log Z(x)\right) + \exp\left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} + \beta \log Z(x)\right)} \qquad{(19)}\]

By decomposing the exponential expressions from \(e^{a+b}\) to \(e^a e^b\) and then cancelling out the terms \(e^{\log(Z(x))}\), this simplifies to:

\[p^*(y_1 \succ y_2 \mid x) = \frac{\exp\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}\right)} {\exp\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}\right) + \exp\left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)}\right)} \qquad{(20)}\]

Then, multiply the numerator and denominator by \(\exp\left(-\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}\right)\) to obtain:

\[p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} - \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}\right)} \qquad{(21)}\]

Finally, with the definition of a sigmoid function as \(\sigma(x) = \frac{1}{1+e^{-x}}\), we obtain:

\[p^*(y_1 \succ y_2 \mid x) = \sigma\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} - \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)}\right) \qquad{(22)}\]

Taking the negative logarithm of this probability and averaging over the preference dataset yields the DPO loss function shown in eq. 1 (with \(y_1 = y_c\) and \(y_2 = y_r\)). The DPO paper has an additional derivation for the objective under a Plackett-Luce model, which is far less used in practice [1].
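As a quick sanity check on the simplification from eq. 20 to eq. 22, the two forms can be compared numerically. This is a minimal sketch; the log-ratio values and \(\beta\) below are arbitrary placeholders.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.1
log_ratio_1 = 1.3    # stand-in for log pi*(y_1|x) / pi_ref(y_1|x)
log_ratio_2 = -0.4   # stand-in for log pi*(y_2|x) / pi_ref(y_2|x)

# eq. 20: softmax form of the Bradley-Terry probability
p_bt = math.exp(beta * log_ratio_1) / (
    math.exp(beta * log_ratio_1) + math.exp(beta * log_ratio_2)
)
# eq. 22: sigmoid of the scaled difference
p_sigma = sigmoid(beta * (log_ratio_1 - log_ratio_2))

assert abs(p_bt - p_sigma) < 1e-12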

3. Deriving the Bradley Terry DPO Gradient

We used the DPO gradient shown in eq. 3 to explain intuitions for how the model learns. To derive it, we take the gradient of the loss in eq. 1 with respect to the model parameters:

\[\nabla_{\theta}\mathcal{L}_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) = -\nabla_{\theta}\mathbb{E}_{(x,y_c,y_r)\sim\mathcal{D}}\left[ \log \sigma\left(\beta \log \frac{\pi_{\theta}(y_c|x)}{\pi_{\text{ref}}(y_c|x)} - \beta \log \frac{\pi_{\theta}(y_r|x)}{\pi_{\text{ref}}(y_r|x)}\right)\right] \qquad{(23)}\]

To start, recall the following facts: the derivative of the sigmoid function is \(\frac{d}{dx} \sigma(x) = \sigma(x)(1-\sigma(x))\), the derivative of the logarithm is \(\frac{d}{dx} \log x = \frac{1}{x}\), and the sigmoid satisfies \(\sigma(-x)=1-\sigma(x)\). With these, we can rewrite the equation above.

First, define the expression inside the sigmoid as \(u=\beta \log \frac{\pi_{\theta}(y_c|x)}{\pi_{\text{ref}}(y_c|x)} - \beta \log \frac{\pi_{\theta}(y_r|x)}{\pi_{\text{ref}}(y_r|x)}\). Then, we have

\[\nabla_{\theta}\mathcal{L}_{\text{DPO}}(\pi_{\theta};\pi_{\text{ref}}) = -\mathbb{E}_{(x, y_c, y_r)\sim \mathcal{D}}\left[\frac{\sigma'(u)}{\sigma(u)}\nabla_{\theta}u\right] \qquad{(24)}\]

Expanding this with \(\frac{\sigma'(u)}{\sigma(u)} = 1 - \sigma(u) = \sigma(-u)\) and \(\nabla_{\theta}u = \beta\left(\nabla_{\theta}\log \pi_{\theta}(y_c|x) - \nabla_{\theta}\log \pi_{\theta}(y_r|x)\right)\) results in the gradient introduced earlier:

\[ -\mathbb{E}_{(x,y_c,y_r)\sim\mathcal{D}}\left[\beta\sigma\left(\beta\log\frac{\pi_{\theta}(y_r|x)}{\pi_{\text{ref}}(y_r|x)} - \beta\log\frac{\pi_{\theta}(y_c|x)}{\pi_{\text{ref}}(y_c|x)}\right)\left[\nabla_{\theta}\log\pi(y_c|x)-\nabla_{\theta}\log\pi(y_r|x)\right]\right] \qquad{(25)}\]
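The key step above, \(\frac{d}{du}\left[-\log \sigma(u)\right] = -\sigma(-u)\), can also be checked numerically with autograd. This is a minimal sketch using PyTorch; the value of \(u\) is an arbitrary placeholder.

import torch
import torch.nn.functional as F

# u stands in for the scaled log-ratio margin inside the sigmoid
u = torch.tensor(0.7, requires_grad=True)

loss = -F.logsigmoid(u)   # per-example DPO loss as a function of u
loss.backward()

analytic = -torch.sigmoid(-u).detach()   # = -(1 - sigma(u))
assert torch.allclose(u.grad, analytic)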

Numerical Concerns, Weaknesses, and Alternatives

Many variants of the DPO algorithm have been proposed to address weaknesses of DPO. For example, without rollouts where a reward model can rate generations, DPO treats every pair of preference data with equal weight. In reality, as seen in Chapter 6 on Preference Data, there are many ways of capturing preference data with a richer label than binary. Multiple algorithms have been proposed to re-balance the optimization away from treating each pair equally.

Some variants of DPO attempt to improve the learning signal by making small changes to the loss, while others make training more efficient by reducing memory usage.

Figure 2: Sketch of preference displacement in DPO.

One of the core issues apparent in DPO is that the optimization only drives to increase the margin between the probability of the chosen response and that of the rejected response. Numerically, the model reduces the probability of both the chosen and rejected responses, but the rejected response is reduced by a greater extent, as shown in fig. 2. Intuitively, it is not clear how this generalizes, but work has posited that it shifts probability mass onto unintended behaviors [13] [14]. Simple methods, such as Cal-DPO [15], adjust the optimization so that this preference displacement does not occur. In practice, the exact impact of this is not well known, but it points to a potential reason why online methods can outperform vanilla DPO.

The other major reason posited for DPO-like methods having a lower performance ceiling than online (RL-based) RLHF methods is that the training signal comes from completions generated by previous or other models. Online variants that sample generations from the current model, e.g. Online DPO [16], and variants that regularly relabel newly generated completions with a reward model, e.g. Discriminator-Guided DPO (D2PO) [17], alleviate this by generating new completions for the prompt and incorporating a preference signal at training time.

There is a long list of other DAA variants, such as Direct Nash Optimization (DNO) [18] or Binary Classifier Optimization (BCO) [19], but the choice of algorithm is far less important than the initial model and the data used [5] [20] [21].

Implementation Considerations

DAAs such as DPO are implemented very differently than policy gradient optimizers. The DPO loss, taken from the original implementation, can largely be summarized as follows [1]:

import torch.nn.functional as F

# policy_*_logps and reference_*_logps are the summed log-probabilities of
# each chosen/rejected completion under the policy and the frozen reference
# model, each with shape (batch_size,)
pi_logratios = policy_chosen_logps - policy_rejected_logps
ref_logratios = reference_chosen_logps - reference_rejected_logps

logits = pi_logratios - ref_logratios  # also known as h_{\pi_\theta}^{y_w,y_l}

# eq. 1: negative log-sigmoid of the scaled margin
losses = -F.logsigmoid(beta * logits)

# implicit rewards (eq. 2), detached because they are used only for logging
chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()

This can be used in standard language model training stacks as this information is already collated during the forward pass of a model (with the addition of a reference model).

In most ways, this is simpler and a quality-of-life improvement, but DAAs also come with a different set of considerations.

  1. KL distance is static: In DPO and other algorithms, the KL distance is set explicitly by the \(\beta\) parameter that balances the distance penalty to the optimization. This is due to the fact that DPO takes gradient steps towards the optimal solution to the RLHF objective given the data – it steps exactly to the solution set by the \(\beta\) term. On the other hand, RL based optimizers take steps based on the batch and recent data.
  2. Caching log-probabilities: Simple implementations of DPO run the forward passes for the policy model and the reference model at the same time, for convenience with respect to the loss function. However, this doubles the memory used and increases overall GPU usage. To avoid this, one can precompute the log-probabilities of the reference model over the training dataset once, then look them up when computing the loss and updating the parameters per batch, reducing the peak memory usage by 50% (a minimal sketch of this is shown below).
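A minimal sketch of the second point follows, assuming a Hugging Face-style causal language model and a dataloader that yields tokenized chosen/rejected pairs; the function and field names (sequence_logprob, chosen_ids, and so on) are illustrative, not from a specific library.

import torch

@torch.no_grad()
def sequence_logprob(model, input_ids, completion_mask):
    # Summed log-probability of the completion tokens for each sequence.
    logits = model(input_ids).logits[:, :-1, :]   # logits predicting token t+1
    targets = input_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, 2, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logps * completion_mask[:, 1:]).sum(-1)

@torch.no_grad()
def cache_reference_logps(ref_model, dataloader):
    # One pass over the dataset; afterwards the reference model can be
    # removed from GPU memory entirely during DPO training.
    cache = []
    for batch in dataloader:
        chosen = sequence_logprob(ref_model, batch["chosen_ids"], batch["chosen_mask"])
        rejected = sequence_logprob(ref_model, batch["rejected_ids"], batch["rejected_mask"])
        cache.append((chosen.cpu(), rejected.cpu()))
    return cache

The cached values can then be stored alongside the dataset and substituted for reference_chosen_logps and reference_rejected_logps in the loss above.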

DAAs vs. RL: Online vs. Offline Data

Broadly, the argument boils down to one question: Do we need the inner workings of reinforcement learning, with value functions, policy gradients, and all, to align language models with RLHF? This, like most questions phrased this way, is overly simplistic. Of course, both methods are well-established, but it is important to illustrate where the fundamental differences and performance manifolds lie.

Multiple reports have concluded that policy-gradient and RL methods outperform DPO and its variants. The arguments take different forms, from training models with different algorithms but controlled data [22] [23] to studying the role of on-policy data within the RL optimization loop [24]. In all of these cases, DPO algorithms are a hair behind.

Even with this performance delta, DAAs are still used extensively in leading models due to their simplicity. DAAs provide a controlled environment where iterations on training data and other configurations can be made rapidly, and given that data is often far more important than algorithms, using DPO can be fine.

With the emergence of reasoning models that are primarily trained with RL, further investment will return to using RL for preference-tuning, which in the long-term will improve the robustness of RL infrastructure and cement this margin between DAAs and RL for optimizing from human feedback.

Bibliography

[1]
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[2]
L. Tunstall et al., “Zephyr: Direct distillation of lm alignment,” arXiv preprint arXiv:2310.16944, 2023.
[3]
A. Dubey et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
[4]
H. Ivison et al., “Camels in a changing climate: Enhancing lm adaptation with tulu 2,” arXiv preprint arXiv:2311.10702, 2023.
[5]
N. Lambert et al., “Tülu 3: Pushing frontiers in open language model post-training,” arXiv preprint arXiv:2411.15124, 2024.
[6]
B. Adler et al., “Nemotron-4 340B technical report,” arXiv preprint arXiv:2406.11704, 2024.
[7]
Y. Zhao, R. Joshi, T. Liu, M. Khalman, M. Saleh, and P. J. Liu, “Slic-hf: Sequence likelihood calibration with human feedback,” arXiv preprint arXiv:2305.10425, 2023.
[8]
Z. Gao et al., “Rebel: Reinforcement learning via regressing relative rewards,” arXiv preprint arXiv:2404.16767, 2024.
[9]
M. G. Azar et al., “A general theoretical paradigm to understand learning from human preferences,” in International conference on artificial intelligence and statistics, PMLR, 2024, pp. 4447–4455.
[10]
A. Amini, T. Vieira, and R. Cotterell, “Direct preference optimization with an offset,” arXiv preprint arXiv:2402.10571, 2024.
[11]
J. Hong, N. Lee, and J. Thorne, “Reference-free monolithic preference optimization with odds ratio,” arXiv e-prints, pp. arXiv–2403, 2024.
[12]
Y. Meng, M. Xia, and D. Chen, “Simpo: Simple preference optimization with a reference-free reward,” Advances in Neural Information Processing Systems, vol. 37, pp. 124198–124235, 2025.
[13]
N. Razin, S. Malladi, A. Bhaskar, D. Chen, S. Arora, and B. Hanin, “Unintentional unalignment: Likelihood displacement in direct preference optimization,” arXiv preprint arXiv:2410.08847, 2024.
[14]
Y. Ren and D. J. Sutherland, “Learning dynamics of llm finetuning,” arXiv preprint arXiv:2407.10490, 2024.
[15]
T. Xiao, Y. Yuan, H. Zhu, M. Li, and V. G. Honavar, “Cal-dpo: Calibrated direct preference optimization for language model alignment,” arXiv preprint arXiv:2412.14516, 2024.
[16]
S. Guo et al., “Direct language model alignment from online ai feedback,” arXiv preprint arXiv:2402.04792, 2024.
[17]
P. Singhal, N. Lambert, S. Niekum, T. Goyal, and G. Durrett, “D2po: Discriminator-guided dpo with response evaluation models,” arXiv preprint arXiv:2405.01511, 2024.
[18]
C. Rosset, C.-A. Cheng, A. Mitra, M. Santacroce, A. Awadallah, and T. Xie, “Direct nash optimization: Teaching language models to self-improve with general preferences,” arXiv preprint arXiv:2404.03715, 2024.
[19]
S. Jung, G. Han, D. W. Nam, and K.-W. On, “Binary classifier optimization for large language model alignment,” arXiv preprint arXiv:2404.04656, 2024.
[20]
H. Zhao et al., “Rainbowpo: A unified framework for combining improvements in preference optimization,” arXiv preprint arXiv:2410.04203, 2024.
[21]
A. Gorbatovski, B. Shaposhnikov, V. Sinii, A. Malakhov, and D. Gavrilov, “The differences between direct alignment algorithms are a blur,” arXiv preprint arXiv:2502.01237, 2025.
[22]
H. Ivison et al., “Unpacking DPO and PPO: Disentangling best practices for learning from preference feedback,” arXiv preprint arXiv:2406.09279, 2024.
[23]
S. Xu et al., “Is dpo superior to ppo for llm alignment? A comprehensive study,” arXiv preprint arXiv:2404.10719, 2024.
[24]
F. Tajwar et al., “Preference fine-tuning of llms should leverage suboptimal, on-policy data,” arXiv preprint arXiv:2404.14367, 2024.