The Basics of Reinforcement Learning from Human Feedback

Reward Modeling

Reward models are core to the modern approach to RLHF. Reward models have been used extensively in reinforcement learning research as a proxy for environment rewards [1]. The practice is closely related to inverse reinforcement learning, where the problem is to approximate an agent’s reward function given trajectories of behavior [2], and to other areas of deep reinforcement learning. Reward models, in their modern form, were proposed as a tool for studying the value alignment problem [3].

Training Reward Models

There are two popular expressions of the loss used to train a reward model, and they are numerically equivalent. The canonical implementation is derived from the Bradley-Terry model of preference [4]. A Bradley-Terry model measures the probability that, in a pairwise comparison of two events drawn from the same distribution, say \(i\) and \(j\), event \(i\) is preferred over event \(j\), written \(i > j\):

\[P(i > j) = \frac{p_i}{p_i + p_j}\qquad{(1)}\]

To train a reward model, we must formulate a loss function that satisfies the above relation. The first step is to convert a language model into a model that outputs a scalar value, often in the form of a single classification logit. The \(i\) and \(j\) above become two completions, \(y_1\) and \(y_2\), to the same prompt, \(x\), and each is scored by this model, \(r_\theta\).

The probability that \(y_1\) is preferred over \(y_2\) under a given reward model then becomes:

\[P(y_1 > y_2) = \frac{\exp(r(y_1))}{\exp(r(y_1)) + \exp(r(y_2))}\qquad{(2)}\]

Then, by minimizing the negative log-likelihood of this probability with respect to the model parameters, we arrive at the loss function used to train a reward model, where \(y_w\) is the chosen (winning) completion and \(y_l\) is the rejected (losing) completion. The first form, as in [5] and other works: \[\mathcal{L}(\theta) = - \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right)\qquad{(3)}\]

Second, as in [6] and other works: \[\mathcal{L}(\theta) = \log \left( 1 + e^{r_{\theta}(x, y_l) - r_{\theta}(x, y_w)} \right)\qquad{(4)}\]
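
The two forms are the same function written differently. With \(\Delta = r_{\theta}(x, y_w) - r_{\theta}(x, y_l)\) and \(\sigma(z) = 1/(1+e^{-z})\), we have:

\[- \log \sigma(\Delta) = \log \left( 1 + e^{-\Delta} \right) = \log \left( 1 + e^{r_{\theta}(x, y_l) - r_{\theta}(x, y_w)} \right)\]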

Implementation Example

Implementing the reward modeling loss is quite simple. Most of the implementation challenge lies in setting up a separate data loader and inference pipeline. Given the correct dataloader, the loss is implemented as:

import torch.nn as nn

# Score the chosen and rejected completions for each prompt with the reward model.
rewards_chosen = model(**inputs_chosen)
rewards_rejected = model(**inputs_rejected)

# Bradley-Terry loss: negative log-sigmoid of the reward difference, averaged over the batch.
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
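
For completeness, here is a minimal, self-contained sketch of how the `model` above might be constructed: a base encoder whose final hidden state is projected to a single scalar by a value head. The `ToyRewardModel` class, its sizes, and the random token ids are purely illustrative assumptions, not the setup of any cited work.

import torch
import torch.nn as nn

class ToyRewardModel(nn.Module):
    """Illustrative reward model: a tiny transformer encoder with a scalar value head."""
    def __init__(self, vocab_size=1000, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.value_head = nn.Linear(hidden_size, 1)  # maps a hidden state to a scalar reward

    def forward(self, input_ids):
        hidden = self.encoder(self.embed(input_ids))
        # Score each sequence from its final token's hidden state.
        return self.value_head(hidden[:, -1, :]).squeeze(-1)

model = ToyRewardModel()
chosen = torch.randint(0, 1000, (4, 16))    # token ids for prompt + chosen completion
rejected = torch.randint(0, 1000, (4, 16))  # token ids for prompt + rejected completion
loss = -nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()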

Variants

Reward modeling is a relatively under-explored area of RLHF. The traditional reward modeling loss has been modified in many popular works, but the modifications have not solidified into a single best practice.

Preference Margin Loss

In the case where annotators provide either scores or rankings on a Likert scale, the magnitude of the preference between completions can be used in training. The most common practice is to binarize the preference direction, implicitly assigning scores of 1 and 0, but the additional information can be used to improve model training. Llama 2 [17] proposes using the margin between two datapoints, \(m(r)\), to distinguish the magnitude of preference:

\[\mathcal{L}(\theta) = - \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) - m(r) \right) \right)\qquad{(5)}\]
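
As a sketch, the margin only changes one line relative to the earlier loss code; the `margin` values below are made-up examples of per-pair preference magnitudes derived from the annotation scale.

import torch
import torch.nn as nn

# Hypothetical rewards and per-pair margins (larger margin for stronger stated preferences).
rewards_chosen = torch.tensor([1.2, 0.3])
rewards_rejected = torch.tensor([0.4, 0.1])
margin = torch.tensor([1.0, 0.25])

# Same Bradley-Terry loss as before, shifted by the margin.
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()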

Balancing Multiple Comparisons Per Prompt

InstructGPT studies the impact of using a variable number of completions per prompt, while still balancing them in reward model training [5]. To do this, they weight the loss updates per comparison per prompt. At an implementation level, this can be done automatically by including all comparisons from the same prompt in the same training batch, which naturally weights the different pairs; not doing this caused overfitting to the prompts. The loss function becomes:

\[\mathcal{L}(\theta) = - \frac{1}{\binom{K}{2}} \mathbb{E}_{(x, y_w, y_l)\sim D} \left[ \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right) \right]\qquad{(6)}\]
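
A minimal sketch of this batching scheme, assuming each prompt comes with \(K\) completions already ranked from best to worst; `prompt_loss` is a hypothetical helper that averages over all \(\binom{K}{2}\) pairs from one prompt so that prompts with more completions do not dominate the update.

import itertools
import torch
import torch.nn as nn

def prompt_loss(rewards_ranked):
    # rewards_ranked: 1-D tensor of rewards for one prompt's completions, best to worst.
    pairs = list(itertools.combinations(range(len(rewards_ranked)), 2))  # all (winner, loser) index pairs
    losses = [-nn.functional.logsigmoid(rewards_ranked[w] - rewards_ranked[l]) for w, l in pairs]
    # Average over the K-choose-2 comparisons so each prompt contributes equally.
    return torch.stack(losses).mean()

# Example: one prompt with K = 4 completions scored by the reward model.
loss = prompt_loss(torch.tensor([2.1, 1.3, 0.7, -0.2]))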

K-wise Loss Function

Some works extend the pairwise loss to rankings over \(K > 2\) completions at once. For example, the Starling reward models were trained with a K-wise loss based on a Plackett-Luce ranking model, following [7] (https://arxiv.org/abs/2301.11270).
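
Below is a minimal sketch of a Plackett-Luce style K-wise negative log-likelihood, assuming completions are sorted from most to least preferred; `k_wise_loss` is a hypothetical helper, and the exact formulation used for any particular model may differ.

import torch

def k_wise_loss(rewards_ranked):
    # rewards_ranked: 1-D tensor of rewards, sorted from most to least preferred.
    loss = 0.0
    for k in range(len(rewards_ranked) - 1):
        # Log-probability that completion k beats everything ranked below it.
        loss = loss - (rewards_ranked[k] - torch.logsumexp(rewards_ranked[k:], dim=0))
    return loss

# Example: a ranking over K = 4 completions.
loss = k_wise_loss(torch.tensor([1.5, 0.9, 0.2, -0.4]))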

Generative Reward Modeling

Generative reward models use the language model itself to produce its judgment as text or as a next-token prediction, rather than only a scalar score [8], [9], [10], and some variants combine generative reasoning with a classifier-style scalar head [11].

This approach is closely related to LLM-as-a-judge and other evaluator models, which are very popular for evaluation.
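
As a rough illustration of the generative style (not the exact method of any paper cited above), one common pattern is to prompt a language model for a yes/no judgment and score with the relative probability of the "Yes" token. The model name and prompt template below are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; a real generative judge would be a strong instruction-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: What is 2 + 2?\nAnswer: 4\nIs the answer correct? Reply Yes or No.\nReply:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits at the final position

yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
no_id = tokenizer.encode(" No", add_special_tokens=False)[0]
# Score: probability mass on "Yes" relative to "No".
score = torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()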

Further Reading

The following is a short reading list on reward modeling research.

RewardBench (a biased recommendation, but it gives a good overview): [10], [12].

New reward model training methods, including aspect-conditioned models [13], high-quality human datasets [14], [15], scaling [16], extensive experimentation [17], and debiased data [18].

Recommendations

There is a strong tendency in the literature to train reward models for only one epoch, as training for longer tends to overfit to the preference data.

Bibliography

[1]
R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction,” MIT Press, 2018.
[2]
A. Y. Ng and S. Russell, “Algorithms for inverse reinforcement learning,” in ICML, 2000.
[3]
J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini, and S. Legg, “Scalable agent alignment via reward modeling: A research direction,” arXiv preprint arXiv:1811.07871, 2018.
[4]
R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. The method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952, Accessed: Feb. 13, 2023. [Online]. Available: http://www.jstor.org/stable/2334029
[5]
L. Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022.
[6]
A. Askell et al., “A general language assistant as a laboratory for alignment,” arXiv preprint arXiv:2112.00861, 2021.
[7]
B. Zhu, M. Jordan, and J. Jiao, “Principled reinforcement learning with human feedback from pairwise or k-wise comparisons,” in International conference on machine learning, PMLR, 2023, pp. 43037–43067.
[8]
D. Mahan et al., “Generative reward models,” 2024, Available: https://www.synthlabs.ai/pdf/Generative_Reward_Models.pdf
[9]
L. Zhang, A. Hosseini, H. Bansal, M. Kazemi, A. Kumar, and R. Agarwal, “Generative verifiers: Reward modeling as next-token prediction,” arXiv preprint arXiv:2408.15240, 2024.
[10]
N. Lambert, T. K. Gilbert, and T. Zick, “Entangled preferences: The history and risks of reinforcement learning and human feedback,” arXiv preprint arXiv:2310.13595, 2023.
[11]
Z. Ankner, M. Paul, B. Cui, J. D. Chang, and P. Ammanabrolu, “Critique-out-loud reward models,” arXiv preprint arXiv:2408.11791, 2024.
[12]
E. Zhou et al., “RMB: Comprehensively benchmarking reward models in LLM alignment,” arXiv preprint arXiv:2410.09893, 2024.
[13]
H. Wang, W. Xiong, T. Xie, H. Zhao, and T. Zhang, “Interpretable preferences via multi-objective reward modeling and mixture-of-experts,” arXiv preprint arXiv:2406.12845, 2024.
[14]
Z. Wang et al., “HelpSteer2: Open-source dataset for training top-performing reward models,” arXiv preprint arXiv:2406.08673, 2024.
[15]
Z. Wang et al., “HelpSteer2-preference: Complementing ratings with preferences,” arXiv preprint arXiv:2410.01257, 2024.
[16]
B. Adler et al., “Nemotron-4 340B technical report,” arXiv preprint arXiv:2406.11704, 2024.
[17]
H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[18]
J. Park, S. Jwa, M. Ren, D. Kim, and S. Choi, “Offsetbias: Leveraging debiased data for tuning evaluators,” arXiv preprint arXiv:2407.06551, 2024.