A Little Bit of Reinforcement Learning from Human Feedback

A short introduction to RLHF and post-training focused on language models.

Nathan Lambert


Reward Modeling

Reward models are core to the modern approach to RLHF. Reward models have been used extensively in reinforcement learning research as a proxy for environment rewards [1]. The practice is closely related to inverse reinforcement learning, where the problem is to approximate an agent’s reward function given trajectories of behavior [2], and to other areas of deep reinforcement learning. Reward models were proposed, in their modern form, as a tool for studying the value alignment problem [3].

Training Reward Models

There are two popular expressions of the loss used to train a reward model, and they are numerically equivalent. The canonical implementation is derived from the Bradley-Terry model of preference [4]. A Bradley-Terry model gives the probability that, in a pairwise comparison of two events drawn from the same distribution, say \(i\) and \(j\), event \(i\) is preferred over event \(j\), i.e. \(i > j\):

\[P(i > j) = \frac{p_i}{p_i + p_j}\qquad{(1)}\]

To train a reward model, we must formulate a loss function that satisfies the above relation. The first step is to convert a language model into a model that outputs a scalar value, often in the form of a single classification logit. We can then score two samples with this model: the \(i\) and \(j\) above become two completions, \(y_1\) and \(y_2\), to the same prompt, \(x\), and each is scored by the reward model \(r_\theta\).

The probability of success for a given reward model in a pairwise comparison then becomes:

\[P(y_1 > y_2) = \frac{\exp(r(y_1))}{\exp(r(y_1)) + \exp(r(y_2))}\qquad{(2)}\]

Then, by minimizing the negative log-likelihood of this probability with respect to the model parameters, we arrive at the loss function used to train a reward model. The first form, as in [5] and other works: \[\mathcal{L}(\theta) = - \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right)\qquad{(3)}\]

Second, as in [6] and other works: \[\mathcal{L}(\theta) = \log \left( 1 + e^{r_{\theta}(x, y_l) - r_{\theta}(x, y_w)} \right)\qquad{(4)}\]
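To make the equivalence concrete, here is a minimal sketch (with hypothetical reward values) verifying that the two loss forms agree numerically:

import torch
import torch.nn as nn

# Hypothetical scalar rewards for a chosen and a rejected completion
r_chosen = torch.tensor(1.7)
r_rejected = torch.tensor(0.3)

# First form (Eq. 3): negative log-sigmoid of the reward difference
loss_v1 = -nn.functional.logsigmoid(r_chosen - r_rejected)

# Second form (Eq. 4): log(1 + exp(r_l - r_w))
loss_v2 = torch.log(1 + torch.exp(r_rejected - r_chosen))

print(loss_v1.item(), loss_v2.item())  # identical up to floating point error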

Architecture

The most common way reward models are implemented is through an abstraction similar to Transformers’ AutoModelForSequenceClassification, which appends a small linear head to the language model and performs classification between two outcomes – chosen and rejected. At inference time, the model outputs the probability that the piece of text is chosen as a single logit from the model.
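As a rough sketch of this abstraction (the checkpoint name below is a placeholder, not a specific model), a reward model can be instantiated by loading a base model with a single-output classification head and reading its logit as the score:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "path/to/base-model"  # placeholder checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 appends a linear head with a single output, used as the scalar reward
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

# Score a prompt-completion pair; the single logit is the reward
inputs = tokenizer("prompt text" + " completion text", return_tensors="pt")
reward = model(**inputs).logits  # shape (1, 1)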

Other implementation options exist, such as just taking a linear layer directly from the final embeddings, but they are less common in open tooling.

Implementation Example

Implementing the reward modeling loss is quite simple. More of the implementation challenge lies in setting up a separate data loader and inference pipeline. Given the correct dataloader, the loss is implemented as:

import torch.nn as nn

# Scalar rewards for the chosen and rejected completions, as produced by a
# reward model with a single-output head (see the Architecture section above)
rewards_chosen = model(**inputs_chosen)
rewards_rejected = model(**inputs_rejected)

# Negative log-sigmoid of the reward margin, averaged over the batch (Eq. 3)
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()

Note that when training reward models, the most common practice is to train for only one epoch to avoid overfitting.

Variants

Reward modeling is a relatively under-explored area of RLHF. The traditional reward modeling loss has been modified in many popular works, but the modifications have not solidified into a single best practice.

Preference Margin Loss

In the case where annotators provide either scores or rankings on a Likert scale, the magnitude of these relative judgments can be used in training. The most common practice is to binarize the preference direction, implicitly assigning scores of 1 and 0, but the additional information has been used to improve model training. Llama 2 proposes using the margin between two datapoints, \(m(r)\), to distinguish the magnitude of preference [32]:

\[\mathcal{L}(\theta) = - \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) - m(r) \right) \right)\qquad{(5)}\]
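As a minimal sketch (with hypothetical reward and margin tensors), the margin simply shifts the reward difference before the log-sigmoid:

import torch
import torch.nn as nn

# Hypothetical scalar rewards and per-example margins m(r); larger margins
# correspond to pairs where annotators expressed a stronger preference
rewards_chosen = torch.tensor([1.5, 0.8])
rewards_rejected = torch.tensor([0.2, 0.6])
margin = torch.tensor([1.0, 0.25])

# Margin variant of the pairwise loss (Eq. 5)
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()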

Balancing Multiple Comparisons Per Prompt

InstructGPT studies the impact of using a variable number of completions per prompt while balancing them in the reward model training [5]. To do this, they weight the loss updates per comparison per prompt. At an implementation level, this can be done automatically by including all comparisons with the same prompt in the same training batch, which naturally weights the different pairs – not doing this caused overfitting to the prompts. The loss function becomes:

\[\mathcal{L}(\theta) = - \frac{1}{\binom{K}{2}} \mathbb{E}_{(x, y_w, y_l)\sim D} \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right)\qquad{(6)}\]
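A rough sketch of this batching, assuming hypothetical scalar rewards for \(K\) completions of a single prompt that are already sorted by preference:

import itertools
import torch
import torch.nn as nn

# Hypothetical rewards for K = 4 completions of the same prompt, ordered from
# most preferred (index 0) to least preferred
rewards = torch.tensor([2.1, 1.4, 0.2, -0.5])
K = rewards.shape[0]

# All C(K, 2) comparisons from one prompt kept in a single batch, so the prompt
# contributes one averaged update rather than C(K, 2) independent ones
pair_losses = [
    -nn.functional.logsigmoid(rewards[w] - rewards[l])
    for w, l in itertools.combinations(range(K), 2)
]
loss = torch.stack(pair_losses).mean()  # the 1 / C(K, 2) factor in Eq. (6)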

K-wise Loss Function

There are many other formulations that can create suitable models of human preferences for RLHF. One such example, used in the popular, early RLHF’d models Starling 7B and 34B [7], is a K-wise loss function based on the Plackett-Luce model [8].

Zhu et al. 2023 formalizes the setup as follows [9]. With a prompt, or state, \(s^i\), \(K\) actions \((a_0^i, a_1^i, \cdots, a_{K-1}^i)\) are sampled from \(P(a_0,\cdots,a_{K-1}|s^i)\). Then, labelers rank the actions, where \(\sigma^i: [K] \mapsto [K]\) is a function representing the ranking and \(\sigma^i(0)\) is the most preferred action. This yields a preference model capturing the following:

\[P(\sigma^i|s^i,a_0^i,a_1^i,\ldots,a_{K-1}^i) = \prod_{k=0}^{K-1} \frac{\exp(r_{\theta\star}(s^i,a_{\sigma^i(k)}^i))}{\sum_{j=k}^{K-1}\exp(r_{\theta\star}(s^i,a_{\sigma^i(j)}^i))}\]

When \(K = 2\), this reduces to the Bradley-Terry (BT) model for pairwise comparisons. Regardless, once trained, these models are used similarly to other reward models during RLHF training.
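A minimal sketch of the corresponding negative log-likelihood, assuming hypothetical rewards that are already sorted by the labeler’s ranking \(\sigma^i\):

import torch

# Hypothetical rewards for K completions of one prompt, already sorted by the
# labeler's ranking so that index 0 is the most preferred
rewards = torch.tensor([1.8, 0.9, 0.4, -0.7])
K = rewards.shape[0]

# Negative log-likelihood of the Plackett-Luce preference model above
nll = torch.tensor(0.0)
for k in range(K - 1):
    # log-probability that completion k is ranked above all remaining completions
    nll = nll - (rewards[k] - torch.logsumexp(rewards[k:], dim=0))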

Outcome Reward Models

The majority of preference tuning for language models and other AI systems is done with the Bradley-Terry models discussed above. For reasoning-heavy tasks, one can use an Outcome Reward Model (ORM). The training data for an ORM is constructed in a similar manner to standard preference tuning: we have a problem statement or prompt, \(x\), and two completions \(y_1\) and \(y_2\). The inductive bias used here is that one completion should be a correct solution to the problem and one incorrect, resulting in \((y_c,y_{ic})\).

The shape of the models used is very similar to that of a standard reward model: a linear layer is appended to a model so that it can output a single logit (in the case of an RM). With an ORM, the training objective that follows is slightly different [10]:

[We] train verifiers with a joint objective where the model learns to label a model completion as correct or incorrect, in addition to the original language modeling objective. Architecturally, this means our verifiers are language models, with a small scalar head that outputs predictions on a per-token basis. We implement this scalar head as a single bias parameter and single gain parameter that operate on the logits outputted by the language model’s final unembedding layer.

To translate, this is implemented as a language modeling head that predicts two classes per token (1 for correct, 0 for incorrect), rather than the classification head of a traditional RM that outputs one score for the entire sequence.

These models have continued to be used, but are less supported in open-source RLHF tools. For example, the same type of ORM was used in the seminal work Let’s Verify Step by Step [11], but without the language modeling piece of the loss. There, the final loss is a cross-entropy loss on every token predicting whether the final answer is correct.
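A minimal sketch of that per-token objective, with hypothetical logits and a sequence whose final answer is labeled correct:

import torch
import torch.nn as nn

# Hypothetical per-token logits from a verifier head over 2 classes
# (correct / incorrect), shape (batch, seq_len, 2)
logits = torch.randn(1, 16, 2)

# Every token in the sequence shares the same label: 1 if the final answer is
# correct, 0 if it is incorrect
labels = torch.ones(1, 16, dtype=torch.long)

# Cross-entropy on every token predicting whether the final answer is correct
loss = nn.functional.cross_entropy(logits.view(-1, 2), labels.view(-1))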

Process Reward Models

Process Reward Models (PRMs), originally called Process-supervised Reward Models, are reward models trained to output scores at every step in a chain-of-thought reasoning process. These differ from a standard RM, which outputs a score only at an EOS token, and from an ORM, which outputs a score at every token. Process Reward Models require supervision at the end of each reasoning step, and are then trained similarly, where the tokens in each step are trained toward their relevant target – the target is the step for PRMs and the entire response for ORMs.

Here’s an example of how this per-step label can be packaged in a trainer, from HuggingFace’s TRL [12]:

# Get the ID of the separator token and add it to the completions
separator_ids = tokenizer.encode(step_separator, add_special_tokens=False)
completions_ids = [completion + separator_ids for completion in completions_ids]

# Create the per-step labels: -100 is ignored by the loss, so only the final
# (separator) token of each step carries that step's label
labels = [[-100] * (len(completion) - 1) + [label] for completion, label in zip(completions_ids, labels)]

In practice, PRMs are often trained with language modeling heads and use three classes per reasoning step: positive (+), neutral, and negative (-).

Generative Reward Modeling

Given the cost of preference data, a large research area has emerged that uses existing language models as judges of human preferences or in other evaluation settings [13]. The core idea is to prompt a language model with instructions on how to judge, a prompt, and two completions (much as would be done with human labelers). An example prompt, from one of the seminal works in this area, the chat evaluation MT-Bench [13], follows:

[System]
Please act as an impartial judge and evaluate the quality of the responses provided by two
AI assistants to the user question displayed below. You should choose the assistant that
follows the user’s instructions and answers the user’s question better. Your evaluation
should consider factors such as the helpfulness, relevance, accuracy, depth, creativity,
and level of detail of their responses. Begin your evaluation by comparing the two
responses and provide a short explanation. Avoid any position biases and ensure that the
order in which the responses were presented does not influence your decision. Do not allow
the length of the responses to influence your evaluation. Do not favor certain names of
the assistants. Be as objective as possible. After providing your explanation, output your
final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]"
if assistant B is better, and "[[C]]" for a tie.
[User Question]
{question}
[The Start of Assistant A’s Answer]
{answer_a}
[The End of Assistant A’s Answer]
[The Start of Assistant B’s Answer]
{answer_b}
[The End of Assistant B’s Answer]
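As a small illustration of how such a judge is used in practice (the template is abbreviated and all variable names here are illustrative), one fills in the prompt and parses the verdict from the judge’s completion:

import re

# Illustrative only: an abbreviated version of the judge template above and a
# hypothetical judge completion; in practice the full system prompt is used and
# the completion comes from sampling the judge model
judge_template = (
    "[User Question]\n{question}\n"
    "[The Start of Assistant A's Answer]\n{answer_a}\n[The End of Assistant A's Answer]\n"
    "[The Start of Assistant B's Answer]\n{answer_b}\n[The End of Assistant B's Answer]"
)
judge_prompt = judge_template.format(question="What is 2 + 2?", answer_a="4", answer_b="5")

judge_output = "Assistant A answers the question correctly. [[A]]"
verdict = re.search(r"\[\[(A|B|C)\]\]", judge_output)
print(verdict.group(1) if verdict else None)  # "A" -> Assistant A preferred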

LLM-as-a-judge proved effective for evaluation, spawning many other evaluations such as AlpacaEval [14], Arena-Hard [15], and WildBench [16], and many practitioners began using LLM-as-a-judge instead of reward models to create and use preference data.

An entire area of research has emerged on how to use so-called “Generative Reward Models” [17] [18] [19] (including models trained specifically to be effective judges [20]), but on RM evaluations they tend to lag behind dedicated reward models, showing that reward modeling remains an important technique for current RLHF.

Further Reading

The academic literature on reward modeling established itself in 2024. The bulk of early progress in reward modeling has been in establishing benchmarks and identifying behavior modes. The first RM benchmark, RewardBench, provided common infrastructure for testing reward models [21]. Since then, RM evaluation has expanded to resemble the types of evaluations available for general post-trained models, where some evaluations test the accuracy of predictions in domains with known true answers [21] while others are closer to “vibes” checks performed with LLM-as-a-judge or correlations to other benchmarks [22].

Examples of new benchmarks include multilingual reward bench (M-RewardBench) [23], RAG-RewardBench [24], RM-Bench [25], Preference Proxy Evaluations [26], and RewardMATH [27].

To understand progress on training reward models, one can reference new reward model training methods, with aspect-conditioned models [28], high quality human datasets [29] [30], scaling [31], extensive experimentation [32], or debiasing data [33].

Bibliography

[1]
R. S. Sutton, “Reinforcement learning: An introduction,” A Bradford Book, 2018.
[2]
A. Y. Ng, S. Russell, et al., “Algorithms for inverse reinforcement learning,” in ICML, 2000, p. 2.
[3]
J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini, and S. Legg, “Scalable agent alignment via reward modeling: A research direction,” arXiv preprint arXiv:1811.07871, 2018.
[4]
R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. The method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952, Accessed: Feb. 13, 2023. [Online]. Available: http://www.jstor.org/stable/2334029
[5]
L. Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022.
[6]
A. Askell et al., “A general language assistant as a laboratory for alignment,” arXiv preprint arXiv:2112.00861, 2021.
[7]
B. Zhu et al., “Starling-7B: Improving helpfulness and harmlessness with RLAIF,” in First conference on language modeling, 2024.
[8]
A. Liu, Z. Zhao, C. Liao, P. Lu, and L. Xia, “Learning plackett-luce mixtures from partial preferences,” in Proceedings of the AAAI conference on artificial intelligence, 2019, pp. 4328–4335.
[9]
B. Zhu, M. Jordan, and J. Jiao, “Principled reinforcement learning with human feedback from pairwise or k-wise comparisons,” in International conference on machine learning, PMLR, 2023, pp. 43037–43067.
[10]
K. Cobbe et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
[11]
H. Lightman et al., “Let’s verify step by step,” arXiv preprint arXiv:2305.20050, 2023.
[12]
L. von Werra et al., “TRL: Transformer reinforcement learning,” GitHub repository. https://github.com/huggingface/trl; GitHub, 2020.
[13]
L. Zheng et al., “Judging LLM-as-a-judge with MT-Bench and Chatbot Arena,” Advances in Neural Information Processing Systems, vol. 36, pp. 46595–46623, 2023.
[14]
Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto, “Length-controlled AlpacaEval: A simple way to debias automatic evaluators,” arXiv preprint arXiv:2404.04475, 2024.
[15]
T. Li et al., “From crowdsourced data to high-quality benchmarks: Arena-hard and BenchBuilder pipeline,” arXiv preprint arXiv:2406.11939, 2024.
[16]
B. Y. Lin et al., “WILDBENCH: Benchmarking LLMs with challenging tasks from real users in the wild,” arXiv preprint arXiv:2406.04770, 2024.
[17]
D. Mahan et al., “Generative reward models,” 2024, Available: https://www.synthlabs.ai/pdf/Generative_Reward_Models.pdf
[18]
L. Zhang, A. Hosseini, H. Bansal, M. Kazemi, A. Kumar, and R. Agarwal, “Generative verifiers: Reward modeling as next-token prediction,” arXiv preprint arXiv:2408.15240, 2024.
[19]
Z. Ankner, M. Paul, B. Cui, J. D. Chang, and P. Ammanabrolu, “Critique-out-loud reward models,” arXiv preprint arXiv:2408.11791, 2024.
[20]
S. Kim et al., “Prometheus: Inducing fine-grained evaluation capability in language models,” in The twelfth international conference on learning representations, 2023.
[21]
N. Lambert et al., “RewardBench: Evaluating reward models for language modeling,” arXiv preprint arXiv:2403.13787, 2024.
[22]
X. Wen et al., “Rethinking reward model evaluation: Are we barking up the wrong tree?” arXiv preprint arXiv:2410.05584, 2024.
[23]
S. Gureja et al., “M-RewardBench: Evaluating reward models in multilingual settings,” arXiv preprint arXiv:2410.15522, 2024.
[24]
Z. Jin et al., “RAG-RewardBench: Benchmarking reward models in retrieval augmented generation for preference alignment,” arXiv preprint arXiv:2412.13746, 2024.
[25]
E. Zhou et al., “RMB: Comprehensively benchmarking reward models in LLM alignment,” arXiv preprint arXiv:2410.09893, 2024.
[26]
E. Frick et al., “How to evaluate reward models for RLHF,” arXiv preprint arXiv:2410.14872, 2024.
[27]
S. Kim et al., “Evaluating robustness of reward models for mathematical reasoning,” arXiv preprint arXiv:2410.01729, 2024.
[28]
H. Wang, W. Xiong, T. Xie, H. Zhao, and T. Zhang, “Interpretable preferences via multi-objective reward modeling and mixture-of-experts,” arXiv preprint arXiv:2406.12845, 2024.
[29]
Z. Wang et al., “HelpSteer2: Open-source dataset for training top-performing reward models,” arXiv preprint arXiv:2406.08673, 2024.
[30]
Z. Wang et al., “HelpSteer2-preference: Complementing ratings with preferences,” arXiv preprint arXiv:2410.01257, 2024.
[31]
B. Adler et al., “Nemotron-4 340B technical report,” arXiv preprint arXiv:2406.11704, 2024.
[32]
H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[33]
J. Park, S. Jwa, M. Ren, D. Kim, and S. Choi, “OffsetBias: Leveraging debiased data for tuning evaluators,” arXiv preprint arXiv:2407.06551, 2024.