A Little Bit of Reinforcement Learning from Human Feedback

A short introduction to RLHF and post-training focused on language models.

Nathan Lambert

Chapter Contents

Definitions & Background

This chapter includes the definitions, symbols, and operations frequently used in the RLHF process, along with a quick overview of language models (the common optimization target of this book).

Language Modeling Overview

The majority of modern language models are trained to learn the joint probability distribution of sequences of tokens (words, subwords, or characters) in an autoregressive manner. Autoregression simply means that each next-token prediction depends on the previous entities in the sequence. Given a sequence of tokens \(x = (x_1, x_2, \ldots, x_T)\), the model factorizes the probability of the entire sequence into a product of conditional distributions:

\[P_{\theta}(x) = \prod_{t=1}^{T} P_{\theta}(x_{t} \mid x_{1}, \ldots, x_{t-1}).\qquad{(1)}\]
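To make the factorization concrete, here is a minimal sketch in Python. The `model_logprob` callable is a hypothetical stand-in for whatever returns \(\log P_{\theta}(x_t \mid x_{<t})\) from a real model's per-position logits.

```python
def sequence_logprob(model_logprob, tokens):
    """Log-probability of a full sequence under an autoregressive model.

    `model_logprob(prefix, token)` is a hypothetical callable returning
    log P_theta(token | prefix) -- e.g., a log-softmax over a real LM's
    logits at the last position of `prefix`.
    """
    total = 0.0
    for t, token in enumerate(tokens):
        total += model_logprob(tokens[:t], token)  # log P(x_t | x_<t)
    return total  # log P(x) = sum_t log P(x_t | x_<t), as in Eq. (1)
```

Exponentiating the result recovers \(P_{\theta}(x)\), though in practice one works in log space to avoid numerical underflow on long sequences.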

To fit a model that accurately predicts this distribution, the goal is typically to maximize the likelihood of the training data under the current model. To do so, we can minimize a negative log-likelihood (NLL) loss:

\[\mathcal{L}_{\text{LM}}(\theta)=-\,\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T}\log P_{\theta}\left(x_t \mid x_{<t}\right)\right]. \qquad{(2)}\]

In practice, this is implemented as a cross-entropy loss on each next-token prediction, comparing the true token at each position of the sequence to the distribution predicted by the model.
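As a sketch of how this looks in code (assuming PyTorch and illustrative tensor shapes), the per-token cross-entropy over shifted logits implements Eq. (2):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: 2 sequences of length 5 over a 100-token vocabulary.
batch, seq_len, vocab = 2, 5, 100
logits = torch.randn(batch, seq_len, vocab)          # stand-in model outputs
tokens = torch.randint(0, vocab, (batch, seq_len))   # the true token ids

# Shift by one so the logits at position t are scored against token t+1.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab),  # predictions for each next token
    tokens[:, 1:].reshape(-1),             # the actual next tokens
)
```

The shift is the key detail: the model's output at position \(t\) is a distribution over token \(t+1\), so the final position has no label and the first token is never predicted.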

Implementing a language model can take many forms. Modern LMs, including ChatGPT, Claude, Gemini, etc., most often use decoder-only Transformers [1]. The core innovation of the Transformer was heavy use of the attention mechanism [2], which allows the model to directly attend to concepts in context and learn complex mappings. Throughout this book, particularly when covering reward models in Chapter 7, we will discuss adding new heads or modifying the language modeling (LM) head of the Transformer. The LM head is a final linear projection layer that maps from the model's internal embedding space to the tokenizer space (a.k.a. the vocabulary). Different heads re-use the internals of the model so it can be fine-tuned to output differently shaped quantities.
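As a minimal sketch (with assumed, GPT-2-like dimensions), the LM head and a reward-model head differ only in their output shape over the same hidden states:

```python
import torch.nn as nn

hidden_size, vocab_size = 768, 50257  # assumed, GPT-2-like dimensions

# LM head: maps each hidden state to logits over the vocabulary.
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Reward head: reuses the same hidden states but emits a single scalar,
# as used for reward models in Chapter 7.
reward_head = nn.Linear(hidden_size, 1)
```

Swapping the head this way keeps the pretrained body of the model intact while changing what quantity it is fine-tuned to produce.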

ML Definitions

Kullback-Leibler (KL) divergence, \(D_{KL}(P || Q)\), measures how much a probability distribution \(P\) differs from a reference distribution \(Q\):

\[ D_{KL}(P || Q) = \sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right) \qquad{(3)}\]
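A minimal sketch of Eq. (3) for discrete distributions (the example lists here are illustrative):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as aligned lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Example: two distributions over three outcomes; zero only when P == Q.
print(kl_divergence([0.5, 0.4, 0.1], [0.4, 0.4, 0.2]))  # ~0.042
```

Note the asymmetry: \(D_{KL}(P || Q) \neq D_{KL}(Q || P)\) in general, which matters when choosing which distribution anchors a KL penalty.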

NLP Definitions

RL Definitions

RLHF Only Definitions

Bibliography

[1] A. Vaswani et al., “Attention is all you need,” in Neural Information Processing Systems, 2017. Available: https://api.semanticscholar.org/CorpusID:13756489

[2] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” CoRR, vol. abs/1409.0473, 2014. Available: https://api.semanticscholar.org/CorpusID:11212020