Definitions & Background
This chapter includes the definitions, symbols, and operations frequently used in the RLHF process, along with a quick overview of language models (the common optimization target of this book).
Language Modeling Overview
The majority of modern language models are trained to learn the joint probability distribution of sequences of tokens (words, subwords, or characters) in an autoregressive manner. Autoregression simply means that each next-token prediction depends on the previous tokens in the sequence. Given a sequence of tokens \(x = (x_1, x_2, \ldots, x_T)\), the model factorizes the probability of the entire sequence into a product of conditional distributions:
\[P_{\theta}(x) = \prod_{t=1}^{T} P_{\theta}(x_{t} \mid x_{1}, \ldots, x_{t-1}).\qquad{(1)}\]
To fit a model that accurately predicts this distribution, the goal is typically to maximize the likelihood of the training data under the current model. To do so, we can minimize a negative log-likelihood (NLL) loss:
\[\mathcal{L}_{\text{LM}}(\theta)=-\,\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T}\log P_{\theta}\left(x_t \mid x_{<t}\right)\right]. \qquad{(2)}\]
In practice, one uses a cross-entropy loss with respect to each next-token prediction, computed by comparing the true token in a sequence to what was predicted by the model.
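As a concrete illustration, the loss in Eq. (2) reduces to a shifted next-token cross-entropy over model logits. The sketch below is a minimal version assuming PyTorch; the tensor names and shapes are illustrative rather than tied to any specific library.

```python
import torch
import torch.nn.functional as F

def next_token_nll(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Average negative log-likelihood of a token sequence under a causal LM.

    logits:    (batch, seq_len, vocab_size) raw model outputs
    input_ids: (batch, seq_len) token ids of the same sequences
    """
    # Predict token t from positions < t: drop the last logit, drop the first label.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    # Cross-entropy over the vocabulary at every position, averaged over all tokens.
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```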
Implementing a language model can take many forms. Modern LMs, including ChatGPT, Claude, Gemini, etc., most often use decoder-only Transformers [1]. The core innovation of the Transformer was its heavy use of the self-attention [2] mechanism, which allows the model to directly attend to concepts in context and learn complex mappings. Throughout this book, particularly when covering reward models in Chapter 7, we will discuss adding new heads or modifying the language modeling (LM) head of the transformer. The LM head is a final linear projection layer that maps from the model's internal embedding space to the tokenizer space (a.k.a. vocabulary). Different heads allow the internals of the model to be re-used while fine-tuning it to output differently shaped quantities.
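To make the idea of swapping heads concrete, here is a minimal sketch, assuming PyTorch, of a hypothetical wrapper that keeps a transformer body but projects its final hidden states to a single scalar, as is common for reward or value models; `backbone` is a placeholder for any module returning per-token hidden states.

```python
import torch
import torch.nn as nn

class ScalarHeadModel(nn.Module):
    """Re-uses a transformer body but replaces the LM head with a scalar head."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # returns (batch, seq_len, hidden_size)
        self.scalar_head = nn.Linear(hidden_size, 1)  # hidden state -> single score

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)             # (batch, seq_len, hidden_size)
        scores = self.scalar_head(hidden)             # (batch, seq_len, 1)
        # A common convention is to read the score at the final token position.
        return scores[:, -1, 0]                       # (batch,)
```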
ML Definitions
- Kullback-Leibler (KL) divergence (\(D_{KL}(P || Q)\)), also known as relative entropy, is a measure of the difference between two probability distributions. For discrete probability distributions \(P\) and \(Q\) defined on the same probability space \(\mathcal{X}\), the KL divergence from \(Q\) to \(P\) is defined as:
\[ D_{KL}(P || Q) = \sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right) \qquad{(3)}\]
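As a small numerical sketch of Eq. (3) (the distributions here are arbitrary made-up values over a three-element support):

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(P || Q) for discrete distributions defined over the same support."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.6, 0.3, 0.1]
q = [0.5, 0.4, 0.1]
print(kl_divergence(p, q))  # a small positive number; zero only when P equals Q
```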
NLP Definitions
- Prompt (\(x\)): The input text given to a language model to generate a response or completion.
- Completion (\(y\)): The output text generated by a language model in response to a prompt. Often the completion is denoted as \(y|x\).
- Chosen Completion (\(y_c\)): The completion that is selected or preferred over other alternatives, often denoted as \(y_{chosen}\).
- Rejected Completion (\(y_r\)): The disfavored completion in a pairwise setting.
- Preference Relation (\(\succ\)): A symbol indicating that one completion is preferred over another, e.g., \(y_{chosen} \succ y_{rejected}\).
- Policy (\(\pi\)): A probability distribution over possible completions, parameterized by \(\theta\): \(\pi_\theta(y|x)\).
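Pulling these symbols together, a single preference example can be thought of as a prompt \(x\) paired with a chosen completion \(y_c\) and a rejected completion \(y_r\). The dataclass below is a hypothetical illustration of that structure, not the format of any particular dataset or library:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str    # x
    chosen: str    # y_c, the preferred completion
    rejected: str  # y_r, the disfavored completion

example = PreferencePair(
    prompt="Explain KL divergence in one sentence.",
    chosen="It measures how much one probability distribution differs from another.",
    rejected="It is the physical distance between two points.",
)
```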
RL Definitions
- Reward (\(r\)): A scalar value indicating the desirability of an action or state, typically denoted as \(r\).
- Action (\(a\)): A decision or move made by an agent in an environment, often represented as \(a \in A\), where \(A\) is the set of possible actions.
- State (\(s\)): The current configuration or situation of the environment, usually denoted as \(s \in S\), where \(S\) is the state space.
- Trajectory (\(\tau\)): A trajectory \(\tau\) is a sequence of states, actions, and rewards experienced by an agent: \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T)\).
- Trajectory Distribution (\(P(\tau|\pi)\)): The probability of a trajectory under policy \(\pi\) is \(P(\tau|\pi) = p(s_0)\prod_{t=0}^T \pi(a_t|s_t)p(s_{t+1}|s_t,a_t)\), where \(p(s_0)\) is the initial state distribution and \(p(s_{t+1}|s_t,a_t)\) is the transition probability.
- Policy (\(\pi\)), also called the policy model in RLHF: In RL, a policy is a strategy or rule that the agent follows to decide which action to take in a given state: \(\pi(a|s)\).
- Value Function (\(V\)): A function that estimates the expected cumulative reward from a given state: \(V(s) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_t | s_0 = s]\).
- Q-Function (\(Q\)): A function that estimates the expected cumulative reward from taking a specific action in a given state: \(Q(s,a) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_t | s_0 = s, a_0 = a]\).
- Advantage Function (\(A\)): The advantage function \(A(s,a)\) quantifies the relative benefit of taking action \(a\) in state \(s\) compared to the average action. It is defined as \(A(s,a) = Q(s,a) - V(s)\). Advantage functions (and value functions) can depend on a specific policy, \(A^\pi(s,a)\).
- Policy-conditioned Values (\([]^{\pi(\cdot)}\)): Across RL derivations and implementations, a crucial component of the theory and practice is collecting data or values conditioned on a specific policy. Throughout this book we will switch between the simpler notation for value functions and related quantities (\(V,A,Q,G\)) and their policy-conditioned versions (\(V^\pi,A^\pi,Q^\pi\)). Also crucial in computing expected values is sampling from data \(d\) that is conditioned on a specific policy, \(d_\pi\).
- Expectation of Reward Optimization: The primary goal in RL, which involves maximizing the expected cumulative reward:
\[\max_{\theta} \mathbb{E}_{s \sim \rho_\pi, a \sim \pi_\theta}[\sum_{t=0}^{\infty} \gamma^t r_t]\qquad{(4)}\]
where \(\rho_\pi\) is the state distribution under policy \(\pi\), and \(\gamma\) is the discount factor.
- Finite Horizon Reward (\(J(\pi_\theta)\)): The expected finite-horizon discounted return of the policy \(\pi_\theta\), parameterized by \(\theta\), is defined as \(J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \gamma^t r_t \right]\), where \(\tau \sim \pi_\theta\) denotes trajectories sampled by following policy \(\pi_\theta\) and \(T\) is the finite horizon. A small sketch of computing this return appears after these definitions.
- On-policy: In RLHF, particularly in the debate between RL and Direct Alignment Algorithms, the discussion of on-policy data is common. In the RL literature, on-policy means that the data is generated exactly by the current form of the agent, but in the general preference-tuning literature, on-policy is expanded to mean generations from that edition of the model, e.g., an instruction-tuned checkpoint before any preference fine-tuning has been run. In this context, off-policy could be data generated by any other language model being used in post-training.
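To ground the return and advantage definitions above, below is a minimal sketch that computes discounted returns for a finite trajectory and a simple advantage estimate against critic values; the reward and value numbers are made up, and practical implementations typically use more refined estimators such as GAE.

```python
def discounted_returns(rewards: list[float], gamma: float = 0.99) -> list[float]:
    """G_t = sum_{k >= t} gamma^(k - t) * r_k, computed backwards over a finite trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [0.0, 0.0, 1.0]   # made-up per-step rewards r_t
values = [0.2, 0.5, 0.9]    # made-up critic estimates V(s_t)

returns = discounted_returns(rewards)
# A simple advantage estimate: A_t ~= G_t - V(s_t).
advantages = [g - v for g, v in zip(returns, values)]
```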
RLHF Only Definitions
- Reference Model (\(\pi_\text{ref}\)): A saved set of model parameters used in RLHF whose outputs are used to regularize the optimization.
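As a sketch of how the reference model enters the optimization, a common pattern is to penalize the policy's per-token log-probabilities for drifting from the reference's. The snippet below, assuming PyTorch, uses the simple log-ratio estimator of the KL term; tensor names are illustrative, and exact penalty forms vary across implementations.

```python
import torch

def per_token_kl_penalty(policy_logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor) -> torch.Tensor:
    """Per-token log-ratio, a simple estimator of KL(pi || pi_ref).

    Both tensors have shape (batch, seq_len) and hold log-probs of the sampled tokens.
    """
    return policy_logprobs - ref_logprobs

# Usage: the penalty is typically scaled and subtracted from the reward, e.g. r_t - beta * kl_t.
```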
Extended Glossary
- Synthetic Data: Any training data for an AI model that is the output of another AI system. This could be anything from text a model generates from an open-ended prompt to a model re-writing existing content.
- Distillation: Distillation is a general set of practices in training AI models where a model is trained on the outputs of a stronger model. This type of synthetic data is known to produce strong, smaller models. Most models make the rules around distillation clear through either the license, for open-weight models, or the terms of service, for models accessible only via API. The term distillation is now overloaded with a specific technical definition from the ML literature.
- (Teacher-student) Knowledge Distillation: Knowledge distillation from a specific teacher model to a student model is a specific form of the distillation described above and is where the term originated. It is a deep learning method where the training loss is modified so that the student learns from the teacher model's log-probabilities over multiple potential tokens/logits, instead of learning directly from a chosen output [3]. Examples of modern model series trained with knowledge distillation are Gemma 2 [4] and Gemma 3. For a language modeling setup, the next-token loss function can be modified as follows [5], where the student model \(P_\theta\) learns from the teacher distribution \(P_\phi\) over the vocabulary \(\mathcal{V}\):
\[\mathcal{L}_{\text{KD}}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T}\sum_{v \in \mathcal{V}} P_{\phi}(v \mid x_{<t}) \log P_{\theta}(v \mid x_{<t})\right]. \qquad{(5)}\]
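A minimal sketch of this loss, assuming PyTorch and logits already computed by the student and a frozen teacher at the same token positions (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the student against the full teacher distribution.

    Both tensors have shape (batch, seq_len, vocab_size).
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)         # P_phi(. | x_<t)
    student_logprobs = F.log_softmax(student_logits, dim=-1)  # log P_theta(. | x_<t)
    # Sum over the vocabulary, then average over batch and positions.
    return -(teacher_probs * student_logprobs).sum(dim=-1).mean()
```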
- In-context Learning (ICL): In-context here refers to any information within the context window of the language model. Usually, this is information added to the prompt. The simplest form of in-context learning is adding examples of a similar form before the prompt. Advanced versions can learn which information to include for a specific use-case.
- Chain of Thought (CoT): Chain of thought is a specific behavior of language models where they are steered towards breaking a problem down in a step-by-step form. The original version of this was elicited through the prompt “Let’s think step by step” [6].
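As a small illustration of both in-context learning and chain of thought, the snippet below assembles a few-shot prompt and appends a step-by-step instruction; the examples and wording are arbitrary placeholders:

```python
# Few-shot in-context examples placed before the new question.
examples = [
    ("What is 12 * 3?", "12 * 3 = 36. The answer is 36."),
    ("What is 7 + 15?", "7 + 15 = 22. The answer is 22."),
]
question = "What is 9 * 14?"

prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
# The final line nudges the model toward chain-of-thought behavior.
prompt += f"\n\nQ: {question}\nA: Let's think step by step."
```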