Definitions & Background
This chapter includes all the definitions, symbols, and operations frequently used in the RLHF process, along with a quick overview of language models (the common optimization target of this book).
Language Modeling Overview
The majority of modern language models are trained to learn the joint probability distribution of sequences of tokens (words, subwords, or characters) in an autoregressive manner. Autoregression simply means that each next prediction depends on the previous tokens in the sequence. Given a sequence of tokens \(x = (x_1, x_2, \ldots, x_T)\), the model factorizes the probability of the entire sequence into a product of conditional distributions:
\[P_{\theta}(x) = \prod_{t=1}^{T} P_{\theta}(x_{t} \mid x_{1}, \ldots, x_{t-1}).\qquad{(1)}\]
In order to fit a model that accurately predicts this distribution, the goal is often to maximize the likelihood of the training data under the current model. To do so, we can minimize a negative log-likelihood (NLL) loss:
\[\mathcal{L}_{\text{LM}}(\theta)=-\,\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T}\log P_{\theta}\left(x_t \mid x_{<t}\right)\right]. \qquad{(2)}\]
In practice, one uses a cross-entropy loss for each next-token prediction, comparing the true token in the sequence to the distribution predicted by the model.
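As a concrete sketch of Equation (2), the shifted next-token cross-entropy can be computed as follows; PyTorch is used for illustration, and the random logits stand in for an actual model's forward pass:

```python
# Minimal sketch of the next-token NLL loss in Equation (2), using PyTorch.
# The "model" here is a stand-in: any module mapping token ids to logits of
# shape (batch, seq_len, vocab_size) would work the same way.
import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 100, 2, 8
token_ids = torch.randint(0, vocab_size, (batch, seq_len))

# Stand-in for a language model forward pass: random logits.
logits = torch.randn(batch, seq_len, vocab_size)

# Shift so that position t predicts token t+1 (the autoregressive factorization).
pred_logits = logits[:, :-1, :]          # predictions for x_2 ... x_T
targets = token_ids[:, 1:]               # true next tokens

# Cross-entropy over the vocabulary equals the negative log-likelihood per token.
loss = F.cross_entropy(
    pred_logits.reshape(-1, vocab_size),  # (batch * (seq_len - 1), vocab)
    targets.reshape(-1),                  # (batch * (seq_len - 1),)
)
print(loss)  # average NLL over all next-token predictions
```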
Implementing a language model can take many forms. Modern LMs, including ChatGPT, Claude, Gemini, etc., most often use decoder-only Transformers [1]. The core innovation of the Transformer was heavily utilizing the self-attention [2] mechanism to allow the model to directly attend to concepts in context and learn complex mappings. Throughout this book, particularly when covering reward models in Chapter 7, we will discuss adding new heads or modifying a language modeling (LM) head of the transformer. The LM head is a final linear projection layer that maps from the model's internal embedding space to the tokenizer space (a.k.a. vocabulary). Different heads can be used to re-use the internals of the model and fine-tune it to output differently shaped quantities.
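As a minimal sketch of this idea (the hidden and vocabulary sizes below are illustrative, not tied to any particular model), an LM head and an alternative scalar head can share the same hidden states:

```python
# Sketch of an LM head vs. a scalar head on top of the same hidden states.
# hidden_size and vocab_size are illustrative; real models differ.
import torch
import torch.nn as nn

hidden_size, vocab_size = 512, 100
hidden_states = torch.randn(2, 8, hidden_size)  # (batch, seq_len, hidden)

# LM head: projects hidden states to vocabulary logits (tokenizer space).
lm_head = nn.Linear(hidden_size, vocab_size)
logits = lm_head(hidden_states)                 # (batch, seq_len, vocab_size)

# Alternative head (as used by reward models, Chapter 7): projects the same
# hidden states to a single scalar per position instead of vocabulary logits.
scalar_head = nn.Linear(hidden_size, 1)
scores = scalar_head(hidden_states)             # (batch, seq_len, 1)
```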
ML Definitions
Kullback-Leibler (KL) divergence (\(D_{KL}(P || Q)\)) is a measure of the difference between two probability distributions. For discrete probability distributions \(P\) and \(Q\) defined on the same probability space \(\mathcal{X}\), the KL divergence from \(Q\) to \(P\) is defined as:
\[ D_{KL}(P || Q) = \sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right) \qquad{(3)}\]
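As a tiny numerical sketch of Equation (3), the sum can be computed directly for two made-up discrete distributions:

```python
# Direct computation of Equation (3) for two discrete distributions P and Q
# over the same support. The probability values are illustrative.
import math

P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]

kl = sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)
print(kl)  # D_KL(P || Q); note the asymmetry: D_KL(P||Q) != D_KL(Q||P)
```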
NLP Definitions
Prompt (\(x\)): The input text given to a language model to generate a response or completion.
Completion (\(y\)): The output text generated by a language model in response to a prompt. Often the completion is denoted as \(y|x\).
Chosen Completion (\(y_c\)): The completion that is selected or preferred over other alternatives, often denoted as \(y_{chosen}\).
Rejected Completion (\(y_r\)): The disfavored completion in a pairwise setting, often denoted as \(y_{rejected}\).
Preference Relation (\(\succ\)): A symbol indicating that one completion is preferred over another, e.g., \(y_{chosen} \succ y_{rejected}\).
Policy (\(\pi\)): A probability distribution over possible completions, parameterized by \(\theta\): \(\pi_\theta(y|x)\).
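To ground this notation, a single hypothetical pairwise preference example as it might be stored in a dataset is shown below; the field names and text are illustrative rather than a specific dataset schema.

```python
# A single pairwise preference example in the notation above:
# prompt x, chosen completion y_c, rejected completion y_r, with y_c preferred to y_r.
# Field names and contents are illustrative, not a fixed dataset schema.
preference_example = {
    "prompt": "Explain KL divergence in one sentence.",                     # x
    "chosen": "It measures how one distribution differs from another.",     # y_c
    "rejected": "It is the distance between two numbers.",                  # y_r
}
```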
RL Definitions
Reward (\(r\)): A scalar value indicating the desirability of an action or state, typically denoted as \(r\).
Action (\(a\)): A decision or move made by an agent in an environment, often represented as \(a \in A\), where \(A\) is the set of possible actions.
State (\(s\)): The current configuration or situation of the environment, usually denoted as \(s \in S\), where \(S\) is the state space.
Trajectory (\(\tau\)): A sequence of states, actions, and rewards experienced by an agent: \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T, a_T, r_T)\).
Trajectory Distribution (\(P(\tau|\pi)\)): The probability of a trajectory under policy \(\pi\) is \(P(\tau|\pi) = p(s_0)\prod_{t=0}^T \pi(a_t|s_t)p(s_{t+1}|s_t,a_t)\), where \(p(s_0)\) is the initial state distribution and \(p(s_{t+1}|s_t,a_t)\) is the transition probability.
Policy (\(\pi\)), also called the policy model in RLHF: In RL, a policy is a strategy or rule that the agent follows to decide which action to take in a given state: \(\pi(a|s)\).
Value Function (\(V\)): A function that estimates the expected cumulative reward from a given state: \(V(s) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_t | s_0 = s]\).
Q-Function (\(Q\)): A function that estimates the expected cumulative reward from taking a specific action in a given state: \(Q(s,a) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_t | s_0 = s, a_0 = a]\).
Advantage Function (\(A\)): The advantage function \(A(s,a)\) quantifies the relative benefit of taking action \(a\) in state \(s\) compared to the average action. It’s defined as \(A(s,a) = Q(s,a) - V(s)\). Advantage functions (and value functions) can depend on a specific policy, \(A^\pi(s,a)\).
Expectation of Reward Optimization: The primary goal in RL, which involves maximizing the expected cumulative reward:
\(\max_{\theta} \mathbb{E}_{s \sim \rho_\pi, a \sim \pi_\theta}[\sum_{t=0}^{\infty} \gamma^t r_t]\)
where \(\rho_\pi\) is the state distribution under policy \(\pi\), and \(\gamma\) is the discount factor.
Finite Horizon Reward (\(J(\pi_\theta)\)): The expected finite-horizon discounted return of the policy \(\pi_\theta\), parameterized by \(\theta\), is defined as \(J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \gamma^t r_t \right]\), where \(\tau \sim \pi_\theta\) denotes trajectories sampled by following policy \(\pi_\theta\) and \(T\) is the finite horizon (a brief numerical sketch of returns and advantages follows these definitions).
On-policy: In RLHF, particularly in the debate between RL and Direct Alignment Algorithms, the discussion of on-policy data is common. In the RL literature, on-policy means that the data is generated exactly by the current form of the agent. In the broader preference-tuning literature, on-policy is expanded to mean generations from that version of the model, e.g., an instruction-tuned checkpoint before any preference fine-tuning has been run. In this context, off-policy refers to data generated by any other language model used in post-training.
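To tie several of these definitions together, the short sketch below computes a discounted return and per-step advantages for one made-up trajectory; the rewards, value estimates, and discount factor are illustrative, and the advantage uses a simple one-step (temporal-difference) estimate rather than any particular algorithm's estimator.

```python
# Illustrative computation of a finite-horizon discounted return and
# per-step advantages A(s_t, a_t) = Q(s_t, a_t) - V(s_t) for one trajectory.
# Rewards, value estimates, and gamma are made-up numbers for illustration.
gamma = 0.99
rewards = [0.0, 0.0, 1.0, 0.5]        # r_0 ... r_T from one trajectory tau
values = [0.8, 0.9, 1.2, 0.4]         # V(s_t) estimates from a value function

# Discounted return of the trajectory: sum_t gamma^t * r_t
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))

# One-step estimate of Q(s_t, a_t) ~= r_t + gamma * V(s_{t+1}), treating V as
# zero past the horizon, which gives the advantage A = Q - V at each step.
advantages = []
for t, r in enumerate(rewards):
    next_value = values[t + 1] if t + 1 < len(values) else 0.0
    q_estimate = r + gamma * next_value
    advantages.append(q_estimate - values[t])

print(discounted_return, advantages)
```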
RLHF Only Definitions
Reference Model (\(\pi_\text{ref}\)): A saved set of model parameters used in RLHF, whose outputs are used to regularize the optimization.
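As a brief sketch of how a reference model regularizes optimization (the exact construction varies by algorithm and is covered later in the book), the per-token gap in log-probabilities between the policy and \(\pi_\text{ref}\) gives a sampled estimate of the KL divergence in Equation (3), which can be scaled and subtracted from the reward; the numbers below are made up for illustration.

```python
# Sketch: KL-style penalty between the policy and a frozen reference model,
# as commonly used to regularize RLHF optimization.
# The log-probabilities and coefficient beta are made-up numbers.
beta = 0.1
policy_logprobs = [-1.2, -0.4, -2.1]      # log pi_theta(y_t | x, y_<t)
reference_logprobs = [-1.0, -0.6, -2.0]   # log pi_ref(y_t | x, y_<t)

# Sequence-level difference in log-probabilities; its expectation over
# completions sampled from the policy estimates D_KL(pi_theta || pi_ref).
kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
penalty = beta * kl_estimate              # subtracted from the reward signal
print(penalty)
```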