Problem Setup
This chapter includes the definitions, symbols, and operations frequently used in the RLHF process.
ML Definitions
- Kullback-Leibler (KL) divergence (\(D_{KL}(P || Q)\)), also known as relative entropy, is a measure of the difference between two probability distributions. For discrete probability distributions \(P\) and \(Q\) defined on the same probability space \(\mathcal{X}\), the KL divergence from \(Q\) to \(P\) is defined as:
\[ D_{KL}(P || Q) = \sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right) \]
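The sum above can be evaluated directly for small discrete distributions. Below is a minimal sketch in Python; the distributions `p` and `q` are made-up illustrative values, not from any real model.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities."""
    # Terms with p_i = 0 contribute 0 by convention.
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Toy distributions over a three-element space (illustrative values only).
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

print(kl_divergence(p, q))  # non-negative, and generally not equal to kl_divergence(q, p)
print(kl_divergence(q, p))  # the asymmetry is why KL divergence is not a distance metric
```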
NLP Definitions
Prompt (\(x\)): The input text given to a language model to generate a response or completion.
Completion (\(y\)): The output text generated by a language model in response to a prompt. Often the completion is denoted as \(y|x\).
Chosen Completion (\(y_c\)): The completion that is selected or preferred over other alternatives, often denoted as \(y_{chosen}\). The alternative that is not preferred is the rejected completion, \(y_r\) or \(y_{rejected}\).
Preference Relation (\(\succ\)): A symbol indicating that one completion is preferred over another, e.g., \(y_{chosen} \succ y_{rejected}\).
Policy (\(\pi\)): A probability distribution over possible completions, parameterized by \(\theta\): \(\pi_\theta(y|x)\).
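To make this notation concrete, the sketch below packages one preference example \((x, y_c \succ y_r)\) and scores both completions under a stand-in policy. The dictionary `toy_log_probs` and the helper `log_pi` are hypothetical placeholders for a real language model's log-probabilities; the strings and numbers are made up.

```python
# A single preference datapoint: prompt x, chosen completion y_c, rejected completion y_r.
preference_example = {
    "prompt": "Write a haiku about the ocean.",          # x
    "chosen": "Waves fold into foam; the tide keeps its slow promise.",   # y_c
    "rejected": "The ocean is big and has water in it.",                  # y_r
}

# Toy stand-in for pi_theta(y|x): maps (prompt, completion) to a log-probability.
# A real policy would be an autoregressive LM summing token-level log-probs.
toy_log_probs = {
    (preference_example["prompt"], preference_example["chosen"]): -12.3,
    (preference_example["prompt"], preference_example["rejected"]): -15.8,
}

def log_pi(y, x):
    """log pi_theta(y|x) under the toy policy."""
    return toy_log_probs[(x, y)]

x = preference_example["prompt"]
y_c, y_r = preference_example["chosen"], preference_example["rejected"]
# The preference label says y_c is preferred to y_r; it need not match the policy's own ranking.
print(log_pi(y_c, x), log_pi(y_r, x))
```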
RL Definitions
Reward (\(r\)): A scalar value indicating the desirability of an action or state.
Action (\(a\)): A decision or move made by an agent in an environment, often represented as \(a \in A\), where \(A\) is the set of possible actions.
State (\(s\)): The current configuration or situation of the environment, usually denoted as \(s \in S\), where \(S\) is the state space.
Trajectory (\(\tau\)): A sequence of states, actions, and rewards experienced by an agent: \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, ..., s_T, a_T, r_T)\).
Trajectory Distribution (\(P(\tau|\pi)\)): The probability of a trajectory under policy \(\pi\) is \(P(\tau|\pi) = p(s_0)\prod_{t=0}^T \pi(a_t|s_t)p(s_{t+1}|s_t,a_t)\), where \(p(s_0)\) is the initial state distribution and \(p(s_{t+1}|s_t,a_t)\) is the transition probability.
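A minimal sketch of this trajectory probability, assuming a made-up two-state tabular MDP whose initial-state distribution, policy, and transition probabilities are stored in plain dictionaries:

```python
# Toy two-state, two-action MDP (all numbers are illustrative).
p_s0 = {"s0": 1.0, "s1": 0.0}                      # initial state distribution p(s_0)
pi = {("s0", "a0"): 0.8, ("s0", "a1"): 0.2,        # policy pi(a|s)
      ("s1", "a0"): 0.5, ("s1", "a1"): 0.5}
p_trans = {("s0", "a0", "s1"): 0.9, ("s0", "a0", "s0"): 0.1,   # transitions p(s'|s,a)
           ("s1", "a0", "s0"): 1.0}

def trajectory_prob(states, actions):
    """P(tau|pi) = p(s_0) * prod_t pi(a_t|s_t) * p(s_{t+1}|s_t, a_t)."""
    prob = p_s0[states[0]]
    for t in range(len(actions)):
        prob *= pi[(states[t], actions[t])]
        if t + 1 < len(states):  # the final action may have no recorded next state
            prob *= p_trans[(states[t], actions[t], states[t + 1])]
    return prob

# tau visits s0 -> s1 -> s0 while taking action a0 twice.
print(trajectory_prob(["s0", "s1", "s0"], ["a0", "a0"]))  # 1.0 * 0.8 * 0.9 * 0.5 * 1.0 = 0.36
```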
Policy (\(\pi\)): In RL, a policy is a strategy or rule that the agent follows to decide which action to take in a given state: \(\pi(a|s)\).
Value Function (\(V\)): A function that estimates the expected cumulative reward from a given state: \(V(s) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_t | s_0 = s]\).
Q-Function (\(Q\)): A function that estimates the expected cumulative reward from taking a specific action in a given state: \(Q(s,a) = \mathbb{E}[\sum_{t=0}^{\infty} \gamma^t r_t | s_0 = s, a_0 = a]\).
Advantage Function (\(A\)): The advantage function \(A(s,a)\) quantifies the relative benefit of taking action \(a\) in state \(s\) compared to the average action. It’s defined as \(A(s,a) = Q(s,a) - V(s)\). Advantage functions (and value functions) can depend on a specific policy, \(A^\pi(s,a)\).
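The identity \(A(s,a) = Q(s,a) - V(s)\), with \(V(s)\) equal to the policy-weighted average of \(Q(s,a)\), can be checked on hypothetical numbers for a single state with two actions:

```python
# Hypothetical Q-values and policy probabilities for one state s with two actions.
q = {"a0": 1.0, "a1": 3.0}          # Q(s, a)
pi = {"a0": 0.75, "a1": 0.25}       # pi(a|s)

# V(s) is the expectation of Q(s, a) over the policy's action distribution.
v = sum(pi[a] * q[a] for a in q)    # 0.75*1.0 + 0.25*3.0 = 1.5

# Advantage: how much better (or worse) each action is than the policy's average.
advantage = {a: q[a] - v for a in q}
print(v)          # 1.5
print(advantage)  # {'a0': -0.5, 'a1': 1.5}; advantages average to 0 under pi
```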
Expectation of Reward Optimization: The primary goal in RL is to maximize the expected cumulative (discounted) reward:
\[ \max_{\theta} \mathbb{E}_{s \sim \rho_\pi, a \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \]
where \(\rho_\pi\) is the state distribution under policy \(\pi\), and \(\gamma\) is the discount factor.
Finite Horizon Reward (\(J(\pi_\theta)\)): The expected finite-horizon undiscounted return of the policy \(\pi_\theta\), parameterized by \(\theta\), is defined as: \(J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} r_t\right]\), where \(\tau \sim \pi_\theta\) denotes trajectories sampled by following policy \(\pi_\theta\) and \(T\) is the finite horizon.
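Both return objectives above reduce to averaging sums of rewards over sampled trajectories. A minimal Monte Carlo sketch, assuming the per-step reward sequences of a few trajectories are already available (the rollout of \(\pi_\theta\) itself is omitted and the numbers are illustrative):

```python
# Each inner list is the reward sequence r_0, ..., r_T of one sampled trajectory
# (made-up values; in practice these come from rolling out pi_theta).
sampled_reward_sequences = [
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
]

def discounted_return(rewards, gamma):
    """sum_t gamma^t * r_t for one trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def finite_horizon_return(rewards):
    """Undiscounted finite-horizon return sum_{t=0}^{T} r_t."""
    return sum(rewards)

gamma = 0.99
n = len(sampled_reward_sequences)
# Monte Carlo estimates of the two objectives: averages over sampled trajectories.
expected_discounted = sum(discounted_return(r, gamma) for r in sampled_reward_sequences) / n
j_pi = sum(finite_horizon_return(r) for r in sampled_reward_sequences) / n

print(expected_discounted, j_pi)
```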