Reinforcement Learning from Human Feedback

A short introduction to RLHF and post-training focused on language models.

Nathan Lambert

RL Cheatsheet

A one-page reference of all core RL loss functions used in language model post-training. Covers policy gradient methods, reward modeling, and direct alignment. From the Reinforcement Learning and Direct Alignment chapters.

Download PDF · Download TeX source


Core loss \(\mathcal{L}(\theta)\) per RL algorithm


Policy Gradient \(\displaystyle -\;\mathbb{E}_\tau\!\left[\sum_{t=0}^{T} \log \pi_\theta(a_t\mid s_t)\, A^{\pi_\theta}(s_t, a_t)\right]\)
REINFORCE (REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility) \(\displaystyle -\;\frac{1}{T}\sum_{t=1}^{T}\log \pi_\theta(a_t\mid s_t)\,(G_t - b(s_t))\)
REINFORCE Leave One Out (RLOO) \(\displaystyle -\;\frac{1}{K}\sum_{i=1}^{K}\sum_t \log \pi_\theta(a_{i,t}\mid s_{i,t})\!\left(R_i - \frac{1}{K\!-\!1}\sum_{j\neq i}R_j\right)\)
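The leave-one-out baseline above can be sketched in a few lines of plain Python (an illustrative sketch, not from the cheatsheet; the function name is hypothetical):

```python
def rloo_advantages(rewards):
    """Advantage for each completion: R_i minus the mean reward of the
    other K-1 completions sampled for the same prompt (leave-one-out)."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# With K = 3 rewards, each baseline is the mean of the other two:
# rloo_advantages([1.0, 2.0, 3.0]) -> [-1.5, 0.0, 1.5]
```

Because the baseline for completion \(i\) never uses \(R_i\), it is unbiased, unlike subtracting the plain group mean.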
Proximal Policy Optimization (PPO) \(\displaystyle -\;\frac{1}{T}\sum_{t=1}^{T}\min\!\left(\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)} A_t,\; \text{clip}\!\left(\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)},\, 1\!-\!\varepsilon,\, 1\!+\!\varepsilon\right) A_t\right)\)
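The clipped surrogate follows directly from the formula; this per-token sketch (names assumed) takes log-probabilities rather than raw probabilities, as implementations typically do for numerical stability:

```python
import math

def ppo_token_loss(logp_new, logp_old, advantage, eps=0.2):
    """Negative clipped surrogate for a single token.

    The importance ratio pi_theta(a|s) / pi_theta_old(a|s) is
    recovered by exponentiating the log-prob difference.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    # min() keeps the pessimistic (smaller) objective; the leading
    # minus sign turns the objective into a loss to minimize.
    return -min(ratio * advantage, clipped * advantage)
```

When the ratio sits inside \([1-\varepsilon,\, 1+\varepsilon]\) the clip is inactive and the loss reduces to the unclipped surrogate \(-\,\text{ratio}\cdot A_t\).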
Group Relative Policy Optimization (GRPO)
\(\displaystyle -\;\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|a_i|}\sum_{t=1}^{|a_i|}\min\!\left(\frac{\pi_\theta(a_{i,t}\mid s_{i,t})}{\pi_{\theta_{\text{old}}}(a_{i,t}\mid s_{i,t})} \hat{A}_i,\; \text{clip}\!\left(\frac{\pi_\theta(a_{i,t}\mid s_{i,t})}{\pi_{\theta_{\text{old}}}(a_{i,t}\mid s_{i,t})},\, 1\!-\!\varepsilon,\, 1\!+\!\varepsilon\right) \hat{A}_i\right)\)

where \(\hat{A}_i = \dfrac{r_i - \overline{r}}{\text{std}(r)}\)
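The group-normalized advantage \(\hat{A}_i\) is a z-score of each reward against its own group. A sketch, assuming the population standard deviation and a small epsilon to guard the all-rewards-equal case (both implementation choices, not fixed by the formula):

```python
def grpo_advantages(rewards, eps=1e-8):
    """z-score each reward against its group: (r_i - mean) / std."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Note that the same \(\hat{A}_i\) is applied to every token \(t\) of completion \(i\), which is why the GRPO loss above carries no per-token advantage.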

Group Sequence Policy Optimization (GSPO)
\(\displaystyle -\;\frac{1}{G}\sum_{i=1}^{G}\min\!\left(\!\left(\frac{\pi_\theta(a_i\mid s)}{\pi_{\theta_{\text{old}}}(a_i\mid s)}\right)^{\!\frac{1}{|a_i|}} A_i,\; \text{clip}\!\left(\!\left(\frac{\pi_\theta(a_i\mid s)}{\pi_{\theta_{\text{old}}}(a_i\mid s)}\right)^{\!\frac{1}{|a_i|}},\, 1\!-\!\varepsilon,\, 1\!+\!\varepsilon\right) A_i\right)\)
Clipped Importance Sampling Policy Optimization (CISPO)
\(\displaystyle -\;\frac{1}{\sum_i |a_i|}\sum_{i=1}^{K}\sum_{t=1}^{|a_i|} \text{sg}\!\Big(\text{clip}\big(\tfrac{\pi_\theta(a_{i,t}\mid s)}{\pi_{\theta_{\text{old}}}(a_{i,t}\mid s)},\, 1\!-\!\varepsilon_{\text{low}},\, 1\!+\!\varepsilon_{\text{high}}\big)\Big)\, A_{i,t} \log \pi_\theta(a_{i,t}\mid s)\)

where \(\text{sg}(\cdot)\) = stop gradient

Other core equations


RLHF Objective \(\displaystyle \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}}\,\mathbb{E}_{y \sim \pi(y|x)} \Big[r_\theta(x, y)\Big] - \beta\, \mathcal{D}_{\text{KL}}\!\Big(\pi(y|x) \,\|\, \pi_{\text{ref}}(y|x)\Big)\)
Bradley–Terry Reward Model
\(\displaystyle \mathcal{L}(\theta) = -\log \sigma\!\Big( r_{\theta}(y_c \mid x) - r_{\theta}(y_r \mid x) \Big)\)
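The pairwise loss is the negative log-sigmoid of the reward margin; a scalar sketch (function and argument names assumed):

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """-log sigma(r_chosen - r_rejected): small when the reward model
    scores the chosen completion well above the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At a margin of zero the loss is \(\log 2 \approx 0.693\); it decays toward zero as the margin grows and diverges as the model ranks the rejected completion higher.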
Direct Preference Optimization (DPO)
\(\displaystyle \mathcal{L}(\theta) = -\;\mathbb{E}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_{\theta}(y_c \mid x)}{\pi_{\text{ref}}(y_c \mid x)} - \beta \log \frac{\pi_{\theta}(y_r \mid x)}{\pi_{\text{ref}}(y_r \mid x)} \right) \right]\)
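For a single preference pair the expectation drops out, and the loss depends only on four sequence log-probabilities. A sketch (names assumed):

```python
import math

def dpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta=0.1):
    """DPO loss for one pair, from sequence log-probs of the chosen (c)
    and rejected (r) completions under the policy and the frozen
    reference model."""
    margin = beta * ((logp_c - ref_logp_c) - (logp_r - ref_logp_r))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization the policy equals the reference, the margin is zero, and every pair contributes \(\log 2\); training pushes up the implicit reward gap between chosen and rejected completions.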

Notation


\(x, y\) Prompt, completion
\(y_c,\, y_r\) Chosen / rejected completion
\(y_c \succ y_r\) \(y_c\) is preferred over \(y_r\)
\(\pi_\theta\) Policy (the model being trained)
\(\pi_{\text{ref}}\) Reference policy (frozen copy)
\(\pi_{\theta_{\text{old}}}\) Policy at start of RL batch updates
\(r_\theta(y \mid x)\) Reward model score
\(A_t\) Advantage estimate
\(\beta\) KL penalty coefficient
\(\sigma(z)\) Sigmoid: \(1/(1+e^{-z})\)
\(\mathcal{D}_{\text{KL}}(P \| Q)\) KL divergence between \(P\) and \(Q\)
\(G_t\) Return: \(\sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\)
\(V(s)\) Value: \(\mathbb{E}[G_t \mid S_t = s]\)