Policy Gradient Algorithms🔗
The algorithms that popularized RLHF for language models were policy-gradient reinforcement learning algorithms. These algorithms, such as PPO, GRPO, and REINFORCE, use recently generated samples to update their model rather than storing scores in a replay buffer. In this section we will cover the fundamentals of the policy gradient algorithms and how they are used in the modern RLHF framework.
At a machine learning level, this section covers the most complex material in the RLHF process. Still, as with most modern AI models, the largest determining factor of success is the data provided as input to the process.
The most popular algorithms used for RLHF have evolved over time. When RLHF came onto the scene with ChatGPT, it was largely known that they used a variant of PPO, and many initial efforts were built upon that. Over time, multiple research projects showed the promise of REINFORCE-style algorithms [1] [2], touted for their simplicity over PPO: no value model (which saves memory and therefore the number of GPUs required) and simpler value estimation (no GAE). More algorithms have emerged, including Group Relative Policy Optimization, which is particularly popular with reasoning tasks, but in general many of these algorithms can be tuned to fit a specific task. In this chapter, we cover the core policy gradient setup and the three algorithms mentioned above due to their central role in the establishment of a canonical RLHF literature.
For definitions of symbols, see the problem setup chapter.
Policy Gradient Algorithms🔗
Reinforcement learning algorithms are designed to maximize the future, discounted reward across a trajectory of states, $s \in \mathcal{S}$, and actions, $a \in \mathcal{A}$. The objective of the agent, often called the return, is the sum of discounted, future rewards at a given time $t$, where $\gamma \in [0,1)$ prioritizes near-term rewards:

$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$$

The return definition can also be estimated recursively as:

$$G_t = R_{t+1} + \gamma G_{t+1}.$$

This return is the basis for learning a value function $V(s)$, the estimated future return given a current state:

$$V(s) = \mathbb{E}\big[G_t \mid S_t = s\big].$$

All policy gradient algorithms solve an objective for such a value function induced from a specific policy, $\pi(a|s)$:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[G(\tau)\big],$$

Where $\tau$ is a trajectory of states and actions sampled from the policy $\pi_\theta$ and $G(\tau)$ is its return.
The core of policy gradient algorithms is computing the gradient with respect to the finite-time expected return over the current policy. With this expected return, $J(\theta)$, the policy parameters are updated by gradient ascent with learning rate $\alpha$:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta).$$

The core implementation detail is how to compute said gradient. Schulman et al. 2015 provides an overview of the different ways that policy gradients can be computed [3]. The goal is to estimate the exact gradient $g := \nabla_\theta \mathbb{E}\big[\sum_{t=0}^{\infty} r_t\big]$, which has many estimators of the form:

$$g = \mathbb{E}\Big[\sum_{t=0}^{\infty} \Psi_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big],$$

Where $\Psi_t$ can be any of the following:

- $\sum_{t=0}^{\infty} r_t$: total reward of the trajectory.
- $\sum_{t'=t}^{\infty} r_{t'}$: reward following action $a_t$, also described as the return, $G$.
- $\sum_{t'=t}^{\infty} r_{t'} - b(s_t)$: baselined version of previous formula.
- $Q^{\pi}(s_t, a_t)$: state-action value function.
- $A^{\pi}(s_t, a_t)$: advantage function, which yields the lowest possible theoretical variance if it can be computed accurately.
- $r_t + V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$: TD residual.
The baseline is a value used to reduce variance of policy updates (more on this below).
For language models, some of these concepts do not make as much sense. For example, we know that for a deterministic policy the value function is defined as $V(s) = \max_a Q(s,a)$, while for a stochastic policy it is $V(s) = \mathbb{E}_{a \sim \pi(a|s)}\big[Q(s,a)\big]$. If we define $s+a$ as the concatenation of the prompt $s$ and the completion $a$, an advantage estimate can be written as:

$$A(s,a) = Q(s,a) - V(s) = r + \gamma V(s+a) - V(s),$$

Which is a combination of the reward, the value of the prompt, and the discounted value of the entire utterance.
Vanilla Policy Gradient🔗
The vanilla policy gradient implementation optimizes the above expression for $J(\theta)$ by taking steps in the direction of its gradient with respect to the policy parameters:

$$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)\big|_{\theta_k}.$$
A common problem with vanilla policy gradient algorithms is the high variance in gradient updates, which can be mitigated in multiple ways. In order to alleviate this, various techniques are used to normalize the value estimation, called baselines. Baselines accomplish this in multiple ways, effectively normalizing by the value of the state relative to the downstream action (e.g. in the case of Advantage, which is the difference between the Q value and the value). The simplest baselines are averages over the batch of rewards or a moving average. Even these simple baselines can de-bias the gradients, since $\mathbb{E}_{a \sim \pi(a|s)}\big[\nabla_\theta \log \pi_\theta(a|s)\big] = 0$, so subtracting a state-dependent baseline leaves the expected gradient unchanged while reducing its variance.
Many of the policy gradient algorithms discussed in this chapter build on the advantage formulation of policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_\tau \Big[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t \Big].$$

A core piece of the policy gradient implementation involves taking the derivative of the probabilistic policies. This comes from:

$$\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)},$$

Which is derived from the chain rule:

$$\nabla_\theta \log x = \frac{1}{x} \nabla_\theta x.$$
We will use this later on in the chapter.
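To make the log-derivative trick concrete, here is a minimal sketch (not from the original text) that uses PyTorch autograd on a toy softmax policy to verify that $\nabla_\theta \log \pi_\theta(a|s) = \nabla_\theta \pi_\theta(a|s) / \pi_\theta(a|s)$:

```python
import torch

# Toy softmax policy over 4 actions, parameterized by logits (hypothetical example).
logits = torch.randn(4, requires_grad=True)
action = 2

# Gradient of log pi(a) with respect to the logits.
log_prob = torch.log_softmax(logits, dim=-1)[action]
(grad_log,) = torch.autograd.grad(log_prob, logits)

# Gradient of pi(a), divided by pi(a), should match by the chain rule.
prob = torch.softmax(logits, dim=-1)[action]
(grad_prob,) = torch.autograd.grad(prob, logits)

print(torch.allclose(grad_log, grad_prob / prob))  # True
```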
REINFORCE🔗
The algorithm REINFORCE is likely a backronym, but the components it represents are quite relevant for modern reinforcement learning algorithms. It was defined in the seminal paper Simple statistical gradient-following algorithms for connectionist reinforcement learning [4]:
The name is an acronym for “REward Increment = Nonnegative Factor X Offset Reinforcement X Characteristic Eligibility.”
These three components describe how to compute the reward increment, a.k.a. the policy gradient step. The update rule has three pieces:
- Nonnegative factor: This is the learning rate (step size), which must be a positive number, e.g. $\alpha$ below.
- Offset Reinforcement: This is a baseline $b$ or other normalizing factor of the reward to improve stability.
- Characteristic Eligibility: This is how the learning becomes attributed per token. It can be a general eligibility value $e$ per parameter, but is often the log-probability of the policy in modern equations.
Thus, the form looks quite familiar:

$$\Delta\theta = \alpha\,(G - b)\,e,$$

Where $\alpha$ is the learning rate, $(G - b)$ the offset reinforcement, and $e$ the characteristic eligibility. With more modern notation and the generalized return $G_t$, the REINFORCE update is:

$$\nabla_\theta J(\theta) = \mathbb{E}_\tau\Big[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b\big)\Big].$$

Here, the value $(G_t - b)$ plays the role of an advantage estimate: subtracting the baseline $b$ reduces variance without biasing the gradient.
REINFORCE is a specific implementation of vanilla policy gradient that uses a Monte Carlo estimator of the gradient.
REINFORCE can be run without a value network – the value network is only used for the baseline in the policy gradient. PPO, on the other hand, needs the value network to accurately compute the advantage function.
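As a minimal sketch (illustrative, not the canonical implementation; the batch-mean baseline is just one of the simple choices mentioned above), a REINFORCE loss for sequence-level rewards could look like:

```python
import torch

def reinforce_loss(logprobs: torch.Tensor, mask: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE with a batch-mean baseline.

    logprobs: per-token log pi(a_t|s_t), shape [B, L]
    mask: 1 for completion tokens, 0 for padding, shape [B, L]
    rewards: sequence-level rewards, shape [B]
    """
    # Sum log-probabilities over each completion (log of the sequence probability).
    seq_logprob = (logprobs * mask).sum(dim=1)  # [B]
    # Simple baseline: mean reward over the batch.
    baseline = rewards.mean()
    advantage = rewards - baseline  # [B]
    # Negative sign: minimizing the loss increases the likelihood of high-reward sequences.
    return -(advantage.detach() * seq_logprob).mean()
```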
REINFORCE Leave One Out (RLOO)🔗
The core implementation detail of REINFORCE Leave One Out versus standard REINFORCE is that it computes the baseline for each sample from the average reward of the other completions generated for the same prompt – rather than averaging over all rewards in the batch [5], [1], [6].
Crucially, this only works when generating multiple responses per prompt, which is common practice in multiple domains of finetuning language models with RL.
Specifically, for the REINFORCE Leave-One-Out (RLOO) baseline, given $K$ sampled completions $\{y_1, \dots, y_K\}$ to a prompt $x$ with rewards $R(y_i, x)$, the baseline for sample $i$ is the mean reward of the other $K-1$ samples:

$$b(x, y_i) = \frac{1}{K-1}\sum_{j \neq i} R(y_j, x),$$

resulting in the advantage:

$$A(x, y_i) = R(y_i, x) - b(x, y_i).$$

Equivalently, this can be expressed as:

$$A(x, y_i) = \frac{K}{K-1}\Big(R(y_i, x) - \frac{1}{K}\sum_{j=1}^{K} R(y_j, x)\Big).$$
This is a simple, low-variance advantage update that is very similar to GRPO, which will be discussed later: relative to GRPO, RLOO uses REINFORCE with the KL penalty in a different location and without step-size clipping. Still, the advantage from RLOO could be combined with the clipping of PPO, showing how similar many of these algorithms are.
RLOO and other algorithms that do not use a value network assign the advantage (or reward) of the sequence to every token for the loss computation. Algorithms that use a learned value network, such as PPO, assign a different value to every token individually, discounting from the final reward achieved at the EOS token. For example, with the KL divergence distance penalty, RLOO sums it over the completion while PPO and similar algorithms compute it on a per-token basis and subtract it from the reward (or the advantage, in the case of GRPO). These details and trade-offs are discussed later in the chapter.
Proximal Policy Optimization🔗
Proximal Policy Optimization (PPO) [7] is one of the foundational algorithms behind Deep RL’s successes (such as OpenAI Five for Dota 2 [8] and large amounts of research). The loss function is as follows per sample:

$$J(\theta) = \min\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}\,A,\ \text{clip}\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)},\, 1-\varepsilon,\, 1+\varepsilon\right) A\right).$$
For language models, the loss is computed per token, which intuitively can be grounded in how one would compute the probability of the entire sequence of autoregressive predictions – by a product of probabilities. From there, the common implementation is with log-probabilities that make the computation far more tractable.
$$J(\theta) = \frac{1}{|a|}\sum_{t=0}^{|a|} \min\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\,A_t,\ \text{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A_t\right).$$

This is the per-token version of PPO, which also applies to other policy-gradient methods and is explored further later in the implementation section of this chapter. Here, the term for averaging by the number of tokens in the action, $\frac{1}{|a|}$, is one choice of loss aggregation; alternatives are discussed in the Loss Aggregation section below.
Here we will explain the different cases this loss function triggers given various advantages and policy ratios. At an implementation level, the inner computations for PPO involve standard policy gradient and a clipped policy gradient.

To understand how different situations emerge, we can define the policy ratio as:

$$R(\theta) = \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)}.$$
The policy ratio is a centerpiece of PPO and related algorithms. It emerges from computing the gradient of a policy and controls the parameter updates in a very intuitive way. For any batch of data, the policy ratio starts at 1 for the first gradient step for that batch (common practice is to take 1-4 gradient steps per batch with policy gradient algorithms). Then, the policy ratio will be above one if that gradient step increased the likelihood of certain tokens with an associated positive advantage, or less than one for the other case.
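A small sketch (illustrative only) of how the policy ratio is computed from stored and current log-probabilities, and why it is exactly 1 on the first gradient step for a batch:

```python
import torch

# Log-probs saved when the batch was generated (the "old" policy).
old_logprobs = torch.tensor([-1.2, -0.7, -2.3])

# On the first gradient step the current policy is still the sampling policy,
# so the per-token ratio exp(new - old) is exactly 1.
new_logprobs = old_logprobs.clone()
print(torch.exp(new_logprobs - old_logprobs))  # tensor([1., 1., 1.])

# After a step that raises the likelihood of these tokens, the ratio exceeds 1.
new_logprobs = old_logprobs + 0.1
print(torch.exp(new_logprobs - old_logprobs))  # tensor([1.1052, 1.1052, 1.1052])
```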
The policy ratio and advantage together can occur in a few different configurations.
The first case is when the advantage is positive and the policy ratio exceeds $1+\varepsilon$ (the new policy is much more likely to take the action), which makes the objective:

$$J(\theta) = \min\big(R(\theta)A,\ (1+\varepsilon)A\big) = (1+\varepsilon)A.$$

This will increase the probability ratio, making the action even more likely, but only up until the clipping parameter epsilon. The similar conditions are when the advantage is still positive, but the likelihood ratio shifts.

For positive advantage and ratio less than $1-\varepsilon$, the objective is:

$$J(\theta) = \min\big(R(\theta)A,\ (1-\varepsilon)A\big),$$

That reduces to $R(\theta)A$ because of the less-than assumption, $R(\theta) < 1-\varepsilon$, together with the positive advantage.

Similarly, if the probability ratio is not clipping, $1-\varepsilon \le R(\theta) \le 1+\varepsilon$, the objective also reduces to the standard policy gradient term, $R(\theta)A$.

If the advantage is negative, this looks similar. A clipped objective will occur when $R(\theta) < 1-\varepsilon$, as follows:

$$J(\theta) = \min\big(R(\theta)A,\ (1-\varepsilon)A\big),$$

Which, because $A < 0$, is equivalent to:

$$J(\theta) = \max\big(R(\theta),\ 1-\varepsilon\big)A.$$

Then the objective becomes:

$$J(\theta) = (1-\varepsilon)A.$$

The other cases follow as above, inverted, and are left as an exercise to the reader.
All of these are designed to make the behaviors where advantage is positive more likely and keep the gradient step within the trust region. It is crucial to remember that PPO within the trust region is the same as standard forms of policy gradient.
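The case analysis above can be checked numerically. The following sketch (illustrative only) evaluates the per-token PPO objective for a few advantage/ratio pairs with $\varepsilon = 0.2$:

```python
def ppo_objective(advantage: float, ratio: float, eps: float = 0.2) -> float:
    """Per-token PPO objective min(r*A, clip(r, 1-eps, 1+eps)*A) for one value pair."""
    clipped_ratio = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage, ratio above 1 + eps: objective is capped at (1 + eps) * A.
print(ppo_objective(1.0, 1.5))   # 1.2
# Positive advantage, ratio inside the trust region: standard policy gradient r * A.
print(ppo_objective(1.0, 1.1))   # 1.1
# Negative advantage, ratio below 1 - eps: objective is capped at (1 - eps) * A.
print(ppo_objective(-1.0, 0.5))  # -0.8
# Negative advantage, ratio above 1 + eps: unclipped, strongly penalized.
print(ppo_objective(-1.0, 1.5))  # -1.5
```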
Group Relative Policy Optimization🔗
Group Relative Policy Optimization (GRPO) is introduced in DeepSeekMath [10], and used in other DeepSeek works, e.g. DeepSeek-V3 [11] and DeepSeek-R1 [12]. GRPO can be viewed as a PPO-inspired algorithm with a very similar surrogate loss, but it avoids learning a value function with another copy of the original policy language model (or another checkpoint for initialization). This brings two posited benefits:
- Avoiding the challenge of learning a value function from a LM backbone, where research hasn’t established best practices.
- Saves memory by not needing to keep another set of model weights in memory.
GRPO does this by simplifying the value estimation and assigning the same value to every token in the episode (i.e. in the completion to a prompt, each token gets assigned the same value rather than discounted rewards in a standard value function) by estimating the advantage with a group baseline. The estimate is done by collecting multiple completions ($a_i$) and rewards ($r_i$), i.e. a Monte Carlo estimate, from the same initial state / prompt ($s$), and computing the advantage of each completion relative to the others in the group.
To state this formally, the GRPO objective is very similar to the PPO objective above. For GRPO, the loss is accumulated over a group of responses $\{a_1, a_2, \dots, a_G\}$ generated for a given prompt $s$:

$$J(\theta) = \frac{1}{G}\sum_{i=1}^{G} \left( \min\left(\frac{\pi_\theta(a_i|s)}{\pi_{\theta_{old}}(a_i|s)}\,A_i,\ \text{clip}\left(\frac{\pi_\theta(a_i|s)}{\pi_{\theta_{old}}(a_i|s)},\, 1-\varepsilon,\, 1+\varepsilon\right) A_i \right) - \beta\, D_{KL}\big(\pi_\theta \,\|\, \pi_{ref}\big) \right).$$

As above, we can expand this into a per-token loss computation:

$$J(\theta) = \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|a_i|}\sum_{t=1}^{|a_i|} \left( \min\left(\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\pi_{\theta_{old}}(a_{i,t}|s_{i,t})}\,A_{i,t},\ \text{clip}\left(\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\pi_{\theta_{old}}(a_{i,t}|s_{i,t})},\, 1-\varepsilon,\, 1+\varepsilon\right) A_{i,t} \right) - \beta\, D_{KL}\big(\pi_\theta(\cdot|s_{i,t}) \,\|\, \pi_{ref}(\cdot|s_{i,t})\big) \right).$$

Note that relative to PPO, the standard implementation of GRPO includes the KL distance in the loss. With outcome rewards, the advantage computation for the completion index $i$ is:

$$A_i = \frac{r_i - \text{mean}(\{r_1, \dots, r_G\})}{\text{std}(\{r_1, \dots, r_G\})}.$$
Intuitively, the GRPO update is comparing multiple answers to a single question within a batch. The model learns to become more like the answers marked as correct and less like the others. This is a very simple way to compute the advantage, which is the measure of how much better a specific action is than the average at a given state. Relative to PPO, REINFORCE, and broadly RLHF performed with a reward model rating (relative to output reward), GRPO is often run with a far higher number of samples per prompt. Here, the current policy generates multiple responses to a given prompt, and the group-wise GRPO advantage estimate is given valuable context.
The advantage computation for GRPO has trade-offs in its biases. The normalization by standard deviation rewards questions in a batch that have low variation in answer correctness: for questions with nearly all correct or nearly all incorrect answers, the standard deviation is lower and the advantage is higher. [9] proposes removing the standard deviation term given this bias, but this comes at the cost of down-weighting questions that were almost all incorrect with only a few correct answers, which could be seen as valuable learning signal.
The advantage computation above is the implementation of GRPO when working with outcome supervision (either a standard reward model or a single verifiable reward); a different implementation is needed with process supervision. In that case, GRPO computes the advantage as the sum of the normalized rewards for the subsequent reasoning steps.
Finally, GRPO’s advantage estimation can also be applied without the PPO clipping to more vanilla versions of policy gradient (e.g. REINFORCE), but it is not the canonical form. As an example of how these algorithms are intertwined, we can show that the advantage estimation in a variant of GRPO, Dr. GRPO (GRPO Done Right) [9], is equivalent to the RLOO estimation up to a constant scaling factor (which normally does not matter due to implementation details that normalize the advantage). Dr. GRPO removes the standard deviation normalization term from the GRPO advantage above – note that this also rescales the advantages, which is equivalent to increasing the effective learning rate on samples with higher variance in answer scores relative to GRPO. This addresses a bias towards questions with low reward variance – i.e. where almost all the answers are right or wrong – but comes at a potential cost on problems where just one sample gets the answer right, which can be important to learn from. The Dr. GRPO advantage for completion $i$ within a group of size $G$ is defined as:

$$\tilde{A}_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j.$$
Here, in the same notation, we can recall the RLOO advantage estimation as:

$$A_i^{RLOO} = r_i - \frac{1}{G-1}\sum_{j \neq i} r_j.$$

Thus, if we multiply the Dr. GRPO advantage definition by $\frac{G}{G-1}$, we recover the RLOO formulation exactly:

$$\frac{G}{G-1}\,\tilde{A}_i = \frac{G}{G-1}\Big(r_i - \frac{1}{G}\sum_{j=1}^{G} r_j\Big) = r_i - \frac{1}{G-1}\sum_{j \neq i} r_j = A_i^{RLOO}.$$
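A small numerical sketch (illustrative only) comparing the three advantage estimates on one group of rewards confirms the scaling relationship:

```python
import torch

# One group of G = 4 rewards for the same prompt (hypothetical values).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
G = rewards.numel()

# GRPO: mean-centered and divided by the group standard deviation.
grpo_adv = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# Dr. GRPO: mean-centered only (standard deviation term removed).
dr_grpo_adv = rewards - rewards.mean()

# RLOO: each reward minus the mean of the *other* G - 1 rewards.
rloo_baseline = (rewards.sum() - rewards) / (G - 1)
rloo_adv = rewards - rloo_baseline

print(grpo_adv)
print(dr_grpo_adv)                # tensor([ 0.5, -0.5, -0.5,  0.5])
print(rloo_adv)                   # tensor([ 0.6667, -0.6667, -0.6667,  0.6667])
print(dr_grpo_adv * G / (G - 1))  # matches rloo_adv
```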
Implementation🔗
Compared to the original Deep RL literature where many of these algorithms were developed, implementing RL for optimizing language models or other large AI models requires many small implementation details. In this section, we highlight some key factors that differentiate the implementations of popular algorithms.
There are many other small details that go into this training. For example, when doing RLHF with language models, a crucial step is generating text that will then be rated by the reward model. Under normal circumstances, the model should generate an end-of-sequence (EOS) token indicating it finished generating, but a common practice is to put a hard cap on generation length to efficiently utilize infrastructure. A failure mode of RLHF is that the model is regularly truncated in its answers, driving the ratings from the reward model out of distribution and to unpredictable scores. The solution to this is to only run the reward model on completions that end with an `eos_token`, and to otherwise assign a penalty to the model for generating too long.
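A minimal sketch of this kind of handling (the variable names are hypothetical, not from a specific library):

```python
import torch

# Sequence-level scores from the reward model and a flag for whether each
# completion terminated with an EOS token before the generation cap.
rm_scores = torch.tensor([0.7, 1.3, 0.2])
ended_with_eos = torch.tensor([True, False, True])

# Overwrite the reward for truncated completions with a fixed penalty so the
# out-of-distribution reward model score is never used.
truncation_penalty = -1.0
rewards = torch.where(ended_with_eos, rm_scores, torch.full_like(rm_scores, truncation_penalty))
print(rewards)  # tensor([ 0.7000, -1.0000,  0.2000])
```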
The popular open-source tools for RLHF have a large variance in implementation details across the algorithms (see table 10 in [13]). Some decisions not covered here include:
- Value network initialization: The internal learned value network used by PPO and other similar algorithms can be initialized from a different model of the same architecture or from randomly initialized weights. This can have a large impact on performance.
- Reward normalization, reward whitening, and/or advantage whitening: Normalization bounds all the values from the RM (or environment) to be between 0 and 1, which can help with learning stability; whitening the rewards or the advantage estimates to zero mean and unit variance can provide an even stronger boost to stability.
- Different KL estimators: With complex language models, precisely computing the KL divergence between models is intractable, so multiple approximations are used to substitute for an exact calculation [14] (a sketch of common estimators follows after this list).
- KL controllers: Original implementations of PPO and related algorithms had dynamic controllers that targeted specific KLs and changed the penalty based on recent measurements. Most modern RLHF implementations use static KL penalties, but this can also vary.
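For example, the sketch below (illustrative; the estimator names follow the convention in [14]) computes common per-token Monte Carlo approximations of the KL divergence between the policy and the reference model from the log-probabilities of sampled tokens:

```python
import torch

# Per-token log-probs of the sampled tokens under the policy and the reference model.
policy_logprobs = torch.tensor([-1.0, -0.5, -2.0])
ref_logprobs = torch.tensor([-1.2, -0.4, -1.8])

# Log-ratio of reference to policy for samples drawn from the policy,
# used to estimate KL(policy || reference).
logr = ref_logprobs - policy_logprobs

k1 = -logr                       # simple estimator, unbiased but can be negative
k2 = 0.5 * logr**2               # low-variance but biased estimator
k3 = torch.exp(logr) - 1 - logr  # unbiased and always non-negative

print(k1.mean(), k2.mean(), k3.mean())
```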
For more details on implementation details for RLHF, see [15]. For further information on the algorithms, see [16].
Policy Gradient Basics🔗
A simple implementation of policy gradient, using advantages to estimate the gradient in preparation for advanced algorithms such as PPO and GRPO, follows:
```python
pg_loss = -advantages * ratio
```
Here, `ratio` is the log-ratio of the new policy model's probabilities relative to the reference model's.
In order to understand this equation it is good to understand different cases that can fall within a batch of updates. Remember that we want the loss to decrease as the model gets better at the task.
Case 1: Positive advantage, so the action was better than the expected value of the state. We want to reinforce this. In this case, the model will make this more likely with the negative sign. To do so it’ll increase the logratio. A positive logratio, or sum of log probabilities of the tokens, means that the model is more likely to generate those tokens.
Case 2: Negative advantage, so the action was worse than the expected value of the state. This follows very similarly. Here, the loss will be positive if the new model was more likely, so the model will try to make it so the policy parameters make this completion less likely.
Case 3: Zero advantage, so no update is needed. The loss is zero, don’t change the policy model.
Loss Aggregation🔗
The question when implementing any policy gradient algorithm with language models is: how do you sum over the KL distance and the loss to design different types of value attribution?
Most of the discussions in this section assume a token-level action, where the RL problem is formatted as a Markov Decision Process (MDP) rather than a bandit problem. In a bandit problem, all the tokens in an action will be given the same loss, which has been the default implementation for some algorithms such as Advantage-Leftover Lunch RL (A-LoL) [17]. The formulation between MDP and bandit is actually an implementation detail over how the loss is aggregated per-sample. A bandit approach takes a mean that assigns the same loss to every token, which also aligns with DPO and other direct alignment algorithms’ standard implementations.
Consider an example where we have the following variables, with a batch size B and sequence length L.
```python
advantages                    # [B, 1]
per_token_probability_ratios  # [B, L]
```
We can approximate the loss as above with a batch multiplication of `pg_loss = -advantages * ratio`. Multiplying these together broadcasts the advantage to be the same for every token of a given completion in the batch (as in the outcome reward setting, rather than a per-token value model setting). The advantages are then multiplied by the per-token probability log-ratios.
In cases where a value network is used, it is easy to see that the different losses can behave very differently. When outcome rewards are used, the advantages are set to be the same per token, so the difference in per-token probability is crucial to policy gradient learning dynamics.
In the below implementations of GRPO and PPO, the loss is summed over the tokens in the completion:
```python
sequence_loss = ((per_token_loss * completion_mask).sum(dim=1) /
                 completion_mask.sum(dim=1)).mean()
```
The operation above is very similar to a masked_mean
operation. An alternative is to average over each token
individually.
```python
token_loss = ((per_token_loss * completion_mask).sum() /
              completion_mask.sum())
```
Intuitively, it could seem that averaging over the sequence is best, as we are trying to reward the model for outcomes and the specific tokens are not as important. However, this can introduce subtle forms of bias. Consider two sequences of different lengths, assigned two different advantages `a_1` and `a_2`.
```python
seq_1_advs = [a_1, a_1, a_1, a_1, a_1]  # 5 tokens
seq_2_advs = [a_2, a_2, a_2, a_2, a_2, a_2, a_2, a_2, a_2, a_2]  # 10 tokens
```
Now, consider if the last token in each sequence is important to the advantage being positive, so it gets increased over the multiple gradient steps per batch. When you convert these to per-token losses, you could get something approximate to:
```python
seq_1_losses = [1, 1, 1, 1, 10]  # 5 tokens
seq_2_losses = [1, 1, 1, 1, 1, 1, 1, 1, 1, 10]  # 10 tokens
```
If we average these over the sequences, we will get the following numbers:
seq_1_loss = 2.8
seq_2_loss = 1.9
If we average these together weighting sequences equally, we get a loss of 2.35. If, instead, we apply the loss equally to each token, the loss is computed by summing all the per-token losses and normalizing by the total number of tokens, which in this case gives 33/15 = 2.2. If the sequences had bigger differences, the two loss values could differ substantially.
For a more complete example of how loss aggregation changes the loss per-token and per-example, see the script below, which computes the loss over a toy batch with two samples, one long and one short. The example uses three loss aggregation techniques: `masked_mean`, which corresponds to a per-sample length normalization; `masked_mean_token_level`, the loss proposed in DAPO [18] with token-level normalization per batch; and `masked_sum_result`, which uses a fixed length normalization from the max length, as in Dr. GRPO [9].
```python
from typing import Optional

import torch


def masked_mean(values: torch.Tensor, mask: torch.Tensor, axis: Optional[int] = None) -> torch.Tensor:
    """Compute mean of tensor with masked values."""
    if axis is not None:
        return (values * mask).sum(axis=axis) / mask.sum(axis=axis)
    else:
        return (values * mask).sum() / mask.sum()


def masked_sum(
    values: torch.Tensor,
    mask: torch.Tensor,
    axis: Optional[int] = None,
    constant_normalizer: float = 1.0,
) -> torch.Tensor:
    """Compute sum of tensor with masked values. Use a constant to normalize."""
    if axis is not None:
        return (values * mask).sum(axis=axis) / constant_normalizer
    else:
        return (values * mask).sum() / constant_normalizer


ratio = torch.tensor([
    [1., 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
], requires_grad=True)

advs = torch.tensor([
    [2, 2, 2, 2, 2, 2, 2],
    [2, 2, 2, 2, 2, 2, 2],
])

masks = torch.tensor([
    # generation 1: 4 tokens
    [1, 1, 1, 1, 0, 0, 0],
    # generation 2: 7 tokens
    [1, 1, 1, 1, 1, 1, 1],
])

max_gen_len = 7

masked_mean_result = masked_mean(ratio * advs, masks, axis=1)
masked_mean_token_level = masked_mean(ratio, masks, axis=None)
masked_sum_result = masked_sum(ratio * advs, masks, axis=1, constant_normalizer=max_gen_len)

print("masked_mean", masked_mean_result)
print("masked_sum", masked_sum_result)
print("masked_mean_token_level", masked_mean_token_level)
# masked_mean tensor([2., 2.], grad_fn=<DivBackward0>)
# masked_sum tensor([1.1429, 2.0000], grad_fn=<DivBackward0>)
# masked_mean_token_level tensor(1., grad_fn=<DivBackward0>)

masked_mean_result.mean().backward()
print("ratio.grad", ratio.grad)
# ratio.grad tensor([[0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000],
#                    [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429]])
ratio.grad.zero_()

masked_sum_result.mean().backward()
print("ratio.grad", ratio.grad)
# ratio.grad tensor([[0.1429, 0.1429, 0.1429, 0.1429, 0.0000, 0.0000, 0.0000],
#                    [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429]])

masked_mean_token_level.mean().backward()
print("ratio.grad", ratio.grad)
# ratio.grad tensor([[0.2338, 0.2338, 0.2338, 0.2338, 0.0000, 0.0000, 0.0000],
#                    [0.2338, 0.2338, 0.2338, 0.2338, 0.2338, 0.2338, 0.2338]])
```
Here it can be seen that for the default GRPO implementation, `masked_mean`, the shorter sequence has a bigger per-token gradient than the longer one, while the two alternatives from Dr. GRPO and DAPO balance the per-token gradients across lengths. Note that these results can vary substantially if gradient accumulation is used, where the gradients are summed across multiple mini-batches before taking an optimizer step. In this case, the balance between shorter and longer sequences can flip.
Another way to aggregate loss, discussed in [9] and with origins in pre-language-model RL research, is to normalize every per-token loss by the maximum sequence length set in the experiment. This would change how the per-token losses compare across batches in the above example.
In practice, the best setup is likely the one suited to the individual, online learning setup. Often in RLHF, the method with the best numerical stability and/or the least variance in the loss is preferred.
Proximal Policy Optimization🔗
There are many, many implementations of PPO available. The core loss computation is shown below. Crucial to stable performance is also the value computation, where multiple options exist (including multiple options for the value model loss).
Note that the reference policy (or old logprobs) here are from the time the generations were sampled and not necessarily the reference policy. The reference policy is only used for the KL distance constraint/penalty.
```python
# B: Batch Size, L: Sequence Length, G: Num of Generations

# Apply KL penalty to rewards
rewards = rewards - self.beta * per_token_kl  # Shape: (B*G, L)

# Get value predictions
values = value_net(completions)  # Shape: (B*G, L)

# Compute simple advantages
advantages = rewards - values.detach()  # Shape: (B*G, L)

# Normalize advantages (optional but stable)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)  # Shape: (B*G, L)

# Compute probability ratio between new and old policies
ratio = torch.exp(new_per_token_logps - per_token_logps)  # Shape: (B*G, L)

# PPO clipping objective
eps = self.cliprange  # e.g. 0.2
pg_losses1 = -advantages * ratio  # Shape: (B*G, L)
pg_losses2 = -advantages * torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # Shape: (B*G, L)
pg_loss_max = torch.max(pg_losses1, pg_losses2)  # Shape: (B*G, L)

# Simple value function loss
vf_loss = 0.5 * ((rewards - values) ** 2)  # Shape: (B*G, L)

# Combine policy and value losses
per_token_loss = pg_loss_max + self.vf_coef * vf_loss  # Shape: (B*G, L)

# Apply completion mask and compute final loss
loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()  # Scalar

# Compute metrics for logging
with torch.no_grad():
    # Compute clipping fraction
    clip_frac = ((pg_losses2 > pg_losses1).float() * completion_mask).sum() / completion_mask.sum()

    # Compute approximate KL
    approx_kl = 0.5 * ((new_per_token_logps - per_token_logps)**2).mean()

    # Compute value loss for logging
    value_loss = vf_loss.mean()
```
The core piece to understand with PPO is how the policy gradient loss is updated. Focus on these three lines:
```python
pg_losses1 = -advantages * ratio  # Shape: (B*G, L)
pg_losses2 = -advantages * torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # Shape: (B*G, L)
pg_loss_max = torch.max(pg_losses1, pg_losses2)  # Shape: (B*G, L)
```
`pg_losses1` is the same as the vanilla advantage-based policy gradient loss above, which is included in PPO, but the loss (and gradient update) can be clipped: PPO controls the update size so it cannot be too big. Because we work with losses (the negative of the objective), which can be negative, taking the maximum of the two terms yields the more conservative version of the vanilla policy gradient update rule.
We know that if we do not constrain the loss, the policy gradient algorithm will update the weights exactly toward the new probability distribution. Hence, by clamping the probability ratio, PPO limits the distance that the update can move the policy parameters.

Finally, the maximum of the two terms is taken, as mentioned above, in order to use the more conservative loss update.
For PPO, all of this happens while learning a value function, which opens more complexity, but this is the core logic for the parameter update.
PPO/GRPO simplification with 1 gradient step per sample (no clipping)🔗
PPO (and GRPO) implementations can be handled much more elegantly if the hyperparameter "number of gradient steps per sample" is equal to 1. Many normal values for this are from 2-4 or higher. In the main PPO and GRPO equations shown above, the "reference" policy is the previous parameters – those used to generate the completions or actions. Thus, if only one gradient step is taken, $\pi_\theta = \pi_{\theta_{old}}$ at the time of the update, the policy ratio is exactly 1, and the clipping never binds.

This leads to PPO or GRPO implementations where the second policy-gradient term and the clipping logic can be omitted, making the optimizer far closer to standard policy gradient.
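A short sketch (illustrative only) of why this works: when the current policy still equals the sampling policy, the clipped term matches the unclipped term, so the max over the two is redundant.

```python
import torch

advantages = torch.tensor([[0.5, 0.5, -1.0]])
old_logprobs = torch.tensor([[-1.0, -2.0, -0.5]])

# First (and only) gradient step for this batch: the current policy is the
# policy that generated the data, so the ratio is exactly 1 everywhere.
new_logprobs = old_logprobs.clone().requires_grad_(True)
ratio = torch.exp(new_logprobs - old_logprobs)

eps = 0.2
pg_losses1 = -advantages * ratio
pg_losses2 = -advantages * torch.clamp(ratio, 1 - eps, 1 + eps)

# The two terms are identical, so the clipping logic can be dropped and the
# loss reduces to the vanilla policy-gradient form -A * ratio.
print(torch.equal(pg_losses1, pg_losses2))  # True
```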
Group Relative Policy Optimization🔗
The DeepSeekMath paper details some implementation details of GRPO that differ from PPO [10], especially when comparing to a standard application of PPO from Deep RL rather than language models. For example, the KL penalty within the RLHF optimization (recall that the KL penalty is also used when training reasoning models on verifiable rewards without a reward model) is applied directly in the loss update rather than to the reward function. Where the standard KL penalty application for RLHF modifies the reward as

$$r = r_\theta - \beta\, D_{KL}\big(\pi_\theta(\cdot|x) \,\|\, \pi_{ref}(\cdot|x)\big),$$

GRPO subtracts a per-token KL term directly in the loss, as in the objective above.

Though there are multiple ways to implement this, traditionally the KL distance is computed with respect to each token in the completion to a prompt, using one of the per-token KL estimators discussed in the implementation details above.
Putting it together, using the first loss accumulation, the pseudocode can be written as below.
```python
# B: Batch Size, L: Sequence Length, G: Number of Generations

# Compute grouped-wise rewards
mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1)  # Shape: (B,)
std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1)

# Normalize the rewards to compute the advantages
mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
std_grouped_rewards = std_grouped_rewards.repeat_interleave(self.num_generations, dim=0)  # Shape: (B*G,)

# Compute advantages
advantages = (rewards - mean_grouped_rewards) / (std_grouped_rewards + 1e-4)
advantages = advantages.unsqueeze(1)  # Shape: (B*G, 1)

# Compute probability ratio between new and old policies
ratio = torch.exp(new_per_token_logps - per_token_logps)  # Shape: (B*G, L)

# PPO clipping objective
eps = self.cliprange  # e.g. 0.2
pg_losses1 = -advantages * ratio  # Shape: (B*G, L)
pg_losses2 = -advantages * torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # Shape: (B*G, L)
pg_loss_max = torch.max(pg_losses1, pg_losses2)  # Shape: (B*G, L)

# Combine with KL penalty
# important to GRPO -- PPO applies this in the reward traditionally
per_token_loss = pg_loss_max + self.beta * per_token_kl  # Shape: (B*G, L)

# Apply completion mask and compute final loss
loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()  # Scalar

# Compute core metric for logging (KL, reward, etc. also logged)
with torch.no_grad():
    # Compute clipping fraction
    clip_frac = ((pg_losses2 > pg_losses1).float() * completion_mask).sum() / completion_mask.sum()

    # Compute approximate KL
    approx_kl = 0.5 * ((new_per_token_logps - per_token_logps)**2).mean()
```
For more details on how to interpret this code, see the PPO section above.
RLOO vs. GRPO🔗
The advantage updates for RLOO follow very closely to GRPO, highlighting the conceptual similarity of the algorithms when taken separately from the PPO-style clipping and KL penalty details. Specifically, for RLOO, the advantage is computed relative to a baseline that is extremely similar to that of GRPO – the completion's reward relative to the rewards of the other completions for that same question. Concisely, the RLOO advantage estimate follows as (expanded from TRL's implementation):
```python
# rloo_k --> number of completions per prompt
# rlhf_reward --> initially a flat tensor of total rewards for all completions. Length B = N x k
rlhf_reward = rlhf_reward.reshape(rloo_k, -1)
# Now, Shape: (k, N), each column j contains the k rewards for prompt j.

baseline = (rlhf_reward.sum(0) - rlhf_reward) / (rloo_k - 1)
# baseline --> Leave-one-out baseline rewards. Shape: (k, N)
# baseline[i, j] is the avg reward of samples i' != i for prompt j.

advantages = rlhf_reward - baseline
# advantages --> Same Shape: (k, N)

advantages = advantages.flatten()  # Same shape as original tensor
```
The rest of the implementation details for RLOO follow the other trade-offs of implementing policy-gradient.
Auxiliary Topics🔗
In order to master the application of policy-gradient algorithms, there are countless other considerations. Here we consider some, but not all of these discussions.
Generalized Advantage Estimation (GAE)🔗
Generalized Advantage Estimation (GAE) is an alternate method to compute the advantage for policy gradient algorithms [3] that better balances the bias-variance tradeoff. Traditional single-step advantage estimates often suffer from high variance, while using complete trajectories can introduce too much bias. GAE works by combining two ideas – multi-step prediction and weighted running average (or just one of these).
Advantage estimates can take many forms, but we can define an $n$-step advantage estimator (related to the TD residual introduced at the beginning of the chapter) as:

$$\hat{A}_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V(s_{t+n}) - V(s_t).$$

Here a shorter $n$ (e.g. the single-step TD residual at $n=1$) has lower variance but higher bias because it relies heavily on the learned value function, while a longer $n$ (approaching the full Monte Carlo return) has higher variance but lower bias. To utilize this trade-off, we introduce another variable $\lambda \in [0,1]$ and define GAE as an exponentially weighted average of the $n$-step estimators, which can be written in terms of the TD residual $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$:

$$\hat{A}_t^{GAE(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}.$$

Intuitively, this averages multi-step estimates of the advantage in an elegant fashion: $\lambda = 0$ recovers the single-step TD residual, and $\lambda = 1$ recovers the Monte Carlo advantage estimate.
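A minimal sketch of the backward recursion commonly used to compute GAE (illustrative; variable names are assumptions):

```python
import torch

def compute_gae(rewards: torch.Tensor, values: torch.Tensor, last_value: float,
                gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """Compute GAE advantages for one trajectory of length T.

    rewards: per-step rewards, shape [T]
    values: value estimates V(s_t), shape [T]
    last_value: bootstrap value V(s_T) for the state after the last step
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    next_value = last_value
    running_advantage = 0.0
    # Work backwards through the trajectory: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        running_advantage = delta + gamma * lam * running_advantage
        advantages[t] = running_advantage
        next_value = values[t]
    return advantages
```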
For further reading, see [19].
Double Regularization🔗
Many popular policy gradient algorithms from Deep Reinforcement Learning originated due to the need to control the learning process of the agent. In RLHF, as discussed extensively in Chapter 8 on Regularization and in Chapter 4 on Problem Formulation, there is a built-in regularization term via the distance penalty relative to the original policy one is finetuning. In this view, a large part of the difference between algorithms like PPO (which has internal step-size regularization) and REINFORCE (which is simpler, and to which PPO reduces under certain hyperparameters) is far less meaningful for finetuning language models than for training agents from scratch.
In PPO, the objective that handles capping the step size of the update is known as the surrogate objective. To monitor how much the PPO regularization is impacting updates in RLHF, one can look at the clip fraction variable in many popular implementations, which is the percentage of samples in the batch where the gradients are clipped by this regularizer; for those samples, the update is capped at the clipping bound.
Further Reading🔗
As RLHF has cemented itself at the center of modern post-training, other policy-gradient RL algorithms and RL algorithms generally have been proposed to improve the training process, but they have not had a central role in governing best practices. Examples for further reading include:
- Pairwise Proximal Policy Optimization (P3O) [20] uses pairwise data directly in a PPO-style policy update without learning an intermediate reward model.
- Off-policy policy-gradient algorithms could enable further asynchronous training, such as Contrastive Policy Gradient (CoPG) [21] (a generalization of the direct alignment algorithm IPO and vanilla policy gradient), which was used by Cohere for their Command A model [22].
- Other implementations of REINFORCE algorithms have been designed for language models, such as ReMax [23], which implements a baseline normalization designed specifically to accommodate the sources of uncertainty from reward model inference.
- Some foundation models, such as Apple Intelligence Foundation Models [24] or Kimi k1.5 reasoning model [25], have used variants of Mirror Descent Policy Optimization (MDPO) [26]. Research is still developing further on the fundamentals here [27], but Mirror Descent is an optimization method rather than directly a policy gradient algorithm. What is important here is that it is substituted in very similarly to existing RL infrastructure.
- Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) proposes 4 modifications to GRPO to better suit reasoning language models, where long traces are needed and new, underutilized tokens need to be increased in probability [18]. The changes are: 1, two different clip hyperparameters, $\varepsilon_{low}$ and $\varepsilon_{high}$, so clipping on the positive side of the ratio can take bigger steps for better exploration (a sketch of this decoupled clipping follows below); 2, dynamic sampling, which removes prompts where every sample in the group has reward = 0 or every sample has reward = 1 (no learning signal); 3, the per-token loss as discussed above in the Loss Aggregation section; and 4, a soft penalty on samples that are too long, to avoid trying to learn from truncated answers.
- Value-based Augmented Proximal Policy Optimization (VAPO) [28] combines optimizations from DAPO (including clip-higher, token-level policy gradient, and different length normalization) with insights from Value-Calibrated PPO [29] for pretraining the value function and length-adaptive GAE, showing the promise of value-based methods relative to GRPO.
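As a sketch of the first DAPO change mentioned above (illustrative, not the authors' code; `eps_low` and `eps_high` are assumed names and the default values are only examples), decoupled clipping simply makes the clip range asymmetric:

```python
import torch

def clipped_pg_loss(advantages: torch.Tensor, ratio: torch.Tensor,
                    eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    """PPO-style per-token loss with a decoupled (asymmetric) clipping range.

    A larger eps_high lets tokens with positive advantage grow their probability
    ratio further before clipping, encouraging exploration of low-probability tokens.
    """
    pg_losses1 = -advantages * ratio
    pg_losses2 = -advantages * torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.max(pg_losses1, pg_losses2)
```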