The Basics of Reinforcement Learning from Human Feedback

Chapter Contents

[Incomplete] Policy Gradient Algorithms

The algorithms that popularized RLHF for language models were policy-gradient reinforcement learning algorithms. These algorithms, such as PPO and Reinforce, are on-policy methods: they update the model with recently generated samples rather than reusing older experience stored in a replay buffer. In this section we will cover the fundamentals of policy gradient algorithms and how they are used in the modern RLHF framework.

For definitions of symbols, see the problem setup chapter.

Policy Gradient Algorithms

The core of policy gradient algorithms is estimating the gradient of the finite-horizon expected return, \(J\), with respect to the parameters of the current policy. With an estimate of this gradient, the parameters are updated by gradient ascent, where \(\alpha\) is the learning rate:

\[\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\]
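
In practice this update is implemented with automatic differentiation: the objective is negated so that a standard optimizer, which minimizes, performs gradient ascent on \(J\). A minimal, self-contained sketch on a toy one-step problem (the softmax policy and fixed rewards below are purely illustrative):

```python
import torch

# Toy sketch of the update rule above: theta parameterizes a softmax policy
# over four actions in a one-step bandit, and J(theta) is the expected reward.
theta = torch.zeros(4, requires_grad=True)
alpha = 0.1
optimizer = torch.optim.SGD([theta], lr=alpha)

rewards = torch.tensor([1.0, 0.0, 0.5, 0.2])         # reward for each action
J = (torch.softmax(theta, dim=-1) * rewards).sum()   # expected return J(theta)

optimizer.zero_grad()
(-J).backward()   # optimizers minimize, so negate J to perform gradient ascent
optimizer.step()  # theta <- theta + alpha * grad_theta J(theta) for plain SGD
```

For plain SGD this step is exactly the update above; adaptive optimizers such as Adam rescale the gradient but follow the same pattern.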

Vanilla Policy Gradient

The vanilla policy gradient estimates this gradient as the following expectation over sampled trajectories \(\tau\):

\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_\tau \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) A^{\pi_\theta}(s_t, a_t) \right]\]
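
With automatic differentiation, this estimator is usually implemented as a surrogate loss: the negative log-probabilities of the sampled actions, weighted by their (detached) advantage estimates. A minimal sketch, assuming the per-step log-probabilities and advantages have already been computed for a batch of sampled trajectories:

```python
import torch

def vanilla_pg_loss(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient matches the vanilla policy gradient.

    logprobs:   (batch, T) log pi_theta(a_t | s_t) for the sampled actions.
    advantages: (batch, T) advantage estimates A(s_t, a_t).
    """
    # Advantages act as constant weights; only the log-probs carry gradients.
    return -(logprobs * advantages.detach()).sum(dim=-1).mean()
```

Detaching the advantages matters when they come from a learned value function, so that the policy loss does not backpropagate into the critic.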

Reinforce

Reinforce is a specific implementation of the vanilla policy gradient that uses a Monte Carlo estimate of the return, rather than a learned value function, as the weight on the log-probabilities when estimating the gradient. [1]
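
A minimal sketch of that Monte Carlo estimate: discounted returns-to-go computed from sampled per-step rewards, which can stand in for the advantage in the surrogate loss above (subtracting a baseline to reduce variance is omitted here):

```python
import torch

def returns_to_go(rewards: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Monte Carlo return G_t = sum_{k >= t} gamma^(k - t) * r_k per time step.

    rewards: (batch, T) per-step rewards from sampled trajectories.
    """
    returns = torch.zeros_like(rewards)
    running = torch.zeros_like(rewards[:, 0])
    for t in reversed(range(rewards.shape[1])):
        running = rewards[:, t] + gamma * running
        returns[:, t] = running
    return returns

# Reinforce-style loss, reusing the surrogate loss sketched above:
# loss = vanilla_pg_loss(logprobs, returns_to_go(rewards))
```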

Proximal Policy Optimization

Computing Policy Gradients with a Language Model
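
In the language model setting, each action \(a_t\) is a generated token and the state \(s_t\) is the prompt plus the tokens generated so far, so the \(\log \pi_\theta(a_t|s_t)\) terms above are per-token log-probabilities read off the policy's logits over the sampled completion. A minimal sketch, assuming a Hugging Face-style causal language model (the function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def completion_logprobs(model, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Per-token log pi_theta(a_t | s_t) for the completion under the policy.

    input_ids:  (batch, seq_len) prompt followed by completion token ids.
    prompt_len: number of prompt tokens (assumed equal across the batch).
    """
    logits = model(input_ids).logits                   # (batch, seq_len, vocab)
    # Logits at position t score the token at position t + 1, so shift by one.
    logprobs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)           # tokens being predicted
    token_logprobs = torch.gather(logprobs, dim=-1, index=targets).squeeze(-1)
    # Keep only the completion tokens; the prompt is context, not actions.
    return token_logprobs[:, prompt_len - 1:]
```

Weighting these per-token log-probabilities by advantages (or a sequence-level reward) and summing recovers the estimator above, with the completion tokens playing the role of actions.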

Implementation Tricks

TODO. Cite: https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#

https://lilianweng.github.io/posts/2018-04-08-policy-gradient/

KL Controllers

TODO: adaptive vs static KL control
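
As a rough sketch of the distinction (one common formulation; the names and constants here are illustrative): a static controller keeps the KL penalty coefficient fixed for the whole run, while an adaptive controller moves the coefficient toward a target KL, growing it when the measured KL is too large and shrinking it when the policy stays close to the reference model.

```python
class FixedKLController:
    """Static KL control: the penalty coefficient never changes."""

    def __init__(self, coef: float):
        self.coef = coef

    def update(self, current_kl: float, n_steps: int) -> None:
        pass  # coefficient stays fixed


class AdaptiveKLController:
    """Adaptive KL control: proportionally adjust the coefficient toward a target KL."""

    def __init__(self, init_coef: float, target_kl: float, horizon: int):
        self.coef = init_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, current_kl: float, n_steps: int) -> None:
        # Clipped proportional error keeps the coefficient from changing too fast.
        error = max(-0.2, min(0.2, current_kl / self.target_kl - 1.0))
        self.coef *= 1.0 + error * n_steps / self.horizon
```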

See Table 10 for implementation details in the Tulu 2.5 paper.

Bibliography

[1] A. Ahmadian et al., “Back to basics: Revisiting REINFORCE style optimization for learning from human feedback in LLMs,” arXiv preprint arXiv:2402.14740, 2024.