Reinforcement Learning from Human Feedback Basics

Regularization

Throughout RLHF optimization, many regularization steps are used to prevent over-optimization of the reward model.

KL Distances

Reference Policy
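
A common way to regularize against a reference policy is to subtract a KL penalty from the reward. The snippet below is a minimal sketch, not a definitive implementation: the function name `kl_penalized_reward`, the default `beta`, and the use of a simple summed log-ratio as the KL estimate are all assumptions made for illustration.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Return reward - beta * KL_estimate for one sampled completion.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the
    sampled completion under the current policy and the frozen reference
    policy (hypothetical inputs for this sketch).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-probability sequences must have equal length")
    # Single-sample estimate of KL(policy || reference): the summed
    # per-token log-ratio on tokens sampled from the policy.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate
```

A larger `beta` keeps the policy closer to the reference at the cost of slower reward improvement.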

Reference Dataset

Likelihood Penalty

See https://arxiv.org/abs/2404.19733, which adds a likelihood penalty to the DPO loss.
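
To make the idea concrete, here is a minimal sketch of a DPO loss with an added negative log-likelihood term on the chosen completion. The weighting `alpha`, the length normalization, and all function and argument names are assumptions for illustration and may differ from the formulation in the linked paper.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_with_nll(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                 chosen_len, beta=0.1, alpha=1.0):
    """DPO loss plus a likelihood penalty for a single preference pair.

    pi_* / ref_*: summed sequence log-probabilities under the policy and
    the reference model. chosen_len: token count of the chosen completion.
    """
    # Standard DPO term: scaled difference of policy/reference log-ratios.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    dpo_loss = -log_sigmoid(margin)
    # Likelihood penalty: length-normalized NLL of the chosen completion,
    # which discourages the chosen log-probability from drifting down.
    nll_loss = -pi_chosen / chosen_len
    return dpo_loss + alpha * nll_loss
```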

Reward Bonuses

Nemotron

Margin Losses

Llama 2
REBEL
Reward Preference Optimization (Nemotron)
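
As an illustration of a margin loss, the sketch below follows the Llama 2 style reward-model objective, where a margin is subtracted inside the sigmoid of a pairwise preference loss. The function names and the example margin values are assumptions; how the margin is derived from preference strength is left to the caller.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def margin_preference_loss(r_chosen, r_rejected, margin=0.0):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected - margin).

    r_chosen / r_rejected: scalar reward-model scores for the two
    completions. margin: non-negative value, larger for pairs where the
    preference is stronger (hypothetical input for this sketch).
    """
    return -log_sigmoid(r_chosen - r_rejected - margin)
```

With `margin=0.0` this reduces to the standard pairwise loss; a positive margin requires the chosen score to exceed the rejected score by a wider gap before the loss becomes small.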