Regularization
Throughout RLHF optimization, several regularization techniques are used to keep the policy from over-optimizing against the reward signal.
KL Distances
Reference Policy
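A common form of this regularizer subtracts an estimate of the KL divergence between the current policy and a frozen reference policy from the reward. The sketch below is illustrative only. The function name kl_penalized_rewards, the coefficient beta, and the choice to add the scalar reward at the final token are assumptions made for the example, and the per-token estimate log pi(a|s) - log pi_ref(a|s) is just one of several estimators in use.

```python
def kl_penalized_rewards(policy_logprobs, ref_logprobs, reward, beta=0.1):
    """Return per-token rewards with a KL penalty against a reference policy.

    policy_logprobs / ref_logprobs: log-probabilities of the sampled tokens
    under the current policy and the frozen reference policy (hypothetical
    inputs; in practice these come from two forward passes).
    reward: scalar score for the whole completion, e.g. from a reward model.
    beta: penalty strength (assumed value, tuned in practice).
    """
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("need two non-empty sequences of equal length")

    # Per-token KL estimate: log pi(a|s) - log pi_ref(a|s), scaled by -beta.
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]

    # Assumption: the sequence-level reward is credited to the last token.
    rewards[-1] += reward
    return rewards


example = kl_penalized_rewards([-1.0, -2.0], [-1.5, -2.0], reward=1.0, beta=0.1)
```

In this example the first token is more likely under the policy than under the reference, so it receives a small negative reward. The second token has identical log-probabilities under both and receives only the sequence-level reward.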
Reference Dataset
Likelihood Penalty
- https://arxiv.org/abs/2404.19733 on DPO loss
Reward Bonuses
- Nemotron
Margin Losses
- Llama 2
- REBEL
- Reward-aware Preference Optimization (Nemotron)
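As a rough illustration of the margin idea, a pairwise preference loss can subtract a margin from the score gap before the sigmoid, in the style of the Llama 2 reward model loss -log(sigmoid(r_chosen - r_rejected - m)). The function name margin_preference_loss and its scalar inputs are hypothetical, and the sketch does not attempt to reproduce the other methods listed above.

```python
import math


def margin_preference_loss(r_chosen, r_rejected, margin=0.0):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected - margin)).

    r_chosen / r_rejected: scalar scores for the preferred and rejected
    completions (hypothetical inputs).
    margin: required gap between the two scores; a larger margin asks the
    model to separate the pair more strongly.
    """
    z = r_chosen - r_rejected - margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


no_margin = margin_preference_loss(2.0, 1.0)
with_margin = margin_preference_loss(2.0, 1.0, margin=1.0)
```

With the same scores, the loss with a margin is higher than the loss without one, so the model is pushed to widen the gap on pairs where the preference is stronger.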