The Basics of Reinforcement Learning from Human Feedback

Key Related Works

In this chapter we detail the key papers and projects that got the RLHF field to where it is today. This is not intended to be a comprehensive review of RLHF and related fields, but rather a starting point and a retelling of how we got here. It is intentionally focused on the recent work that led to ChatGPT. There is substantial further work in the RL literature on learning from preferences [1]. For a more exhaustive treatment, you should consult a proper survey paper [2],[3].

Origins to 2018: RL on Preferences

The field was recently popularized with the growth of deep reinforcement learning and has since grown into a broader study of the application of LLMs by many large technology companies. Still, many of the techniques used today are deeply related to core techniques from the early literature on RL from preferences.

TAMER (Training an Agent Manually via Evaluative Reinforcement) proposed a learned agent in which humans iteratively provide scores on the actions taken, which are used to learn a reward model [4]. Concurrent and subsequent work proposed an actor-critic algorithm, COACH, where human feedback (both positive and negative) is used to tune the advantage function [5].

The primary reference, Christiano et al. 2017, is an application of RLHF to preferences between Atari trajectories [6]. The work shows that having humans choose between trajectories can be more effective in some domains than directly interacting with the environment. This relies on some clever experimental conditions, but is impressive nonetheless. The method was later expanded upon with more direct reward modeling [7]. TAMER was adapted to deep learning with Deep TAMER just one year later [8].

This era began to transition when reward models, as a general notion, were proposed as a method for studying alignment rather than just as a tool for solving RL problems [9].

2019 to 2022: RL from Human Preferences on Language Models

Reinforcement learning from human feedback, also referred to regularly as reinforcement learning from human preferences in its early days, was quickly adopted by AI labs increasingly turning to scaling large language models. A large portion of this work began between GPT-2, in 2019, and GPT-3, in 2020. The earliest work in this line, the 2019 paper Fine-Tuning Language Models from Human Preferences, has many striking similarities to modern work on RLHF [10]. Learning reward models, KL distances, feedback diagrams, etc. were all there; only the evaluation tasks and capabilities were different. From here, RLHF was applied to a variety of tasks. The popular applications were the ones that worked at the time. Important examples include general summarization [11], recursive summarization of books [12], instruction following (InstructGPT) [13], browser-assisted question-answering (WebGPT) [14], supporting answers with citations (GopherCite) [15], and general dialogue (Sparrow) [16].
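To make the comparison concrete, here is a minimal sketch of the setup already present in that 2019 work; the notation is ours, not the paper's. A reward model r_θ is trained on human preference comparisons, and the policy π is then optimized against that reward with a penalty on the KL distance to the initial model π_ref:

$$
R(x, y) = r_\theta(x, y) - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big),
$$

where β controls how far the fine-tuned policy may drift from its starting point. The same structure recurs, with different tasks and model scales, across the applications listed above.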

Aside from applications, a number of seminal papers defined key areas for the future of RLHF, including those on:

  1. Reward model over-optimization [17]: the tendency of RL optimizers to over-fit to reward models trained on preference data,
  2. Language models as a general area of study for alignment [18], and
  3. Red teaming [19]: the process of assessing the safety of a language model.

Work continued on refining RLHF for application to chat models. Anthropic continued to use it extensively for early versions of Claude [20], and early open-source RLHF tools emerged [21],[22],[23].

2023 to Present: ChatGPT Era

Since OpenAI launched ChatGPT [24], RLHF has been used extensively in leading language models. It is well known to be used in Anthropic’s Constitutional AI for Claude [25], Meta’s Llama 2 [26] and Llama 3 [27], Nvidia’s Nemotron [28], and more.

Today, RLHF is growing into a broader field of preference fine-tuning (PreFT), including new applications such as process rewards for intermediate reasoning steps [29], direct alignment algorithms inspired by Direct Preference Optimization (DPO) [30], learning from execution feedback on code or math [31],[32], and other online reasoning methods inspired by OpenAI’s o1 [33].

Bibliography

[1]
C. Wirth, R. Akrour, G. Neumann, and J. Fürnkranz, “A survey of preference-based reinforcement learning methods,” Journal of Machine Learning Research, vol. 18, no. 136, pp. 1–46, 2017.
[2]
T. Kaufmann, P. Weng, V. Bengs, and E. Hüllermeier, “A survey of reinforcement learning from human feedback,” arXiv preprint arXiv:2312.14925, 2023.
[3]
S. Casper et al., “Open problems and fundamental limitations of reinforcement learning from human feedback,” arXiv preprint arXiv:2307.15217, 2023.
[4]
W. B. Knox and P. Stone, “TAMER: Training an agent manually via evaluative reinforcement,” in 2008 7th IEEE international conference on development and learning, IEEE, 2008, pp. 292–297.
[5]
J. MacGlashan et al., “Interactive learning from policy-dependent human feedback,” in International conference on machine learning, PMLR, 2017, pp. 2285–2294.
[6]
P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Advances in neural information processing systems, vol. 30, 2017.
[7]
B. Ibarz, J. Leike, T. Pohlen, G. Irving, S. Legg, and D. Amodei, “Reward learning from human preferences and demonstrations in atari,” Advances in neural information processing systems, vol. 31, 2018.
[8]
G. Warnell, N. Waytowich, V. Lawhern, and P. Stone, “Deep TAMER: Interactive agent shaping in high-dimensional state spaces,” in Proceedings of the AAAI conference on artificial intelligence, 2018.
[9]
J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini, and S. Legg, “Scalable agent alignment via reward modeling: A research direction,” arXiv preprint arXiv:1811.07871, 2018.
[10]
D. M. Ziegler et al., “Fine-tuning language models from human preferences,” arXiv preprint arXiv:1909.08593, 2019.
[11]
N. Stiennon et al., “Learning to summarize with human feedback,” Advances in Neural Information Processing Systems, vol. 33, pp. 3008–3021, 2020.
[12]
J. Wu et al., “Recursively summarizing books with human feedback,” arXiv preprint arXiv:2109.10862, 2021.
[13]
L. Ouyang et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022.
[14]
R. Nakano et al., “WebGPT: Browser-assisted question-answering with human feedback,” arXiv preprint arXiv:2112.09332, 2021.
[15]
J. Menick et al., “Teaching language models to support answers with verified quotes,” arXiv preprint arXiv:2203.11147, 2022.
[16]
A. Glaese et al., “Improving alignment of dialogue agents via targeted human judgements,” arXiv preprint arXiv:2209.14375, 2022.
[17]
L. Gao, J. Schulman, and J. Hilton, “Scaling laws for reward model overoptimization,” in International conference on machine learning, PMLR, 2023, pp. 10835–10866.
[18]
A. Askell et al., “A general language assistant as a laboratory for alignment,” arXiv preprint arXiv:2112.00861, 2021.
[19]
D. Ganguli et al., “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,” arXiv preprint arXiv:2209.07858, 2022.
[20]
Y. Bai et al., “Training a helpful and harmless assistant with reinforcement learning from human feedback,” arXiv preprint arXiv:2204.05862, 2022.
[21]
R. Ramamurthy et al., “Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization,” arXiv preprint arXiv:2210.01241, 2022.
[22]
A. Havrilla et al., “trlX: A framework for large scale reinforcement learning from human feedback,” in Proceedings of the 2023 conference on empirical methods in natural language processing, Singapore: Association for Computational Linguistics, Dec. 2023, pp. 8578–8595. doi: 10.18653/v1/2023.emnlp-main.530.
[23]
L. von Werra et al., “TRL: Transformer reinforcement learning,” GitHub repository, https://github.com/huggingface/trl, 2020.
[24]
OpenAI, “ChatGPT: Optimizing language models for dialogue.” https://openai.com/blog/chatgpt/, 2022.
[25]
Y. Bai et al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, 2022.
[26]
H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
[27]
A. Dubey et al., “The Llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
[28]
B. Adler et al., “Nemotron-4 340B technical report,” arXiv preprint arXiv:2406.11704, 2024.
[29]
H. Lightman et al., “Let’s verify step by step,” arXiv preprint arXiv:2305.20050, 2023.
[30]
R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[31]
A. Kumar et al., “Training language models to self-correct via reinforcement learning,” arXiv preprint arXiv:2409.12917, 2024.
[32]
A. Singh et al., “Beyond human data: Scaling self-training for problem-solving with language models,” arXiv preprint arXiv:2312.06585, 2023.
[33]
OpenAI, “Introducing OpenAI o1-preview.” Sep. 2024. Available: https://openai.com/index/introducing-openai-o1-preview/