Key Related Works
In this chapter we detail the key papers and projects that got the RLHF field to where it is today. This is not intended to be a comprehensive review of RLHF and related fields, but rather a starting point and a retelling of how we got to today. It is intentionally focused on the recent work that led to ChatGPT. There is substantial further work in the RL literature on learning from preferences [1]. For a more exhaustive list, consult a proper survey paper [2],[3].
Origins to 2018: RL on Preferences
The field was recently popularized with the growth of Deep Reinforcement Learning and has since grown into a broader study of how LLMs are applied, driven by many large technology companies. Still, many of the techniques used today are deeply related to core techniques from the early literature on RL from preferences.
TAMER (Training an Agent Manually via Evaluative Reinforcement) proposed a learned agent in which humans iteratively scored the actions taken in order to learn a reward model [4]. Concurrent and slightly later work proposed COACH, an actor-critic algorithm in which human feedback (both positive and negative) is used to tune the advantage function [5].
The primary reference, Christiano et al. 2017, applies RLHF to preferences between Atari trajectories [6]. The work shows that humans choosing between trajectories can be more effective in some domains than directly interacting with the environment. The approach relies on some clever experimental conditions, but is impressive nonetheless. This method was expanded upon with more direct reward modeling [7]. TAMER was adapted to deep learning with Deep TAMER just one year later [8].
This era began to transition as reward models were proposed as a general method for studying alignment, rather than just as a tool for solving RL problems [9].
2019 to 2022: RL from Human Preferences on Language Models
Reinforcement learning from human feedback, also referred to in its early days as reinforcement learning from human preferences, was quickly adopted by AI labs increasingly turning to scaling large language models. A large portion of this work began between GPT-2, in 2018, and GPT-3, in 2020. The earliest work in 2019, Fine-Tuning Language Models from Human Preferences, has many striking similarities to modern work on RLHF [10]: learned reward models, KL distance penalties, feedback diagrams, and so on; only the evaluation tasks and model capabilities were different. From here, RLHF was applied to a variety of tasks. The popular applications were the ones that worked at the time. Important examples include general summarization [11], recursive summarization of books [12], instruction following (InstructGPT) [13], browser-assisted question-answering (WebGPT) [14], supporting answers with citations (GopherCite) [15], and general dialogue (Sparrow) [16].
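To make the similarity concrete, the pieces these early works shared with modern RLHF can be sketched with two now-standard objectives (written here in common modern notation, which may differ in details from the 2019 paper's exact formulation): a pairwise loss for training the reward model on preference data, and a KL-regularized objective for optimizing the policy against that reward model.

$$
\mathcal{L}_{\mathrm{RM}}(\theta) = -\mathbb{E}_{(x,\, y_c,\, y_r)}\big[\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)\big]
$$

$$
\max_{\pi}\; \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}\big[r_\theta(x, y)\big] \;-\; \beta\, \mathrm{D}_{\mathrm{KL}}\big[\pi(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]
$$

Here $y_c$ and $y_r$ are the chosen and rejected completions for a prompt $x$, $\sigma$ is the sigmoid function, $\pi_{\mathrm{ref}}$ is the reference (initial) policy, and $\beta$ controls the strength of the KL penalty.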
Aside from applications, a number of seminal papers defined key areas for the future of RLHF, including those on:
- Reward model over-optimization [17]: the ability of RL optimizers to overfit to reward models trained on preference data,
- Language models as a general area of study for alignment [18], and
- Red teaming [19]: the process of assessing the safety of a language model.
Work continued on refining RLHF for application to chat models: Anthropic used it extensively for early versions of Claude [20], and early open-source RLHF tools emerged [21],[22],[23].
2023 to Present: ChatGPT Era
Since OpenAI launched ChatGPT [24], RLHF has been used extensively in leading language models. It is well known to be used in Anthropic’s Constitutional AI for Claude [25], Meta’s Llama 2 [26] and Llama 3 [27], Nvidia’s Nemotron [28], and more.
Today, RLHF is growing into a broader field of preference fine-tuning (PreFT), including new applications such as process rewards for intermediate reasoning steps [29], direct alignment algorithms inspired by Direct Preference Optimization (DPO) [30], learning from execution feedback on code or math [31],[32], and other online reasoning methods inspired by OpenAI's o1 [33].