Reinforcement Learning from Human Feedback

A short introduction to RLHF and post-training focused on language models.

Nathan Lambert


Product, UX, and Model Character

The frontiers of RLHF and post-training show how these techniques are used within companies to build leading products. As RLHF becomes more established, the problems it is used to address are becoming more nuanced. In this chapter, we discuss a series of use-cases for which leading AI laboratories turn to RLHF and post-training, but which remain largely unstudied in the academic literature.

Character Training

Character training is the subset of post-training designed around crafting traits in the model’s manner of response, rather than in the content of the response. Despite being important to the user experience of language model chatbots, character training is effectively unstudied in the public domain.

We don’t know the trade-offs of character training, we don’t know exactly how to study it, we don’t know how much it can improve user preferences on ChatBotArena, and we should. What we do know is that character training uses the same methods discussed in this book, but toward much more precise goals for the features of the language used by the model. Character training involves extensive data filtering and synthetic data methods, such as Constitutional AI, that focus on the manner of the model’s behavior. These changes are often difficult to measure with the benchmark regimes we have mentioned in the chapter on Evaluation, because AI laboratories use character training to make small changes to the personality over time to improve user experiences.

For example, Anthropic added character training to its Claude 3 models [1]:

Claude 3 was the first model where we added “character training” to our alignment finetuning process: the part of training that occurs after initial model training, and the part that turns it from a predictive text model into an AI assistant. The goal of character training is to make Claude begin to have more nuanced, richer traits like curiosity, open-mindedness, and thoughtfulness.

In the following months, stronger model character emerged across the industry. The process is extremely heavy on synthetic data, but requires an artist’s touch, as stated later in the blog post: it “relies on human researchers closely checking how each trait changes the model’s behavior.”

Character training becoming a focus of development is the strongest endorsement that RLHF and related approaches have shifted from their philosophical motivations of alignment to being primarily an empirical tool. The models can capture many different behaviors, but getting them to reliably behave how we want is the hardest part. Right now, it seems more likely that character training is about capturing the upside of RLHF as a performance tool, rather than a safety one.

One of the few public discussions of character training came from Amanda Askell during her appearance on the Lex Fridman Podcast (taken from the transcript):

Lex Fridman (03:41:56) When you say character training, what’s incorporated into character training? Is that RLHF or what are we talking about?

Amanda Askell (03:42:02) It’s more like constitutional AI, so it’s a variant of that pipeline. I worked through constructing character traits that the model should have. They can be shorter traits or they can be richer descriptions. And then you get the model to generate queries that humans might give it that are relevant to that trait. Then it generates the responses and then it ranks the responses based on the character traits. In that way, after the generation of the queries, it’s very much similar to constitutional AI, it has some differences. I quite like it, because it’s like Claude’s training in its own character, because it doesn’t have any… It’s like constitutional AI, but it’s without any human data.

In summary, Anthropic uses the same techniques it uses for Constitutional AI and for general capabilities post-training to train its models’ character.
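To make this concrete, below is a minimal sketch of the kind of pipeline Askell describes: trait descriptions drive query generation, the model answers its own queries, and the model’s rankings of those answers against each trait become preference pairs with no human preference labels. The `generate` helper, prompts, trait strings, and function names are illustrative assumptions, not Anthropic’s implementation.

```python
# A minimal sketch of a character-training data pipeline in the style of
# Constitutional AI, as described by Askell. All prompts, trait strings, and
# function names are illustrative assumptions, not Anthropic's implementation.

from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def generate(prompt: str) -> str:
    """Placeholder for a call to a language model."""
    raise NotImplementedError


def character_preference_data(traits, queries_per_trait=4):
    pairs = []
    for trait in traits:
        for _ in range(queries_per_trait):
            # 1. The model writes a query a user might send that is relevant to the trait.
            query = generate(
                "Write a user message where an assistant's response would reveal "
                f"whether it has this trait: {trait}"
            )
            # 2. The model drafts candidate responses to its own query.
            candidates = [generate(query) for _ in range(2)]
            # 3. The model ranks the candidates against the trait (AI feedback,
            #    no human preference data), yielding chosen/rejected pairs.
            verdict = generate(
                f"Trait: {trait}\nQuery: {query}\n"
                f"Response A: {candidates[0]}\nResponse B: {candidates[1]}\n"
                "Which response better expresses the trait? Answer A or B."
            )
            if verdict.strip().upper().startswith("A"):
                chosen, rejected = candidates[0], candidates[1]
            else:
                chosen, rejected = candidates[1], candidates[0]
            pairs.append(PreferencePair(prompt=query, chosen=chosen, rejected=rejected))
    return pairs
```

The resulting pairs could then feed a standard preference-tuning stage, such as a reward model or DPO, exactly as with other RLHF data covered in this book.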

Model Specifications

OpenAI recently shared what they call their “Model Spec” [2], a document that details their intended model behaviors prior to clicking go on a fine-tuning run. It covers the models’ behavior now, how OpenAI steers their models from behind the API, and how their models will shift in the future.

Model Specs are one of the few tools in the industry and in RLHF where one can compare the actual behavior of a model to what its designers intended. As we have covered in this book, training models is a complicated and multi-faceted process, so it is expected that the final outcome differs from inputs such as the data labeler instructions or the balance of tasks in the training data. For example, a Model Spec is much more revealing than a list of principles used in Constitutional AI because it speaks to the intent of the process rather than listing what amounts to intermediate training variables.
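As a hypothetical illustration of that comparison, the sketch below encodes spec entries as data (the objectives/rules/defaults split follows OpenAI’s public description of the Model Spec) and checks sampled model responses against them with a judge model. The class, prompts, and `judge` helper are assumptions for illustration, not OpenAI’s tooling.

```python
# A hypothetical sketch of checking model behavior against spec entries.
# The objectives/rules/defaults split follows OpenAI's public Model Spec;
# the classes, prompts, and judge call are illustrative assumptions.

from dataclasses import dataclass
from typing import Literal


@dataclass
class SpecEntry:
    kind: Literal["objective", "rule", "default"]
    statement: str        # e.g. a rule like "Follow the chain of command"
    example_prompt: str   # a prompt that exercises this entry


def judge(prompt: str) -> str:
    """Placeholder for an LLM-as-a-judge call."""
    raise NotImplementedError


def complies(model_response: str, entry: SpecEntry) -> bool:
    """Ask a judge model whether a sampled response follows one spec entry."""
    verdict = judge(
        f"Spec statement: {entry.statement}\n"
        f"User prompt: {entry.example_prompt}\n"
        f"Model response: {model_response}\n"
        "Does the response follow the spec statement? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")
```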

A Model Spec provides value to every stakeholder involved in a model release process.

Product Cycles, UX, and RLHF

As powerful AI models become closer to products than to singular artifacts of an experimental machine learning process, RLHF has become an interface point for the relationship between models and product. Much more goes into making a model easy to use than just having the final model weights be correct – fast inference, suitable tools (e.g. search or code execution), a reliable and easy-to-understand user experience (UX), and more. RLHF research has become the interface where a lot of this is tested, both because RLHF is framed as a way to understand users’ preferences for products in real time and because it is the final training stage before release. The quickest way to add a new feature to a model is to try to incorporate it at post-training, where training is faster and cheaper. This cycle has been seen with image understanding, tool use, better behavior, and more. What starts as a product question quickly becomes an RLHF modeling question, and if it is successful there, it backpropagates to earlier training stages.

Bibliography

[1] Anthropic, “Claude’s character.” 2024. Available: https://www.anthropic.com/research/claude-character

[2] OpenAI, “Introducing the Model Spec.” May 2024. Available: https://openai.com/index/introducing-the-model-spec/