Reinforcement Learning from Human Feedback

A short introduction to RLHF and post-training focused on language models.

Nathan Lambert


Instruction Fine-tuning

Early large pretrained language models were trained with a next-token prediction objective and, by default, did not come with an explicit interface for following instructions. Around the release of GPT-3 [1], prompting and in-context learning became a widely used way to adapt a single model to many tasks (though task-specific fine-tuning remained common), by showing examples in-context and asking the model to complete a similar task. A practical next step was instruction fine-tuning, which teaches the model to respond in an instruction-response format rather than just continuing text.

Instruction fine-tuning took off when two lines of work converged. First, NLP shifted from bespoke, per-task fine-tuning setups to a unified “text-to-text” or instruction framing, which made it straightforward to standardize diverse datasets and train a single model across many tasks. Prominent examples of unifying the framework for tasks include Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5 models) [2], Finetuned Language Models Are Zero-Shot Learners (FLAN dataset) [3], Multitask Prompted Training Enables Zero-Shot Task Generalization (T0 models) [4], and Cross-Task Generalization via Natural Language Crowdsourcing Instructions (Natural Instructions dataset) [5]. Second, scaling pretrained LMs and the rise of prompting and in-context learning showed that a single model could generalize across tasks, but that generalization becomes far more reliable when the model is explicitly trained on instruction-response examples. Together, these trends led to an era of fine-tuning pretrained language models on large collections of instructions, what is now commonly called instruction fine-tuning (IFT) or supervised fine-tuning (SFT), in which training general models became accessible to wider audiences.

Since its emergence, instruction fine-tuning, colloquially called just instruction tuning, has matured and is standard practice across many language modeling pipelines. At its core, IFT is the simplest method for adapting language models to a desired task distribution. It serves as the foundation for RLHF by preparing the model for the question-answering format of instruction data, and it is the first tool used by those attempting to apply modern techniques to new domains. Without a basic level of instruction-following ability, most of the pipelines we discuss in this book, from preference data collection to online RLHF optimization, cannot be performed.

Chat templates and the structure of instructions

The beginning of the post-training process is defining a pattern to format user queries so that they are easily readable by a language model that processes information through a tokenizer. When using a pretrained language model, prompting is quite simple: the model only knows a few special tokens, a beginning-of-sequence token (e.g., <bos_token>), an end-of-sequence token (e.g., <eos_token>), and a padding token (used to fill out shorter sequences when training on batches). This means that, to prompt a base model, the user inputs a sequence of tokens for the model to continue from, such as:

<bos_token> The capital of the United States is

Then, the model would generate tokens until it runs out of its context window, or it generates the end-of-sequence token.
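To make this concrete, here is a minimal sketch of prompting a base model, assuming the Hugging Face transformers library; the model name is a placeholder standing in for any pretrained-only checkpoint.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder name: substitute any pretrained-only (base) checkpoint.
tokenizer = AutoTokenizer.from_pretrained("my-base-model")
model = AutoModelForCausalLM.from_pretrained("my-base-model")

# For most base models, the tokenizer prepends the beginning-of-sequence
# token automatically, so the prompt is just raw text to continue from.
inputs = tokenizer("The capital of the United States is", return_tensors="pt")

# Generation continues until max_new_tokens is reached or the model emits
# its end-of-sequence token.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0]))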

All post-training stages, from instruction tuning to RLHF and other methods, rely on this formatting to train the model. The tool that handles the structure of the interaction with the user is called the chat template.

An example which we will break down is below:

{% if messages[0]['role'] == 'system' %}
    {# If the conversation begins with a system message, treat it as a special first turn.
       We set an offset so the user/assistant alternation check lines up correctly. #}
    {% set offset = 1 %}
{% else %}
    {# No system message: user should be the first non-empty turn. #}
    {% set offset = 0 %}
{% endif %}

{# Emit the beginning-of-sequence token (model-specific). #}
{{ bos_token }}

{# Serialize each message into the model's chat-markup tokens. #}
{% for message in messages %}
    {# Enforce role alternation: (system), user, assistant, user, assistant, ...
       The boolean expression compares "is this a user message?" against whether the
       current index (plus offset) is expected to be user or assistant. #}
    {% if (message['role'] == 'user') != (loop.index0 % 2 == offset) %}
        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
    {% endif %}

    {# Wrap each message with special tokens:
       - <|im_start|><role>\n
       - message content (trimmed)
       - <|im_end|>\n
       This produces a single flat token sequence the LM can train on. #}
    {{ '<|im_start|>' + message['role'] + '\n' + message['content'] | trim + '<|im_end|>\n' }}
{% endfor %}

{# Optionally append an "assistant" start tag with no content.
   This cues generation to continue from the assistant role. #}
{% if add_generation_prompt %}
    {{ '<|im_start|>assistant\n' }}
{% endif %}

This is the raw Jinja code for transforming a Python list of message dictionaries, each containing a role and its content, into tokens that a language model can predict from.

All information passed into models is assigned a role. The traditional three roles are system, user, and assistant.

The system tag is only used for the first message of the conversation; it holds instructions for the agent that do not come from the user and are typically not exposed to them. These system prompts are used to provide additional context to the models, such as the date and time, or to patch behaviors. As a fun example, models can be told things such as “You are a friendly chatbot who always responds in the style of a pirate.”

Next, the two other roles are straightforward: user holds the messages from the person using the AI, and assistant holds the responses from the model (that is engaging as an AI assistant).
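In code, a conversation with these roles is simply a Python list of dictionaries, one per message. For the pirate example used in this chapter, it looks like this:

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]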

In order to translate all this information into tokens, we use the code listing above that we started with. The model has a series of special tokens that separate the various messages from each other. If we run the above code with the example query “How many helicopters can a human eat in one sitting?”, the token sequence passed into the model would look as follows:

<|im_start|>system
You are a friendly chatbot who always responds in the style of a pirate<|im_end|>
<|im_start|>user
How many helicopters can a human eat in one sitting?<|im_end|>
<|im_start|>assistant

Notice how the final tokens in the sequence are <|im_start|>assistant. This is how the model knows to continue generating tokens until it finally generates its end-of-sequence token, which in this case is <|im_end|>.

By packing all question-answer pair data (and downstream preference-tuning data) into this format, modern language models learn to follow it with near-perfect consistency. This format is the language used to exchange information between users and the instruction-tuned models stored on GPUs or other computing devices.

The behavior extends naturally to multiple turns, such as shown below:

<|im_start|>system
You are a friendly chatbot who always responds in the style of a pirate<|im_end|>
<|im_start|>user
How many helicopters can a human eat in one sitting?<|im_end|>
<|im_start|>assistant
Oh just 6.<|im_end|>
<|im_start|>user
Are you sure about that?<|im_end|>
<|im_start|>assistant

In the open ecosystem, the standard method for applying the chat template to a list of messages is a piece of Jinja code saved with the tokenizer and invoked through the tokenizer's apply_chat_template method.
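As a brief sketch of how this is used in practice (assuming the Hugging Face transformers API, with a placeholder model name standing in for any instruction-tuned model whose template matches the ChatML-style listing above):

from transformers import AutoTokenizer

# Placeholder name: substitute an instruction-tuned model of your choice.
tokenizer = AutoTokenizer.from_pretrained("my-instruct-model")

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

# tokenize=False returns the formatted string (like the examples above)
# instead of token ids; add_generation_prompt=True appends the assistant
# header so the model knows to begin its reply.
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(formatted)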

The above chat template is a derivative of OpenAI’s Chat Markup Language (ChatML), which was an early attempt to standardize message formatting. Now, OpenAI and other model providers use a hierarchical system where the user can configure a system message, yet there are higher-level instructions that may or may not be revealed to the user [6].

Many other chat templates exist. Examples include Zephyr’s [7]:

<|system|>
You are a friendly chatbot who always responds in the style of a pirate</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>

Or Tülu’s:

<|user|>
How are you doing?
<|assistant|>
I'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>

Beyond this, many chat templates include formatting and other tokens for tasks such as tool-use.

Best practices of instruction tuning

Instruction tuning is well established as the foundation of post-training and of creating helpful language models. There are many ways to achieve successful instruction tuning. For example, efficient fine-tuning with quantization of some model parameters makes training very accessible [8]. Also, in narrow domains such as chat alignment, i.e., without harder skills such as math or code, small, focused datasets can achieve strong performance [9].

Soon after the release of ChatGPT, human-written datasets with as few as 10K samples, such as No Robots, were state-of-the-art [10]. Years later, large-scale synthetic datasets work best on most tasks [11].

A few principles remain:

Implementation

While the loss function is the same as in pretraining, a few key implementation details differ from the pretraining setting. Many practices, such as deciding on the types of parallelism used to shard models across many GPUs, are the same as in pretraining; just the total number of machines used is often lower (for the first technical change listed below):
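As a minimal sketch of the objective itself (assuming PyTorch, the Hugging Face transformers API, a placeholder model name, and a tokenizer that already has a chat template configured): the loss is the same autoregressive cross-entropy as pretraining, computed over the chat-formatted sequence. Masking the prompt tokens so that loss is only taken on the assistant's response is a common, but not universal, implementation choice and is marked as optional below.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder name: substitute the model being instruction tuned.
tokenizer = AutoTokenizer.from_pretrained("my-base-model")
model = AutoModelForCausalLM.from_pretrained("my-base-model")

messages = [
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
    {"role": "assistant", "content": "None, helicopters are not edible."},
]

# Render the full conversation (prompt + target response) with the chat template.
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")

# Standard next-token prediction: labels are the input ids, shifted inside the model.
labels = input_ids.clone()

# Optional prompt masking: set label positions before the assistant turn to -100
# (ignored by the loss) so gradients only come from response tokens. This assumes
# the rendered prompt is a prefix of the full rendered conversation.
prompt_ids = tokenizer.apply_chat_template(
    messages[:-1], add_generation_prompt=True, return_tensors="pt"
)
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()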

Bibliography

[1]
T. Brown et al., “Language models are few-shot learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[2]
C. Raffel et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
[3]
J. Wei et al., “Finetuned language models are zero-shot learners,” in International Conference on Learning Representations, 2022. Available: https://openreview.net/forum?id=gEZrGCozdqR
[4]
V. Sanh et al., “Multitask prompted training enables zero-shot task generalization,” in International Conference on Learning Representations, 2022. Available: https://openreview.net/forum?id=9Vrb9D0WI4
[5]
S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi, “Cross-task generalization via natural language crowdsourcing instructions,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, May 2022, pp. 3470–3487. doi: 10.18653/v1/2022.acl-long.244.
[6]
E. Wallace, K. Xiao, R. Leike, L. Weng, J. Heidecke, and A. Beutel, “The instruction hierarchy: Training LLMs to prioritize privileged instructions,” arXiv preprint arXiv:2404.13208, 2024.
[7]
L. Tunstall et al., “Zephyr: Direct distillation of LM alignment,” in First Conference on Language Modeling, 2024. Available: https://openreview.net/forum?id=aKkAwZB6JV
[8]
T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient finetuning of quantized LLMs,” Advances in Neural Information Processing Systems, vol. 36, pp. 10088–10115, 2023.
[9]
C. Zhou et al., “LIMA: Less is more for alignment,” Advances in Neural Information Processing Systems, vol. 36, pp. 55006–55021, 2023.
[10]
N. Rajani, L. Tunstall, E. Beeching, N. Lambert, A. M. Rush, and T. Wolf, “No Robots,” Hugging Face repository, Hugging Face, 2023. Available: https://huggingface.co/datasets/HuggingFaceH4/no_robots
[11]
N. Lambert et al., “Tülu 3: Pushing frontiers in open language model post-training,” arXiv preprint arXiv:2411.15124, 2024.
[12]
Team OLMo et al., “2 OLMo 2 Furious,” arXiv preprint arXiv:2501.00656, 2024.