Nathan Lambert
Course on RLHF and post-training. Chapters 4, 5 & 9
These chapters are the simpler tools for getting started in RLHF & post-training. The core lectures will be on RL, but we need to build foundations first.
Plus, these chapters form a natural pipeline — the simplest complete path from a pretrained model to a preference-tuned one (with a reward model). A natural step before full RLHF:
Instruction fine-tuning (IFT), also called supervised fine-tuning (SFT), is the first post-training step.

Instruction tuning emerged from two parallel research threads:
SFT took off after ChatGPT: Alpaca (Taori et al., 2023), OpenAssistant (Köpf et al., 2023), Tulu (Wang et al., 2023), and many others helped turn instruction tuning into a broadly reproducible open recipe.
The model needs a structured format to manage who is speaking and what to generate. Early chat templates defined three roles:
<|im_start|>system
You are a friendly chatbot who always responds in the style of a pirate<|im_end|>
<|im_start|>user
How many helicopters can a human eat in one sitting?<|im_end|>
<|im_start|>assistant
The model generates until it produces an end-of-text token (in this case, it is <|im_end|>).
Conversation object:
How most post-training data is stored (model-agnostic):
messages = [
{
"role": "system",
"content": "You are a friendly chatbot who always responds in the style of a pirate",
},
{
"role": "user",
"content": "How many helicopters can a human eat in one sitting?",
},
]
Rendered chat template text:
What the data looks like to a model (tokenizer-specific):
<|im_start|>system
You are a friendly chatbot who always responds in the style of a pirate<|im_end|>
<|im_start|>user
How many helicopters can a human eat in one sitting?<|im_end|>
<|im_start|>assistant
tokenizer.apply_chat_template(messages) performs this conversion before tokenization.
What the model sees:
<|im_start|>system
You are a friendly chatbot who always
responds in the style of a pirate<|im_end|>
<|im_start|>user
How many helicopters can a human eat
in one sitting?<|im_end|>
<|im_start|>assistant
What the user sees:
You are a friendly chatbot who always responds in the style of a pirate
How many helicopters can a human eat in one sitting?
...
Chat templates are implemented as Jinja code snippets stored in the tokenizer config. This is the raw code that converts a list of Python dicts into the token sequence the model sees:
{{ bos_token }}
{% for message in messages %}
{{ '<|im_start|>' + message['role'] + '\n'
+ message['content'] | trim + '<|im_end|>\n' }}
{% endfor %}
{% if add_generation_prompt %}
{{ '<|im_start|>assistant\n' }}
{% endif %}
The full template also enforces role alternation (user/assistant/user/…) and handles the optional system message.
Applied in code via tokenizer.apply_chat_template(messages).
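As a rough sketch of what that call does, here is a pure-Python stand-in for the Jinja template above (render_chat is a hypothetical helper, and bos_token is assumed empty for simplicity):

```python
def render_chat(messages, add_generation_prompt=True, bos_token=""):
    # Mirrors the Jinja template: wrap each message in
    # <|im_start|>{role}\n ... <|im_end|> markers.
    out = bos_token
    for m in messages:
        out += "<|im_start|>" + m["role"] + "\n" + m["content"].strip() + "<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        out += "<|im_start|>assistant\n"
    return out

messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
print(render_chat(messages))
```

The real tokenizer template also validates role alternation and tokenizes the rendered string; this sketch only shows the string construction.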
For example, see Olmo-3-7B-Instruct's full chat template. Oof.
There are many ways that working with chat templates is difficult.
Different model families use different special tokens, but the structure is the same.
Zephyr (Tunstall et al., 2024):
<|system|>
You are a friendly chatbot...</s>
<|user|>
How many helicopters can a human eat in one sitting?</s>
<|assistant|>
Tulu (Lambert et al., 2024):
<|user|>
How are you doing?
<|assistant|>
I'm just a computer program, so I don't have feelings, but I'm
functioning as expected. How can I assist you today?<|endoftext|>
These are applied via Jinja templates stored in the tokenizer config (apply_chat_template).
Conversations extend naturally by alternating roles:
<|im_start|>system
You are a friendly chatbot who always responds in the style of a pirate<|im_end|>
<|im_start|>user
How many helicopters can a human eat in one sitting?<|im_end|>
<|im_start|>assistant
Oh just 6.<|im_end|>
<|im_start|>user
Are you sure about that?<|im_end|>
<|im_start|>assistant
The entire history is packed into one token sequence — the model sees all prior turns as context when generating.
OpenAI released Harmony alongside gpt-oss, replacing Jinja with a Rust-based renderer that separates output into channels:
analysis — internal reasoning / chain-of-thought (hidden from user)
commentary — tool calls go here
final — user-facing response
<|start|>assistant<|channel|>analysis<|message|>I need to check the weather...<|end|>
<|start|>assistant<|channel|>commentary to=functions.get_weather
<|constrain|>json<|message|>{"location":"SF"}<|call|>
<|start|>assistant<|channel|>final<|message|>It's 65°F and sunny in SF.<|return|>
Why? Jinja can’t cleanly handle tool calls (tojson escaping, ambiguous boundaries). Harmony moves the complexity into a dedicated library (openai-harmony on PyPI) instead of a template string.
from openai_harmony import (Role, Message, Conversation,
SystemContent, load_harmony_encoding, HarmonyEncodingName)
# Build messages
system = Message.from_role_and_content(Role.SYSTEM, SystemContent.new())
user = Message.from_role_and_content(Role.USER, "What is 2 + 2?")
# Assemble a conversation
convo = Conversation.from_messages([system, user])
# Render to tokens using the OSS encoding
enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)
# Decode back to text
print(enc.decode_utf8(tokens))
The same render_conversation_for_completion / render_conversation_for_training split replaces add_generation_prompt from Jinja templates.
Prompt masking: during IFT, the model only learns to predict assistant responses, not user messages.
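A minimal sketch of prompt masking, assuming each token is tagged with the role it came from (build_labels is a hypothetical helper; -100 is the standard ignore index for cross-entropy losses):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_labels(token_ids, roles):
    # Only assistant tokens contribute to the loss; system and user
    # (prompt) tokens are masked out with IGNORE_INDEX.
    return [tok if role == "assistant" else IGNORE_INDEX
            for tok, role in zip(token_ids, roles)]

token_ids = [11, 12, 13, 14, 15, 16]
roles     = ["system", "user", "user", "assistant", "assistant", "assistant"]
print(build_labels(token_ids, roles))
# -> [-100, -100, -100, 14, 15, 16]
```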
For multi-turn conversations, two strategies:
Data quality matters more than quantity:
Training details differ from pretraining:
The amount of instruction data needed has evolved rapidly:
Scaling up the number of prompts quickly improved performance. Now, reasoning models use more compute and tokens to train, via more tokens per prompt in SFT (and longer context lengths).
Successful post-training starts with meaningful evaluations for targeted skills and prompts of representative queries for those skills.
All post-training stages require prompts in the distribution of target tasks. Example prompt budgets from (Lambert et al., 2024):
Large variance on these numbers is possible — but the key point is that prompts are the starting material for every stage. Recent work has of course scaled up various RL training stages :D!
Synthetic data has become the dominant approach for building SFT datasets – following (Wang et al., 2023):
Quality of responses is the simpler part — strong models (e.g. GPT-4o, Llama 3.1 405B) generate good completions to most instructions.
Human data is still needed for out-of-distribution or novel tasks. At the time of recording, these are "knowledge work" tasks like healthcare and law.
Two repeated and parallelizable tracks:
Data mixing:
Data curation:
These two tracks iterate: mix what you have, evaluate, curate what’s missing, mix again.
In RLHF, the reward model plays the role of the environment — it returns a reward signal that tells the policy how well it did.
The key difference from standard RL: in RLHF, we get to control and learn this reward function from human preferences, rather than having it fixed by the environment.
A reward model compresses complex, subjective human judgments into a single scalar score.

A reinforcement learning problem is often written as a Markov Decision Process (MDP):

Classical RL
RLHF

A human (or AI) annotator sees two responses to the same prompt and picks the better one — this preference pair becomes training data for the reward model.
Explain why the sky is blue in one sentence.
The sky is blue due to Rayleigh scattering, where shorter blue wavelengths of sunlight are scattered more by atmospheric molecules than longer wavelengths.
Explain why the sky is blue in one sentence.
The sky appears blue because of the way light interacts with the atmosphere and stuff, it's basically just physics.
A probability model is a mathematical form that we assume matches how real judgments work — then we fit its parameters to data. The canonical reward model uses the Bradley-Terry model (1952). Given two items i and j, the probability that a judge prefers i over j:
Each item has a latent strength p_i > 0. Reparametrizing with p_i = e^{r_i} lets us work with unbounded scores r_i \in \mathbb{R} — which is what a neural network naturally outputs:
Only score differences matter — adding the same constant to all scores leaves preferences unchanged.
This is not the only possible model — but it’s what worked in early RLHF and has stuck. It’s a useful approximation, not a law of nature.
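A tiny numeric check of the reparametrized form (bt_prob is a hypothetical helper), showing that adding the same constant to both scores leaves the preference probability unchanged:

```python
import math

def bt_prob(r_i, r_j):
    # P(i preferred over j) under Bradley-Terry with scores r_i, r_j:
    # exp(r_i) / (exp(r_i) + exp(r_j)) == sigmoid(r_i - r_j)
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))

p = bt_prob(1.5, 0.5)
p_shifted = bt_prob(1.5 + 10.0, 0.5 + 10.0)  # shift both by the same constant
assert abs(p - p_shifted) < 1e-12  # only the score difference matters
```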
Given a prompt x, a chosen completion y_c, and a rejected completion y_r, we score both with a reward model r_\theta:
We want to find \theta that maximizes this probability — i.e. the model should assign a higher score to the completion humans preferred.
Start from maximizing the preference probability:
Divide numerator and denominator by \exp(r_\theta(y_c \mid x)):
The numerator becomes 1, and in the denominator \frac{\exp(r_\theta(y_r \mid x))}{\exp(r_\theta(y_c \mid x))} = \exp(r_\theta(y_r \mid x) - r_\theta(y_c \mid x)):
Rewrite as \frac{1}{1 + \exp(-(r_\theta(y_c \mid x) - r_\theta(y_r \mid x)))}. Recognize the sigmoid \sigma(z) = \frac{1}{1 + e^{-z}}:
\log is monotonic, so \arg\max \sigma(\cdot) = \arg\max \log \sigma(\cdot):
Flip the sign to turn maximization into minimization (a loss):
The first form, as in InstructGPT (Ouyang et al., 2022):
The second form, as in Anthropic’s work (Askell et al., 2021):
These are equivalent: using \sigma(\Delta) = \frac{1}{1 + e^{-\Delta}} gives -\log\sigma(\Delta) = \log(1 + e^{-\Delta}).
Both are just binary cross-entropy — the reward model is learning to classify which completion was preferred.
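A quick numeric check of this identity (helper names are illustrative):

```python
import math

def neg_log_sigmoid(delta):
    # The InstructGPT-style form: -log sigmoid(delta)
    return -math.log(1.0 / (1.0 + math.exp(-delta)))

def log1p_exp(delta):
    # The Anthropic-style form: log(1 + exp(-delta))
    return math.log(1.0 + math.exp(-delta))

# The two forms agree for any score difference delta
for delta in [-2.0, 0.0, 0.5, 3.0]:
    assert abs(neg_log_sigmoid(delta) - log1p_exp(delta)) < 1e-12
```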
The model computes a scalar score at the EOS token for each completion. The contrastive loss depends only on the score difference between chosen and rejected.

The most common implementation: append a linear head to a language model that outputs a single scalar.
import torch
import torch.nn as nn

class BradleyTerryRewardModel(nn.Module):
    def __init__(self, base_lm):
        super().__init__()
        self.lm = base_lm
        # Linear head mapping a hidden state to a single scalar reward
        self.head = nn.Linear(self.lm.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.lm(
            input_ids=input_ids, attention_mask=attention_mask,
            output_hidden_states=True, return_dict=True,
        ).hidden_states[-1]
        # Index of the last non-padding (EOS) token in each sequence
        lengths = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        seq_repr = hidden[batch_idx, lengths]
        # One scalar score per completion
        return self.head(seq_repr).squeeze(-1)
Given the model above, the training loss is just three lines:
rewards_chosen = model(**inputs_chosen)
rewards_rejected = model(**inputs_rejected)
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
Practical note: reward models are typically trained for only 1 epoch to avoid overfitting to the preference data.
When annotators provide Likert scale ratings (e.g. 1-5), the magnitude of the preference can inform training. Llama 2 proposed a margin term m(y_c, y_r):
For example, if chosen scores 5 and rejected scores 2, then m = 3.
This encourages the model to produce larger score gaps for strongly preferred pairs.
Note: Llama 3 removed the margin term — the team observed diminishing improvements at scale.
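The margin simply shifts the score difference inside the loss. A one-function sketch in pure Python (helper name and values are illustrative):

```python
import math

def margin_loss(r_c, r_r, m=0.0):
    # -log sigmoid((r_c - r_r) - m): the chosen completion must beat
    # the rejected one by at least the margin before the loss gets small.
    return -math.log(1.0 / (1.0 + math.exp(-((r_c - r_r) - m))))

# Mirroring the example above: chosen rated 5, rejected rated 2 -> m = 3.
# Same score gap, but the loss is larger when a margin is required:
assert margin_loss(5.0, 2.0, m=3.0) > margin_loss(5.0, 2.0)
```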
InstructGPT balances multiple comparisons per prompt to prevent overfitting (and included all prompts in the same batch at training time to make this work):
K-wise loss (Plackett-Luce model, used in Starling 7B/34B) handles full rankings over K completions:
When K = 2, this reduces to Bradley-Terry.
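A minimal pure-Python sketch of the Plackett-Luce negative log-likelihood (helper names are hypothetical), checking the K = 2 reduction numerically:

```python
import math

def plackett_luce_nll(scores_ranked):
    # scores_ranked: reward scores ordered best -> worst.
    # NLL = -sum_k log( exp(r_k) / sum_{j>=k} exp(r_j) );
    # the final factor is always 1, so it is skipped.
    nll = 0.0
    for k in range(len(scores_ranked) - 1):
        denom = sum(math.exp(r) for r in scores_ranked[k:])
        nll -= math.log(math.exp(scores_ranked[k]) / denom)
    return nll

def bt_nll(r_c, r_r):
    # Bradley-Terry loss: -log sigmoid(r_c - r_r)
    return -math.log(1.0 / (1.0 + math.exp(-(r_c - r_r))))

# With K = 2, the Plackett-Luce NLL equals the Bradley-Terry loss
assert abs(plackett_luce_nll([1.2, 0.3]) - bt_nll(1.2, 0.3)) < 1e-12
```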
For reasoning tasks, we often have verifiable correctness. ORMs train a per-token head using the completion-level correctness label repeated across completion tokens:
where r \in \{0,1\} is the completion-level correctness label, broadcast across completion tokens during training.
Key differences from Bradley-Terry: no pairwise comparisons needed — just correct/incorrect labels per response. And the model outputs a per-token probability of correctness, not a single score at the EOS token.
Bradley-Terry RM — score at the last token:
hidden = lm(
    **inputs, output_hidden_states=True
).hidden_states[-1]
# Extract EOS token representation
lengths = attention_mask.sum(dim=1) - 1
batch_idx = torch.arange(hidden.size(0))
seq_repr = hidden[batch_idx, lengths]
# Single scalar per completion
score = head(seq_repr).squeeze(-1)
# shape: (batch,)
ORM — score at every token:
hidden = lm(
    **inputs, output_hidden_states=True
).hidden_states[-1]
# Score every token position
logits = head(hidden).squeeze(-1)
# shape: (batch, seq_len)
Both start from the same base LM hidden states — the difference is where the head is applied.
Bradley-Terry RM loss — contrastive, needs pairs:
score_chosen = model(**inputs_chosen)
score_rejected = model(**inputs_rejected)
loss = -F.logsigmoid(score_chosen - score_rejected).mean()
One scalar per completion. Loss depends on the difference between chosen and rejected.
ORM loss — per-token, needs only labels:
logits = model(**inputs)
# shape: (batch, seq_len)
mask = labels != -100
loss = F.binary_cross_entropy_with_logits(
    logits[mask], labels[mask].float()
)
One score per token. Every completion token is trained against the same outcome label (r = 0 or 1).
The outcome label (r = 0 or 1) is broadcast to every completion token. Prompt tokens are masked. The model learns per-token correctness predictions from this repeated signal; across a larger training batch, the repeated labels still provide a diverse supervision signal.

At inference time, the ORM outputs a probability of correctness at every token. These token-level scores can then be aggregated into a response-level verifier score for filtering or reranking.
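A sketch of that aggregation step; the specific choice varies by implementation, so the final-token probability and the mean are both shown as illustrative options (response_score is a hypothetical helper):

```python
def response_score(token_probs, how="last"):
    # token_probs: per-token P(correct) from the ORM head.
    # Aggregation choice is illustrative -- common options include
    # the final-token probability or the mean over completion tokens.
    if how == "last":
        return token_probs[-1]
    return sum(token_probs) / len(token_probs)

probs = [0.4, 0.6, 0.8, 0.9]
print(response_score(probs, how="last"))  # -> 0.9
print(response_score(probs, how="mean"))  # -> 0.675
```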

Some papers use “outcome reward model” to mean a Bradley-Terry model trained on correct vs. incorrect completions. This conflates two different architectures:
| | BT RM (on correct/incorrect) | ORM (Cobbe et al.) |
|---|---|---|
| Min. Training Input | Two completions (pair) | One completion + label |
| Output | Single scalar at EOS | Per-token probability |
| Loss | -\log\sigma(r_c - r_{ic}) | Per-token binary cross-entropy |
| Head | Score at last token | Score at every token |
A BT model trained on (correct, incorrect) pairs is still a preference RM — it just uses correctness as the preference signal. A true ORM has a fundamentally different input-output structure.
PRMs score intermediate reasoning steps, not just final outcomes:
Labels are applied only at step boundaries (e.g. double newlines): incorrect (-1), neutral (0), correct (+1). All other tokens are masked during training.

3-class head (not binary like ORM):
# head outputs 3 classes per token:
# 0 = incorrect, 1 = neutral, 2 = correct
head = nn.Linear(hidden_size, 3)
hidden = lm(
    **inputs, output_hidden_states=True
).hidden_states[-1]
logits = head(hidden)
# shape: (batch, seq_len, 3)
Loss at step boundaries only:
# Labels: class index at step-end tokens,
# -100 everywhere else
mask = labels != -100
loss = F.cross_entropy(
    logits[mask], labels[mask]
)
vs. ORM: binary labels at every completion token. vs. PRM: 3-class labels at step boundaries only.
| Model | Predicts | Trained on | Output |
|---|---|---|---|
| Preference RM | Quality at EOS token | Pairwise (chosen vs. rejected) | Single scalar |
| ORM | Outcome signal over completion tokens | Binary outcome labels | Per-token probability |
| PRM | Per-step correctness | Step-level annotations | Score at step boundaries |
| Value Function | Expected remaining return | On-policy rollouts | Per-token expected return |
Key distinction — ORM vs. Value Function:
Same architecture, different semantics and supervision pipeline. More on value functions in the policy gradient lecture(s).
An alternative to training a reward model: prompt an LLM to judge quality (prompt from (Zheng et al., 2023)).
[System]
Please act as an impartial judge and evaluate the quality of the
responses provided by two AI assistants to the user question below.
...
After providing your explanation, output your final verdict by strictly
following this format: "[[A]]" if assistant A is better, "[[B]]" if
assistant B is better, and "[[C]]" for a tie.
Spawned benchmarks: MT-Bench, AlpacaEval, Arena-Hard, WildBench.
Current status: generative RMs tend to underperform trained reward models on RM evaluations, but are cheaper to set up and used widely in evaluation and training pipelines. An evaluation/utility mismatch exists.
I very happily helped kickstart this area with the first one (Lambert et al., 2025)! The RM evaluation landscape has expanded rapidly:
The bulk of early RM research focused on establishing benchmarks and identifying behavior modes. Training innovations (aspect-conditioned models, high-quality human datasets, scaling experiments) are still an active area.
Rejection sampling is a popular baseline for preference fine-tuning. The idea:
No policy gradients, no online RL — just filtered supervised learning.
The name comes from computational statistics: sample from a simple distribution, use a heuristic to accept/reject samples, approximating a complex target distribution.
The four stages: 0. Select prompts and reward model (reuse IFT prompts or curate new ones)
Used in WebGPT (Nakano et al., 2021), Anthropic’s Helpful and Harmless (Bai et al., 2022), Llama 2 Chat (Touvron et al., 2023), and many other seminal works.
This is still just offline data curation: generate first, then train on the filtered outputs.
Given M prompts and N completions each:
Each reward: r_{i,j} = \mathcal{R}(y_{i,j} \mid x_i)
Select the highest-scoring completion for each prompt independently:
Example (5 prompts, 4 completions):
Result: one completion per prompt, every prompt represented.
Flatten the reward matrix and select the top K pairs globally:
Here, R_{\text{flat}} is the length-MN vector made by flattening the reward matrix. Each selected flat index is then mapped back to its original (prompt, completion) pair.
Same example, top 5 overall:
Now prompt 3 gets two completions (0.9 and 0.7), while prompt 5 gets none. This optimizes for absolute quality but can bias toward easy prompts.
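Both selection strategies in a few lines of Python, with an illustrative reward matrix (values chosen to mirror the example above; 0-indexed, so "prompt 3" is row 2):

```python
rewards = [  # 5 prompts x 4 completions (illustrative values)
    [0.1, 0.8, 0.3, 0.2],
    [0.5, 0.2, 0.1, 0.3],
    [0.9, 0.7, 0.2, 0.1],
    [0.3, 0.6, 0.4, 0.2],
    [0.2, 0.1, 0.3, 0.25],
]

# Strategy 1: best completion per prompt (every prompt represented)
per_prompt = [(i, max(range(len(row)), key=row.__getitem__))
              for i, row in enumerate(rewards)]

# Strategy 2: top-K pairs globally (whole prompts may be dropped)
K = 5
flat = [(r, i, j) for i, row in enumerate(rewards) for j, r in enumerate(row)]
top_k = sorted(flat, reverse=True)[:K]

print(per_prompt)                       # one (prompt, completion) per prompt
print([(i, j) for _, i, j in top_k])    # prompt 2 twice, prompt 4 dropped
```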
The core hyperparameters are intuitive:
Practical tip: sort completions by length before batch RM inference to reduce padding token computation.
Open questions: how to sequence RS in a multi-stage pipeline, whether to use generations from multiple models, and optimal prompt selection. There hasn't been a fully open reproduction of this, which is surprising; my hunch is that training a reward model involves some subtle tricks.
Best-of-N (BoN) follows the same generate-and-score procedure, but skips the fine-tuning step:
BoN is the simplest possible reward-guided method: generate more, pick the best. Can also be done with verification / LLM-as-a-judge instead of traditional reward models.
Putting it all together — a simple path from pretrained model to preference-tuned model:
This is a strong baseline.
Next lectures will cover more powerful optimization methods: PPO, which optimizes a policy against a learned reward signal, and DPO, which optimizes directly from preference pairs rather than filtering training data.