Nathan Lambert
Course on RLHF and post-training. Chapters 1-3
Core properties:

Modern language models:

After pretraining we are left with a glorified autocomplete model, for example:
The president of the United States in 2006 was
George W. Bush, the governor of Florida in 2006 was Jeb Bush, and John McCain was an Arizona senator in 2006...
The earliest forms of modern post-trained (or RLHF-tuned) models shifted the continuation format to always conform to an “answering a question” style. An example of what early conversational models looked like is below:
The president of the United States in 2006 was
George W. Bush was the president of the United States in 2006. He served two terms in office, from January 20, 2001, to January 20, 2009.
How the different training stages change the model:
What this means for post-training:
Model responses quickly evolved to have a helpful, structured, conversational format. For example:
I'm giving a talk on RLHF tomorrow. Can you help me structure it?
Absolutely — here's a simple structure:
1. Start with the basics
2. Explain post-training
3. End with why it matters
A reinforcement learning problem is often written as a Markov Decision Process (MDP):

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$$

where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P$ the transition dynamics, $R$ the reward function, and $\gamma$ the discount factor.
Reinforcement learning basics:

Example: a thermostat agent learns over many episodes when to turn the heater on or off.
Example policy: $\pi(s)$ maps states to actions; here, turn the heater on when the temperature is below the target, otherwise turn it off.
State: cart position, velocity, pole angle, angular velocity
Action: push the cart left or right
Reward: +1 for every step the pole stays upright
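To make the loop concrete, here is a minimal random-policy rollout sketch. It assumes the gymnasium package (an assumption, not named in the source); any CartPole implementation works the same way:

```python
import gymnasium as gym  # assumed dependency; any CartPole env works

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)  # obs = [position, velocity, angle, angular velocity]

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random policy: push left (0) or right (1)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward  # +1 for every step the pole stays upright
    done = terminated or truncated

print(f"Episode return: {total_reward}")
```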

Each action changes the physics of the system. The full state update:

$$\ddot{\theta} = \frac{g\sin\theta + \cos\theta\left(\dfrac{-F - m_p l \dot{\theta}^2 \sin\theta}{m_c + m_p}\right)}{l\left(\dfrac{4}{3} - \dfrac{m_p \cos^2\theta}{m_c + m_p}\right)}, \qquad \ddot{x} = \frac{F + m_p l\left(\dot{\theta}^2 \sin\theta - \ddot{\theta}\cos\theta\right)}{m_c + m_p}$$

where $m_c$ is the cart mass, $m_p$ is the pole mass, $l$ is the pole length, $g$ is gravity, and $F$ is the applied force.
This is why classical RL is a multi-step control problem — each action changes the next state, and rewards accumulate across a trajectory.

In classical RL, the reward function is known — CartPole gives +1 per step. But for many tasks the reward is hard to write down: what makes a response helpful, a poem good, or an answer safe?
RLHF lets us optimize for behavior we can evaluate even when we cannot easily specify the reward.

A human (or AI) annotator sees two responses to the same prompt and picks the better one — this preference pair becomes training data for the reward model.
Prompt: Explain why the sky is blue in one sentence.
Chosen: The sky is blue due to Rayleigh scattering, where shorter blue wavelengths of sunlight are scattered more by atmospheric molecules than longer wavelengths.
Rejected: The sky appears blue because of the way light interacts with the atmosphere and stuff, it's basically just physics.
Classical RL vs. RLHF: in classical RL the reward comes from the environment; in RLHF it comes from a learned preference model.

Instruction tuning is the foundation of post-training. Also called Supervised Fine-tuning (SFT):
System message: “You are a helpful, harmless assistant.” A system message like this can be used to steer the model to specific personas or behaviors.
Write me a short poem about an optimistic goldfish.
Bright little goldfish
Finds a sunrise in each wave
Small bowl, endless hope
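Mechanically, SFT is just next-token cross-entropy computed only on the response tokens. A minimal sketch, assuming an HF-style model whose forward pass returns `.logits` (the `model`, `prompt_ids`, and `response_ids` names are hypothetical):

```python
import torch
import torch.nn.functional as F

# Sketch of the SFT objective: next-token cross-entropy, with the prompt
# tokens masked out so only the response contributes to the loss.

def sft_loss(model, prompt_ids, response_ids):
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(input_ids).logits  # (batch, seq_len, vocab_size)
    # Shift so position t predicts token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()
    # Mask prompt positions: only response tokens contribute to the loss.
    shift_labels[:, : prompt_ids.shape[-1] - 1] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```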
Overview:
The reward used in RLHF is the model predicting the probability that a given piece of text would be the "winning" or "chosen" completion in a pair/batch. Clever!
The probability model (Bradley–Terry) says a response should win when it gets a higher reward score:

$$P(y_w \succ y_l \mid x) = \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)$$

Training then minimizes the negative log-likelihood of the preferred response beating the rejected one:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]$$

Notation: $x$ is the prompt, $y_w$ the chosen completion, $y_l$ the rejected completion, $r_\theta$ the reward model's scalar score, and $\sigma$ the sigmoid function.
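A minimal sketch of this loss in code (the `reward_model` callable is a hypothetical stand-in for a scalar-head LM):

```python
import torch.nn.functional as F

# Pairwise reward-model loss. `reward_model` maps token ids to one
# scalar score per sequence.

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # Minimize -log sigma(r_w - r_l): push chosen scores above rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```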
Where everything comes together (and RLHF gets its name):

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\left[r_\theta(x, y)\right] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right)$$

The reference model $\pi_{\mathrm{ref}}$ keeps the policy anchored to the SFT model.
$D_{\mathrm{KL}}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right)$ measures how far the new policy moves from that reference on prompt $x$.
$\beta$ controls the tradeoff between improving behavior and staying close to what the model already knows.
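In PPO-style implementations this objective typically becomes a per-token shaped reward. A sketch under those assumptions (tensor names are illustrative, not a specific library's API):

```python
import torch

# Per-token reward used in PPO-style RLHF: a KL penalty against the
# reference model at every token, plus the reward-model score added on
# the final token. Log-prob tensors are assumed inputs of shape (seq_len,).

def kl_shaped_reward(rm_score: float,
                     policy_logprobs: torch.Tensor,
                     ref_logprobs: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    kl = policy_logprobs - ref_logprobs  # per-token log-ratio estimate of KL
    reward = -beta * kl                  # penalize drifting from the reference
    reward[-1] = reward[-1] + rm_score   # RM scores the full response once
    return reward
```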
Direct Preference Optimization (DPO)
DPO skips the separate reward model and optimizes the policy directly on preference pairs, using the policy's own log-probabilities (relative to the reference model) as an implicit reward:

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$
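A minimal sketch of this loss (argument names are illustrative; each is a summed log-probability of a completion under the policy or the frozen reference model):

```python
import torch.nn.functional as F

# DPO loss on a batch of preference pairs.

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigma(beta * (gap in log-ratios)): widen the margin between
    # chosen and rejected completions, anchored to the reference model.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```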
Rejection sampling: generate many completions, score them with the reward model, and fine-tune on the best.
Simple, stable, and widely used: Llama 2 (Touvron et al., 2023) and DeepSeek R1 (Guo et al., 2025) both include rejection sampling stages.
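A sketch of the loop (the `generate` and `reward_model` callables are hypothetical):

```python
# Rejection sampling: generate N completions per prompt, keep the one the
# reward model scores highest, then run SFT on the winners.

def rejection_sample(prompts, generate, reward_model, n=8):
    dataset = []
    for prompt in prompts:
        completions = [generate(prompt) for _ in range(n)]
        best = max(completions, key=lambda c: reward_model(prompt, c))
        dataset.append((prompt, best))
    return dataset  # fine-tune (SFT) on these prompt/best-completion pairs
```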
| | Rejection Sampling | Online RL (PPO) | DPO |
|---|---|---|---|
| Mechanism | Filter, then SFT | Generate, score, update policy | Direct gradient on preferences |
| Reward model | Required | Required | Implicit (no separate RM) |
| On-policy data | Yes (generate from current model) | Yes (generate each step) | No (fixed preference dataset) |
| Complexity | Low | High | Low |
All three optimize the same underlying objective — they differ in how they move the policy toward higher-reward completions. There is substantial debate over which gives the best final performance; online RL generally wins, but the evidence is mixed.
The reward model is a proxy, not ground truth. Even a well-trained RM is only correlated with real user satisfaction.
Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”
What this looks like in practice: the policy drifts toward responses that score well on the reward model but are worse for actual users (reward hacking).
The KL penalty (weighted by $\beta$) is the main defense — it limits how far the policy can drift from the reference model. But over-optimization is a fundamental tension in all preference-based training.
| | InstructGPT (2022) | Tülu 3 (2024) | DeepSeek R1 (2025) |
|---|---|---|---|
| Instruction data | ~10K | ~1M | 100K+ |
| Preference data | ~100K | ~1M | On-policy |
| RL stage | ~100K prompts | ~10K (RLVR) | N/A |
The overall trend is to use far more compute across all stages, with an increasing share shifting to RLVR.

Early on, RLHF had a well-documented, relatively simple recipe.

What began as an “RLHF” recipe evolved into a complex series of steps to get the final, best model (e.g. Nemotron 4 340B, Llama 3.1).
As time has passed since ChatGPT, the field has gone through multiple distinct phases (roughly):
Within 2024 the field shifted its focus to post-training: training stages evolved beyond the InstructGPT-style recipe, DPO proliferated, and RLHF came to be seen as just one tool (that you may not even need).
RLHF’s reputation was that its contributions to the final language models are minor.
“A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.”
LIMA: Less Is More for Alignment (2023)
This view of alignment (or RLHF) as teaching “format” sometimes made people think that post-training only makes minor changes to the model, describing finetuning as “just style transfer.”
The base model trained on trillions of tokens of web text has seen and learned from an extremely broad set of examples. The model at this stage contains far more latent capability than early post-training recipes were able to expose.
The question is: how does post-training interact with this latent capability?
An example: OLMoE — the same base model family, with only the post-training updated:
OLMoE-1B-7B-0924-Instruct (Sep. 2024): 38.44 avg. eval score
OLMoE-1B-7B-0125-Instruct (Jan. 2025): 45.62 avg. eval score
Base models determine the ceiling. Post-training’s job has been to reach it.
Simple post-training often doesn’t extract nearly enough performance (especially when the pace of progress is high).
“The superficial alignment hypothesis (SAH) posits that large language models learn most of their knowledge during pre-training, and that post-training merely surfaces this knowledge.”
Operationalising the Superficial Alignment Hypothesis via Task Complexity (2026)
The second paper, 3 years later, matches my intuition for post-training.
Reinforcement learning with verifiable rewards (RLVR): apply the same RL algorithms to LLMs when the answer can be checked directly. No need to train a reward model.
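A sketch of what such a verifiable reward can look like for math, assuming answers appear in a `\boxed{...}` span (an assumption, not a universal convention):

```python
import re

# Verifiable reward: extract the final answer from the completion and
# compare it to ground truth. No learned reward model involved.

def verifiable_reward(completion: str, ground_truth: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    answer = match.group(1).strip() if match else None
    return 1.0 if answer == ground_truth.strip() else 0.0
```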

| | Classical RL | RLHF | RLVR |
|---|---|---|---|
| Reward | Environment | Learned (proxy) | Verifiable (exact) |
| State transitions | Yes | No | No |
| Reward granularity | Per-step | Per-response | Per-response |
| Primary challenge | Explore-Exploit Trade-off | Over-optimization | Task generalization |
| Example | CartPole | Chat style tuning | Math reasoning |

Inference-time scaling: a log-linear relationship between inference compute (number of tokens generated) and downstream performance.
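One way to write this relationship (a sketch; $a$ and $b$ are task-dependent constants not given in the source):

$$\text{performance} \approx a + b \log(\text{inference-time compute})$$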

An often underplayed portion of the o1 release (and future reasoning/agentic models).
This results in a two-sided scaling landscape for training language models: both pretraining and post-training. The third axis of scaling is at inference time (no weight updates there).

One of the few “fully open” large-scale RL runs to date.

Post-training and RLHF are changing faster than perhaps ever before.