Nathan Lambert
Quito, Ecuador
11 March 2026
A cursory overview of RLHF, RLVR, and modern post-training recipes for language models.
Core properties:

Modern language models:
After pretraining we are left with a glorified autocomplete model, for example:
The president of the United States in 2006 was
George W. Bush, the governor of Florida in 2006 was Jeb Bush, and John McCain was an Arizona senator in 2006...
The earliest forms of modern post-trained (or RLHF-tuned) models shifted the continuation format to always conforming to the “answering a question style.” An example of what early conversational models looked like is below:
The president of the United States in 2006 was
George W. Bush was the president of the United States in 2006. He served two terms in office, from January 20, 2001, to January 20, 2009.
Model responses quickly evolved to richer formatting and structure:
I'm giving a talk on RLHF tomorrow. Can you help me structure it?
Absolutely — here's a simple structure:
1. Start with the basics
2. Explain post-training
3. End with why it matters


RLHF lets us optimize for behavior we can evaluate, even when we cannot easily specify the reward.

Explain the moon landing to a 6-year-old.
The Apollo program culminated in a successful lunar landing in 1969. Astronauts used a spacecraft to descend to the moon's surface and collect samples before returning to Earth.
Explain the moon landing to a 6-year-old.
People built a special rocket to go to the moon. Two astronauts landed there, walked around, and came home safely to tell everyone what they saw.


A reinforcement learning problem is often written as a Markov Decision Process (MDP), the tuple (\mathcal{S}, \mathcal{A}, P, R, \gamma): states, actions, transition dynamics, reward, and discount factor.
Classical RL
RLHF

Apply the same RL algorithms to LLMs when the answer can be checked directly. No need to train a reward model:
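A verifiable reward can be as simple as string-matching a final answer against a reference. A minimal sketch (the `\boxed{}` answer format and exact-match grading are illustrative assumptions; production graders normalize answers much more carefully):

```python
import re

def verify_math_answer(response: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 if the final boxed answer
    matches the reference exactly, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold.strip() else 0.0

print(verify_math_answer(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(verify_math_answer("I am not sure.", "42"))                    # 0.0
```

Because the reward is computed directly from the output, there is no reward model to over-optimize against.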

| | Classical RL | RLHF | RLVR |
|---|---|---|---|
| Reward | Environment | Learned (proxy) | Verifiable (exact) |
| State transitions | Yes | No | No |
| Reward granularity | Per-step | Per-response | Per-response |
| Primary challenge | Explore-Exploit Trade-off | Over-optimization | Task generalization |
| Example | CartPole | Chat style tuning | Math reasoning |


Instruction tuning is the foundation of post-training, also called Supervised Fine-tuning (SFT):
You are a helpful, harmless assistant. A system message like this can be used to steer the model to specific personas or behaviors.
Write me a short poem about an optimistic goldfish.
Bright little goldfish
Finds a sunrise in each wave
Small bowl, endless hope
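In SFT, the cross-entropy loss is typically computed only on the response tokens, with the system prompt and user turn masked out. A minimal sketch, assuming toy per-token log-probs are already given:

```python
# Sketch of SFT loss masking: only the assistant's response tokens
# contribute to the loss; prompt tokens are masked out (mask = 0).
def sft_loss(token_logprobs, loss_mask):
    """Negative log-likelihood averaged over unmasked (response) tokens."""
    kept = [lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return -sum(kept) / len(kept)

# Two prompt tokens (mask 0) followed by three response tokens (mask 1).
logprobs = [-0.1, -0.2, -1.5, -0.7, -0.9]
mask     = [0,    0,    1,    1,    1]
print(round(sft_loss(logprobs, mask), 4))  # 1.0333
```

Masking the prompt keeps the model from wasting capacity re-learning to predict its own inputs.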
Overview:
The probability model says a response should win when it gets a higher reward score:
p(y_w \succ y_l \mid x) = \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)
Training then minimizes the negative log-likelihood of the preferred response beating the rejected one:
\mathcal{L}(\theta) = -\log \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)
Notation: x is the prompt, y_w the chosen completion, y_l the rejected completion, r_\theta the reward model, and \sigma the sigmoid function.
The reward used in RLHF is the model predicting the probability that a given piece of text would be the "winning" or "chosen" completion in a pair/batch. Clever!
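The pairwise (Bradley–Terry) loss above reduces to a one-liner once the reward model has scored both completions. A sketch, assuming scalar rewards for the chosen and rejected responses:

```python
import math

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the
    rejected one: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the reward margin grows:
print(round(pairwise_rm_loss(2.0, 0.0), 4))  # 0.1269
print(round(pairwise_rm_loss(0.0, 0.0), 4))  # 0.6931 (log 2: a coin flip)
```

Note that only the margin between the two rewards matters, so reward-model scores are only meaningful relative to each other.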
Where everything comes together (and RLHF gets its name). The RLHF objective maximizes reward while staying close to a reference model:
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r_\theta(x, y) \right] - \beta\, D_{\mathrm{KL}}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right)
The reference model \pi_{\mathrm{ref}} keeps the policy anchored to the SFT model.
D_{\mathrm{KL}}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right) measures how far the new policy moves from that reference on prompt x.
\beta controls the tradeoff between improving behavior and staying close to what the model already knows.
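In practice the KL penalty is often folded into the per-sample reward the RL algorithm sees. A minimal sketch, assuming per-token log-probs from the policy and the frozen reference model and a simple sample-based KL estimate:

```python
# Sketch of a KL-penalized RLHF reward. The reward-model score is offset
# by beta times an estimate of the KL divergence from the reference model.
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Sequence-level reward: r(x, y) - beta * sum_t [log pi_t - log pi_ref_t]."""
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# If the policy drifts from the reference (its log-probs rise relative to
# the reference on its own samples), the KL term eats into the reward.
print(round(kl_penalized_reward(1.0, [-0.5, -0.4], [-0.9, -0.8], beta=0.1), 2))  # 0.92
```

Larger beta pulls the policy back toward the reference; smaller beta lets it chase the reward model harder, at the cost of over-optimization risk.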
Direct Preference Optimization (DPO)
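DPO skips the explicit reward model and optimizes the policy directly on preference pairs. A sketch of the DPO loss for one pair, assuming summed sequence log-probs under the policy and the frozen reference model:

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO loss: -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                     - (log pi(y_l) - log pi_ref(y_l))])."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Raising the chosen response's log-prob relative to the reference,
# while lowering the rejected one's, shrinks the loss:
print(round(dpo_loss(pi_w=-3.0, pi_l=-5.0, ref_w=-4.0, ref_l=-4.0), 4))  # 0.5981
```

The implicit reward in DPO is beta times the policy-to-reference log-ratio, which is why no separate reward model is needed.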
| | InstructGPT (2022) | Tülu 3 (2024) | DeepSeek R1 (2025) |
|---|---|---|---|
| Instruction data | ~10K | ~1M | 100K+ |
| Preference data | ~100K | ~1M | On-policy |
| RL stage | ~100K prompts | ~10K (RLVR) | N/A |
An overall trend is to use far more compute across all the stages, while shifting more of it toward RLVR.

Early on, RLHF had a well-documented, relatively simple recipe.

What began as an “RLHF” recipe evolved into a complex series of steps to get the final, best model (e.g. Nemotron 4 340B, Llama 3.1).
As time has passed since ChatGPT, the field has gone through multiple distinct phases (roughly):
Within 2024 the field shifted its focus to post-training: training stages evolved beyond the InstructGPT-style recipe, DPO proliferated, and RLHF came to be viewed as just one tool among many (one you may not even need).
RLHF’s reputation was that its contributions are minor on the final language models.
“A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.”
LIMA: Less Is More for Alignment (2023)
Sometimes this view of alignment (or RLHF) teaching “format” made people think that post-training only made minor changes to the model. This would describe finetuning as “just style transfer.”
The base model trained on trillions of tokens of web text has seen and learned from an extremely broad set of examples. The model at this stage contains far more latent capability than early post-training recipes were able to expose.
The question is: How does post-training interact with these?
An example, OLMoE — same base model family, updated only post-training:
OLMoE-1B-7B-0924-Instruct (Sep. 2024): 38.44 avg. eval score
OLMoE-1B-7B-0125-Instruct (Jan. 2025): 45.62 avg. eval score
Base models determine the ceiling. Post-training’s job has been to reach it.
“The superficial alignment hypothesis (SAH) posits that large language models learn most of their knowledge during pre-training, and that post-training merely surfaces this knowledge.”
Operationalising the Superficial Alignment Hypothesis via Task Complexity (2026)
The second paper, 3 years later, matches my intuition for post-training.

A log-linear relationship between inference compute (number of tokens generated) and downstream performance.
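That relationship can be sketched with a toy fit (the coefficients below are made-up placeholders, not numbers from any real model): downstream accuracy grows roughly linearly in the logarithm of tokens generated.

```python
import math

def fitted_accuracy(tokens: int, a: float = 0.05, b: float = 0.08) -> float:
    """Hypothetical log-linear fit: accuracy ~ a + b * log(tokens).
    a and b are illustrative constants, not measured values."""
    return a + b * math.log(tokens)

# Each 4x increase in generated tokens buys a roughly constant accuracy bump:
for n in [256, 1024, 4096, 16384]:
    print(n, round(fitted_accuracy(n), 3))
```

The practical consequence: letting a reasoning model "think longer" is a tunable knob, with diminishing absolute returns per token but steady returns per doubling.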

An often underplayed portion of the o1 release (and future reasoning/agentic models).
This results in a two-sided scaling landscape for training language models, spanning both pretraining and post-training. The third axis of scaling is inference itself (no weight updates there).



One of the few “fully open” large-scale RL runs to date.

Post-training and RLHF are changing faster than maybe ever before.
Full lecture slides coming to rlhfbook.com/course and YouTube @natolambert!
Sorry I could not make it in person!
Contact: nathan@natolambert.com
Newsletter: interconnects.ai
rlhfbook.com