Nathan Lambert
Quito, Ecuador
11 March 2026
A cursory overview of RLHF, RLVR, and modern post-training recipes for language models.
Core properties:

Modern language models:
After pretraining we are left with a glorified autocomplete model, for example:
The president of the United States in 2006 was
George W. Bush, the governor of Florida in 2006 was Jeb Bush, and John McCain was an Arizona senator in 2006...
The earliest forms of modern post-trained (or RLHF-tuned) models shifted the continuation format to always conforming to the “answering a question style.” An example of what early conversational models looked like is below:
The president of the United States in 2006 was
George W. Bush was the president of the United States in 2006. He served two terms in office, from January 20, 2001, to January 20, 2009.
Model responses quickly evolved to richer formatting and structure:
I'm giving a talk on RLHF tomorrow. Can you help me structure it?
Absolutely — here's a simple structure:
1. Start with the basics
2. Explain post-training
3. End with why it matters


RLHF lets us optimize for behavior we can evaluate, even when we cannot easily specify the reward.

Explain the moon landing to a 6-year-old.
The Apollo program culminated in a successful lunar landing in 1969. Astronauts used a spacecraft to descend to the moon's surface and collect samples before returning to Earth.
Explain the moon landing to a 6-year-old.
People built a special rocket to go to the moon. Two astronauts landed there, walked around, and came home safely to tell everyone what they saw.


A reinforcement learning problem is often written as a Markov Decision Process (MDP), the tuple (\mathcal{S}, \mathcal{A}, P, R, \gamma): states, actions, transition dynamics, reward, and discount factor.
Classical RL
RLHF

Apply the same RL algorithms to LLMs when the answer can be checked directly. No need to train a reward model:
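A verifiable reward can be as simple as string-matching a final answer against a reference. A minimal sketch (the `\boxed{}` answer format and exact-match grading are illustrative assumptions; production graders normalize answers much more carefully):

```python
import re

def verify_math_answer(response: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 if the final boxed answer
    matches the reference exactly, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold.strip() else 0.0

print(verify_math_answer(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(verify_math_answer("I am not sure.", "42"))                    # 0.0
```

Because the reward is computed directly from the output, there is no reward model to over-optimize against.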

| | Classical RL | RLHF | RLVR |
|---|---|---|---|
| Reward | Environment | Learned (proxy) | Verifiable (exact) |
| State transitions | Yes | No | No |
| Reward granularity | Per-step | Per-response | Per-response |
| Primary challenge | Explore-Exploit Trade-off | Over-optimization | Task generalization |
| Example | CartPole | Chat style tuning | Math reasoning |


Instruction tuning is the foundation of post-training, also called Supervised Fine-tuning (SFT):
You are a helpful, harmless assistant. A system message like this can be used to steer the model to specific personas or behaviors.
Write me a short poem about an optimistic goldfish.
Bright little goldfish
Finds a sunrise in each wave
Small bowl, endless hope
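In SFT, the cross-entropy loss is typically computed only on the response tokens, with the system prompt and user turn masked out. A minimal sketch, assuming toy per-token log-probs are already given:

```python
# Sketch of SFT loss masking: only the assistant's response tokens
# contribute to the loss; prompt tokens are masked out (mask = 0).
def sft_loss(token_logprobs, loss_mask):
    """Negative log-likelihood averaged over unmasked (response) tokens."""
    kept = [lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return -sum(kept) / len(kept)

# Two prompt tokens (mask 0) followed by three response tokens (mask 1).
logprobs = [-0.1, -0.2, -1.5, -0.7, -0.9]
mask     = [0,    0,    1,    1,    1]
print(round(sft_loss(logprobs, mask), 4))  # 1.0333
```

Masking the prompt keeps the model from wasting capacity re-learning to predict its own inputs.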
Overview:
The probability model says a response should win when it gets a higher reward score:
p(y_w \succ y_l \mid x) = \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)
Training then minimizes the negative log-likelihood of the preferred response beating the rejected one:
\mathcal{L}(\theta) = -\log \sigma\!\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)
Notation: x is the prompt, y_w the chosen completion, y_l the rejected completion, r_\theta the reward model, and \sigma the sigmoid function.
The reward used in RLHF is the model predicting the probability that a given piece of text would be the "winning" or "chosen" completion in a pair/batch. Clever!
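The pairwise (Bradley–Terry) loss above reduces to a one-liner once the reward model has scored both completions. A sketch, assuming scalar rewards for the chosen and rejected responses:

```python
import math

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the
    rejected one: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the reward margin grows:
print(round(pairwise_rm_loss(2.0, 0.0), 4))  # 0.1269
print(round(pairwise_rm_loss(0.0, 0.0), 4))  # 0.6931 (log 2: a coin flip)
```

Note that only the margin between the two rewards matters, so reward-model scores are only meaningful relative to each other.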
Where everything comes together (and RLHF gets its name). The RLHF objective maximizes reward while staying close to a reference model:
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r_\theta(x, y) \right] - \beta\, D_{\mathrm{KL}}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right)
The reference model \pi_{\mathrm{ref}} keeps the policy anchored to the SFT model.
D_{\mathrm{KL}}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right) measures how far the new policy moves from that reference on prompt x.
\beta controls the tradeoff between improving behavior and staying close to what the model already knows.
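In practice the KL penalty is often folded into the per-sample reward the RL algorithm sees. A minimal sketch, assuming per-token log-probs from the policy and the frozen reference model and a simple sample-based KL estimate:

```python
# Sketch of a KL-penalized RLHF reward. The reward-model score is offset
# by beta times an estimate of the KL divergence from the reference model.
def kl_penalized_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Sequence-level reward: r(x, y) - beta * sum_t [log pi_t - log pi_ref_t]."""
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# If the policy drifts from the reference (its log-probs rise relative to
# the reference on its own samples), the KL term eats into the reward.
print(round(kl_penalized_reward(1.0, [-0.5, -0.4], [-0.9, -0.8], beta=0.1), 2))  # 0.92
```

Larger beta pulls the policy back toward the reference; smaller beta lets it chase the reward model harder, at the cost of over-optimization risk.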
Direct Preference Optimization (DPO)
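DPO skips the explicit reward model and optimizes the policy directly on preference pairs. A sketch of the DPO loss for one pair, assuming summed sequence log-probs under the policy and the frozen reference model:

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO loss: -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                     - (log pi(y_l) - log pi_ref(y_l))])."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Raising the chosen response's log-prob relative to the reference,
# while lowering the rejected one's, shrinks the loss:
print(round(dpo_loss(pi_w=-3.0, pi_l=-5.0, ref_w=-4.0, ref_l=-4.0), 4))  # 0.5981
```

The implicit reward in DPO is beta times the policy-to-reference log-ratio, which is why no separate reward model is needed.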
| | InstructGPT (2022) | Tülu 3 (2024) | DeepSeek R1 (2025) |
|---|---|---|---|
| Instruction data | ~10K | ~1M | 100K+ |
| Preference data | ~100K | ~1M | On-policy |
| RL stage | ~100K prompts | ~10K (RLVR) | N/A |
An overall trend is to use far more compute across all the stages, while shifting more of it toward RLVR.

Early on, RLHF had a well-documented, relatively simple recipe.

What began as an “RLHF” recipe evolved into a complex series of steps to get the final, best model (e.g. Nemotron 4 340B, Llama 3.1).
As time has passed since ChatGPT, the field has gone through multiple distinct phases (roughly):
Within 2024 the field shifted its focus to post-training: training stages evolved beyond the InstructGPT-style recipe, DPO proliferated, and RLHF came to be viewed as just one tool among many (one you may not even need).
RLHF’s reputation was that its contributions are minor on the final language models.
“A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.”
LIMA: Less Is More for Alignment (2023)
Sometimes this view of alignment (or RLHF) teaching “format” made people think that post-training only made minor changes to the model. This would describe finetuning as “just style transfer.”
The base model trained on trillions of tokens of web text has seen and learned from an extremely broad set of examples. The model at this stage contains far more latent capability than early post-training recipes were able to expose.
The question is: How does post-training interact with these?
An example, OLMoE — same base model family, updated only post-training:
OLMoE-1B-7B-0924-Instruct (Sep. 2024): 38.44 avg. eval score
OLMoE-1B-7B-0125-Instruct (Jan. 2025): 45.62 avg. eval score
Base models determine the ceiling. Post-training’s job has been to reach it.
“The superficial alignment hypothesis (SAH) posits that large language models learn most of their knowledge during pre-training, and that post-training merely surfaces this knowledge.”
Operationalising the Superficial Alignment Hypothesis via Task Complexity (2026)
The second paper, 3 years later, matches my intuition for post-training.

A log-linear relationship between inference compute (number of tokens generated) and downstream performance.
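That relationship can be sketched with a toy fit (the coefficients below are made-up placeholders, not numbers from any real model): downstream accuracy grows roughly linearly in the logarithm of tokens generated.

```python
import math

def fitted_accuracy(tokens: int, a: float = 0.05, b: float = 0.08) -> float:
    """Hypothetical log-linear fit: accuracy ~ a + b * log(tokens).
    a and b are illustrative constants, not measured values."""
    return a + b * math.log(tokens)

# Each 4x increase in generated tokens buys a roughly constant accuracy bump:
for n in [256, 1024, 4096, 16384]:
    print(n, round(fitted_accuracy(n), 3))
```

The practical consequence: letting a reasoning model "think longer" is a tunable knob, with diminishing absolute returns per token but steady returns per doubling.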

An often underplayed portion of the o1 release (and future reasoning/agentic models).
This results in a two-sided scaling landscape for training language models, spanning both pretraining and post-training. The third axis of scaling is inference itself (no weight updates there).



One of the few “fully open” large-scale RL runs to date.

Post-training and RLHF are changing faster than maybe ever before.
Full lecture slides coming to rlhfbook.com/course and YouTube @natolambert!
Sorry I could not make it in person!
Contact: nathan@natolambert.com
Newsletter: interconnects.ai
rlhfbook.com