An introduction to reinforcement learning from human feedback and post-training

SALA 2026

Nathan Lambert

Quito, Ecuador
11 March 2026

A cursory overview of RLHF, RLVR, and modern post-training recipes for language models.

What is a language model?

Core properties:

  • A language model assigns probabilities to text.
  • Text is broken into tokens (chunks of words or characters), which are the model's internal representation.
  • Given previous tokens, it predicts the next token. Repeating this produces a completion one step at a time (this is called autoregressive).
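The autoregressive loop in the last bullet can be sketched with a toy model. The vocabulary and probabilities below are invented purely for illustration; they stand in for the softmax output a real neural network would compute.

```python
import random

# A toy "language model": maps a context (tuple of previous tokens) to
# next-token probabilities. These numbers are made up for illustration.
TOY_LM = {
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("the", "cat", "sat"): {"<eos>": 1.0},
}

def generate(prompt, max_tokens=10, seed=0):
    """Autoregressive decoding: sample one token, append it, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = TOY_LM.get(tuple(tokens))
        if dist is None:  # unseen context: the toy table has no prediction
            break
        # Sample the next token in proportion to its probability.
        next_token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens
```

Each step conditions only on the tokens generated so far, which is exactly what "autoregressive" means.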
The original model architecture diagram for the Transformer. 2017.

What is a (modern) language model?

Modern language models:

  • Have billions to trillions of parameters.
  • Largely downstream of the Transformer architecture, which popularized the self-attention mechanism along with fully-connected feed-forward layers.
  • Predict and work over much more than text: Gemini and ChatGPT work with images, audio, and video.
The original model architecture diagram for the Transformer. 2017.

2017: The Transformer is born

  • 2017: the Transformer is born
The original model architecture diagram for the Transformer. 2017.

2018: GPT-1, ELMo, and BERT

  • 2017: the Transformer is born
  • 2018: GPT-1, ELMo, and BERT released
The language model architecture for GPT-1. 2018.

2019: GPT-2 and scaling laws

  • 2017: the Transformer is born
  • 2018: GPT-1, ELMo, and BERT released
  • 2019: GPT-2 and scaling laws
The famous scaling laws plots. 2020.

2020: GPT-3 surprising capabilities

  • 2017: the Transformer is born
  • 2018: GPT-1, ELMo, and BERT released
  • 2019: GPT-2 and scaling laws
  • 2020: GPT-3 surprising capabilities
GPT-3 was known for expanding the idea of in-context learning and few-shot prompting. Screenshot from the paper.

2021: Stochastic Parrots

  • 2017: the Transformer is born
  • 2018: GPT-1, ELMo, and BERT released
  • 2019: GPT-2 and scaling laws
  • 2020: GPT-3 surprising capabilities
  • 2021: Stochastic Parrots

2022: ChatGPT

  • 2017: the Transformer is born
  • 2018: GPT-1, ELMo, and BERT released
  • 2019: GPT-2 and scaling laws
  • 2020: GPT-3 surprising capabilities
  • 2021: Stochastic Parrots
  • 2022: ChatGPT

2023: GPT-4 and frontier-scale

  • 2017: the Transformer is born
  • 2018: GPT-1, ELMo, and BERT released
  • 2019: GPT-2 and scaling laws
  • 2020: GPT-3 surprising capabilities
  • 2021: Stochastic Parrots
  • 2022: ChatGPT
  • 2023: GPT-4 and frontier-scale
An image where Nvidia CEO Jensen Huang supposedly leaked that GPT-4 was an ~2T parameter MoE model.

2024: o1 and reasoning models

  • 2017: the Transformer is born
  • 2018: GPT-1, ELMo, and BERT released
  • 2019: GPT-2 and scaling laws
  • 2020: GPT-3 surprising capabilities
  • 2021: Stochastic Parrots
  • 2022: ChatGPT
  • 2023: GPT-4 and frontier-scale
  • 2024: o1 and reasoning models
The famous test-time scaling plot from OpenAI’s o1 announcement.

2025: o3, Claude Code, and agents

  • 2017: the Transformer is born
  • 2018: GPT-1, ELMo, and BERT released
  • 2019: GPT-2 and scaling laws
  • 2020: GPT-3 surprising capabilities
  • 2021: Stochastic Parrots
  • 2022: ChatGPT
  • 2023: GPT-4 and frontier-scale
  • 2024: o1 and reasoning models
  • 2025: o3, Claude Code, and agents

Pretraining: next-token prediction

  • Train on trillions of tokens of text from the web, books, code, and documents
    • Models are often trained on 5-50+ trillion tokens
    • 1T of text tokens is about 3-5 TB of data
    • Labs gather and filter 10-20X more data than is used for the model
    • Total data funnel targeted for models is on the order of petabytes
  • Objective: predict the next token in each sequence
  • Result: Incredible, flexible, useful models
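The next-token objective is just cross-entropy on the true continuation. A minimal sketch, using plain probability dicts as a stand-in for a network's softmax output:

```python
import math

def next_token_loss(probs_per_step, target_tokens):
    """Mean negative log-likelihood of the true next token.

    probs_per_step: list of dicts, one per position, mapping each candidate
    token to the model's predicted probability (a stand-in for a softmax).
    target_tokens: the actual next token at each position.
    """
    total = 0.0
    for probs, target in zip(probs_per_step, target_tokens):
        # Penalize assigning low probability to the token that actually came next.
        total += -math.log(probs[target])
    return total / len(target_tokens)
```

Pretraining minimizes exactly this quantity, averaged over trillions of tokens.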

A base model completes text

After pretraining we are left with a glorified autocomplete model, for example:1

User

The president of the United States in 2006 was

Assistant (Llama 3.1 405B Base)

George W. Bush, the governor of Florida in 2006 was Jeb Bush, and John McCain was an Arizona senator in 2006...

1: Base models are also becoming more flexible through midtraining and better data mixtures.

Post-training makes it answer like a chatbot

The earliest forms of modern post-trained (or RLHF-tuned) models shifted completions to always conform to a question-answering style. An example of what early conversational models looked like is below:

User

The president of the United States in 2006 was

Assistant (Tülu 3 405B)

George W. Bush was the president of the United States in 2006. He served two terms in office, from January 20, 2001, to January 20, 2009.

ChatGPT was when RLHF made the models even easier to use

Model responses evolved quickly to have:

  • Better format: direct, conversational answers
  • Better style: helpful, concise, markdown, etc.
  • Better product: people could actually use it every day
User

I'm giving a talk on RLHF tomorrow. Can you help me structure it?

Assistant

Absolutely — here's a simple structure:

1. Start with the basics

  • What a language model is
  • Why next-token prediction is not enough

2. Explain post-training

  • Instruction tuning
  • Preference tuning / RLHF
  • RLVR for verifiable tasks

3. End with why it matters

  • Better products like ChatGPT
  • Better reasoning in models like o1 ...

So what is reinforcement learning from human feedback (RLHF) anyway?

Which is the better backflip?

Why did people make RLHF?

  • Many objectives are easy for humans to judge, but hard to write as an exact reward function
  • In language models, what we want is often implicit: follow intent, be helpful, be harmless
  • Pretraining optimizes next-token prediction, not assistant behavior
  • Preference comparisons turn those human judgments into a scalable training signal

RLHF lets us optimize for behavior we can evaluate, even when we cannot easily specify the reward.

RLHF before language models

  • TAMER (Knox & Stone, 2008) — humans score agent actions to learn a reward
  • Christiano et al. 2017 — RLHF on Atari trajectory preferences
  • Ziegler et al. 2019 — first RLHF on language models
A recreation of the system diagram from Christiano et al. 2017.

RLHF for language models: compare two completions

User

Explain the moon landing to a 6-year-old.

Assistant (Completion A)

The Apollo program culminated in a successful lunar landing in 1969. Astronauts used a spacecraft to descend to the moon's surface and collect samples before returning to Earth.

User

Explain the moon landing to a 6-year-old.

Assistant (Completion B)

People built a special rocket to go to the moon. Two astronauts landed there, walked around, and came home safely to tell everyone what they saw.

Left: Human feedback; Right: Hand-designed reward function

Classical RL

A reinforcement learning problem is often written as a Markov Decision Process (MDP):

  • state space \mathcal{S}, action space \mathcal{A}
  • transition dynamics P(s_{t+1}\mid s_t, a_t)
  • reward function r(s_t, a_t) and discount \gamma
  • optimize cumulative return over a trajectory
\text{MDP } (\mathcal{S}, \mathcal{A}, P, r, \gamma)
J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]
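The return inside the expectation of J(\pi) is simple to compute for one trajectory; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t for one trajectory — the quantity that
    J(pi) averages over trajectories sampled from the policy."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

With gamma < 1, early rewards count more than later ones, which is what makes infinite-horizon returns finite.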

Classical RL vs. RLHF

Classical RL

  • Agent takes actions a_t in an environment with states s_t
  • Reward is a known function r(s_t, a_t) from the environment per step
  • Optimize cumulative return over a trajectory (total steps T)
J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]

RLHF

  • No environment — prompts sampled from a dataset
  • Reward is learned from human preferences (a proxy)
  • Response-level reward (bandit-style, not per-token)
  • Regularized with KL penalty to stay close to the base model
J(\pi) = \mathbb{E}\left[ r_\theta(x, y) \right] - \beta \, D_{\text{KL}}\!\left(\pi \| \pi_{\text{ref}}\right)

Reinforcement Learning with Verifiable Rewards (RLVR)

Apply the same RL algorithms to LLMs when the answer can be checked directly. No need to train a reward model:

  • E.g. math: check the final answer; code: run the tests.
  • No learned reward model — no proxy objective
  • Enables scaling RL compute on reasoning tasks
  • Unlocked inference-time scaling: spending more compute at generation time per problem increases performance log-linearly with respect to compute
  • RLVR was named by Tülu 3 (Lambert et al., 2024) and popularized by DeepSeek R1 (Guo et al., 2025)
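A verifiable reward is just a checking function. A minimal sketch for the math case, assuming the grader reads the final line of the completion as the answer (real graders normalize expressions; exact string match keeps the sketch simple):

```python
def math_reward(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the completion's final line
    matches the gold answer, else 0.0. No reward model needed."""
    predicted = completion.strip().splitlines()[-1].strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0
```

Because the reward is exact, there is no proxy objective to over-optimize, only the question of whether the skill generalizes.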

Comparing classical RL vs. LLM RLHF and RLVR

                     Classical RL                 RLHF                RLVR
Reward               Environment                  Learned (proxy)     Verifiable (exact)
State transitions    Yes                          No                  No
Reward granularity   Per-step                     Per-response        Per-response
Primary challenge    Explore-exploit trade-off    Over-optimization   Task generalization
Example              CartPole                     Chat style tuning   Math reasoning

The path to modern RLHF

InstructGPT’s 3-step RLHF recipe

The 3-step RLHF process figure from InstructGPT, which became “the standard” approach to RLHF for a few years.

Step 1/3: Instruction fine-tuning (IFT)

The foundation of post-training. Also called Supervised Fine-tuning (SFT):

  • Start from a pretrained language model
  • Collect demonstrations of desired assistant behavior
  • Train with standard supervised learning on prompt-response pairs.
    (different batch size, learning rate, etc.)
  • Model can now answer questions.
    Easy to use IFT to quickly adapt base model to many domains.
\mathcal{L}_{\mathrm{SFT}}(\theta) = - \sum_{(x, y^\star)} \sum_{t=1}^{|y^\star|} \log \pi_\theta \!\left(y^\star_t \mid x, y^\star_{<t}\right)
System

You are a helpful, harmless assistant. A system message like this can be used to steer the model to specific personas or behaviors.

User

Write me a short poem about an optimistic goldfish.

Assistant

Bright little goldfish Finds a sunrise in each wave Small bowl, endless hope
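The SFT loss \mathcal{L}_{\mathrm{SFT}} above can be sketched directly: average the negative log-probabilities over response tokens only, with a mask zeroing out the system and user tokens. The inputs here are plain lists standing in for a real model's per-token log-probs.

```python
def sft_loss(token_logprobs, loss_mask):
    """Supervised fine-tuning loss on one prompt-response pair.

    token_logprobs: log pi_theta(y_t | context) for every token in the sequence.
    loss_mask: 1 for assistant-response tokens, 0 for prompt/system tokens,
    so the model is only trained to produce the assistant's reply.
    """
    masked = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(masked) / len(masked)
```

This is ordinary supervised learning; only the data format (chat transcripts) and the masking are specific to post-training.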

Step 2/3: Reward modeling

Overview:

  • Collect comparisons between two model outputs for the same prompt
  • RLHF gets its name from collecting human feedback between completions, but today much of it is AI feedback
  • Train a reward model r_\phi(x, y) to score preferred completions higher

The probability model says a response should win when it gets a higher reward score:

P(y_w \succ y_l \mid x) = \sigma \!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)

Training then minimizes the negative log-likelihood of the preferred response beating the rejected one:

\mathcal{L}_{\mathrm{RM}}(\phi) = - \log \sigma \!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)

Notation:

  • x is the prompt
  • y_w is the winning response
  • y_l is the losing response
  • r_\phi(x, y) is the trained reward model
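The Bradley-Terry-style loss \mathcal{L}_{\mathrm{RM}} above is one line given the two reward scores; a minimal sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rm_loss(reward_chosen, reward_rejected):
    """Reward-model loss: -log sigma(r(x, y_w) - r(x, y_l)).
    Smaller when the chosen completion scores higher than the rejected one."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))
```

Note that only the difference in scores matters: adding a constant to every reward leaves the loss unchanged, so raw reward values have no absolute meaning.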

Step 2/3: Reward modeling

Core Idea

The reward used in RLHF is the model predicting the probability that a given piece of text would be the "winning" or "chosen" completion in a pair/batch. Clever!

The probability model says a response should win when it gets a higher reward score:

P(y_w \succ y_l \mid x) = \sigma \!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)

Training then minimizes the negative log-likelihood of the preferred response beating the rejected one:

\mathcal{L}_{\mathrm{RM}}(\phi) = - \log \sigma \!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)

Notation:

  • x is the prompt
  • y_w is the winning response
  • y_l is the losing response
  • r_\phi(x, y) is the trained reward model

Step 3/3: RL against the reward model

Where everything comes together (and RLHF gets its name):

  • Sample a batch of prompts x_i from the dataset \mathcal{D}
  • Generate completions y_i \sim \pi_\theta(\cdot \mid x_i) from the model being trained
  • Score them with the reward model r_\phi(x_i, y_i)
  • Add a KL penalty so the policy stays close to the SFT/reference model.1
  • Update the policy with a policy-gradient RL algorithm (Proximal Policy Optimization, PPO in InstructGPT & ChatGPT)
J(\pi) = \mathbb{E}\!\left[r_\phi(x, y)\right] - \beta D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right)

1: KL divergence measures how much the current policy differs from the reference model. For discrete outputs, D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{ref}})=\mathbb{E}_{y \sim \pi}\!\left[\log \pi(y \mid x)-\log \pi_{\mathrm{ref}}(y \mid x)\right]. People often colloquially call this the “KL distance” between the models, even though it is not a true metric.
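The KL term in the footnote is typically estimated from the sampled completion itself; a sketch of the per-token Monte Carlo estimate, with log-probs passed in as plain lists:

```python
def kl_penalty_estimate(logprobs_pi, logprobs_ref):
    """One-sample estimate of D_KL(pi || pi_ref) for a sampled completion:
    the mean over tokens of log pi(y_t | ...) - log pi_ref(y_t | ...).

    A single-sample estimate can be negative even though the true KL is >= 0;
    it is only correct in expectation over samples from pi.
    """
    diffs = [lp - lr for lp, lr in zip(logprobs_pi, logprobs_ref)]
    return sum(diffs) / len(diffs)
```

Scaled by \beta and subtracted from the reward, this is the penalty that keeps the policy anchored to the reference model during the RL stage.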

The RLHF objective, unpacked

\max_{\pi} \; \mathbb{E}_{x \sim D,\; y \sim \pi(\cdot \mid x)} \underbrace{r_\phi(x, y)}_{\text{maximize the reward}} - \underbrace{\beta D_{\mathrm{KL}}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right)}_{\text{but don't change the model too much}}

The reference model \pi_{\mathrm{ref}} keeps the policy anchored to the SFT model.

D_{\mathrm{KL}}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right) measures how far the new policy moves from that reference on prompt x.

\beta controls the tradeoff between improving behavior and staying close to what the model already knows.

What if we optimize this more directly?

\max_{\pi} \; \mathbb{E}_{x \sim D,\; y \sim \pi(\cdot \mid x)} r_\phi(x, y) - \beta D_{\mathrm{KL}}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right)

Direct Preference Optimization (DPO)

  • Derived from the closed-form optimal solution, \pi^*, of the objective above
  • Eliminated the need for a separate reward model (via training an implicit one)
  • Train directly on preferred (y_w) vs. rejected (y_l) responses to a prompt (x)
\mathcal{L}_{\mathrm{DPO}}(\theta) = - \log \sigma \!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right)

What if we optimize this more directly?

\max_{\pi} \; \mathbb{E}_{x \sim D,\; y \sim \pi(\cdot \mid x)} r_\phi(x, y) - \beta D_{\mathrm{KL}}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right)

Direct Preference Optimization (DPO)

  • Derived from the closed-form optimal solution, \pi^*, of the objective above
  • Eliminated the need for a separate reward model (via training an implicit one)
  • Train directly on preferred (y_w) vs. rejected (y_l) responses to a prompt (x)
\mathcal{L}_{\mathrm{DPO}}(\theta) = - \log \sigma \!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right)
DPO became very popular because it is:
  • Far simpler to implement
  • Far cheaper to run
  • Achieves ~80% or more of the final performance
  • I used it to build models like Zephyr-Beta, Tülu 2/3, Olmo 2/3, etc.
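The DPO loss \mathcal{L}_{\mathrm{DPO}} needs only sequence log-probs under the policy and the frozen reference, with no sampling and no separate reward model. A minimal sketch:

```python
import math

def dpo_loss(logp_pi_w, logp_ref_w, logp_pi_l, logp_ref_l, beta=0.1):
    """DPO loss for one preference pair (y_w preferred over y_l):
    -log sigma(beta * [log pi/pi_ref on y_w - log pi/pi_ref on y_l]).

    Inputs are sequence log-probabilities under the trained policy (pi)
    and the frozen reference model (pi_ref).
    """
    margin = beta * ((logp_pi_w - logp_ref_w) - (logp_pi_l - logp_ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The bracketed term is the implicit reward difference: DPO is the reward-model loss applied to log-probability ratios, which is why no explicit reward model is trained.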

How training recipes have evolved

                   InstructGPT (2022)   Tülu 3 (2024)   DeepSeek R1 (2025)
Instruction data   ~10K                 ~1M             100K+
Preference data    ~100K                ~1M             On-policy
RL stage           ~100K prompts        ~10K (RLVR)     N/A

An overall trend is to use far more compute across all the stages, while shifting more of it to RLVR.

The early days: InstructGPT

Early on, RLHF had a well-documented and relatively simple recipe.

  • InstructGPT made the classic three-stage recipe canonical: SFT, reward modeling, then RL against the reward model. OpenAI even hinted that the original ChatGPT used this!
  • This became the intellectual template for much of modern post-training.

From RLHF to post-training

What began as an “RLHF” recipe evolved into a complex series of steps to get the final, best model (e.g. Nemotron 4 340B, Llama 3.1).

  • Modern systems keep the same core idea of using multiple optimizers with different strengths and weaknesses, but add more stages, more data, and more filtering.
  • This trend has only continued, and recipes ebb and flow, as tools like RLVR and model merging change the scope of what is doable in different ways.

From RLHF to “post-training”

As time has passed since ChatGPT, the field has gone through multiple distinct phases (roughly):

  1. 2023: Simple SFT for better chatbots and reproducing RLHF fundamentals (Alpaca, Vicuna, etc.)
  2. 2024: DPO dominates open models and training stages expand (Zephyr-beta, Tülu 2, etc.)
  3. 2025: RLVR, complex recipes (Tülu 3, Olmo 3, Nemotron 3, R1, etc.)
  4. 2026: Agentic training, multi-turn RL, etc.

From RLHF to “post-training”

As time has passed since ChatGPT, the field has gone through multiple distinct phases (roughly):

  1. 2023: Simple SFT for better chatbots and reproducing RLHF fundamentals (Alpaca, Vicuna, etc.)
  2. 2024: DPO dominates open models and training stages expand (Zephyr-beta, Tülu 2, etc.)
  3. 2025: RLVR, complex recipes (Tülu 3, Olmo 3, Nemotron 3, R1, etc.)
  4. 2026: Agentic training, multi-turn RL, etc.

Within 2024 the field shifted its focus to post-training: training stages evolved beyond the InstructGPT-style recipe, DPO proliferated, and RLHF was largely viewed as just one tool (one you may not even need).

An intuition for post-training

RLHF’s reputation was that its contributions to the final language model are minor.

“A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.”

LIMA: Less Is More for Alignment (2023)

An intuition for post-training

RLHF’s reputation was that its contributions to the final language model are minor.

“A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.”

LIMA: Less Is More for Alignment (2023)

Sometimes this view of alignment (or RLHF) teaching “format” made people think that post-training only made minor changes to the model. This would describe finetuning as “just style transfer.”

The base model trained on trillions of tokens of web text has seen and learned from an extremely broad set of examples. The model at this stage contains far more latent capability than early post-training recipes were able to expose.

The question is: How does post-training interact with these?

An intuition for post-training

RLHF’s reputation was that its contributions to the final language model are minor.

An example is OLMoE: the same base model family, with only the post-training updated.

Base models determine the ceiling. Post-training’s job has been to reach it.

An intuition for post-training

RLHF’s reputation was that its contributions to the final language model are minor.

“A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users.”

LIMA: Less Is More for Alignment (2023)

“The superficial alignment hypothesis (SAH) posits that large language models learn most of their knowledge during pre-training, and that post-training merely surfaces this knowledge.”

Operationalising the Superficial Alignment Hypothesis via Task Complexity (2026)

The second paper, 3 years later, matches my intuition for post-training.

I call this the Elicitation Theory of post-training, where we're trying to pull out the most useful knowledge of the model.

Beyond elicitation: The scaling RL era of post-training

OpenAI’s seminal scaling plot with o1-preview

o1: Test-time scaling

A log-linear relationship between inference compute (number of tokens generated) and downstream performance.

  • This is a fundamental property of models, unlocked in its popular form with RLVR
  • Can be done in many ways: One long chain of thought (CoT) sequence, multiple agents in parallel, or mixes of the two
  • Improving inference-time scaling changes the slope and offset of the curve
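The claimed log-linear relationship can be checked by fitting score = a + b * log(compute) to measured points; a self-contained least-squares sketch (the data you would pass in comes from your own evaluations):

```python
import math

def fit_log_linear(compute, scores):
    """Least-squares fit of score = a + b * log(compute), the shape of
    reported test-time scaling curves. Returns the intercept a and slope b."""
    xs = [math.log(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    # Ordinary least squares on the log-transformed x-axis.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b
```

In the slide's framing, better inference-time scaling methods change b (the slope) and a (the offset) of this fitted line.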

o1: Training-time scaling (with reinforcement learning!)

An often underplayed portion of the o1 release (and future reasoning/agentic models).

  • Scaling reinforcement learning compute also has a log-linear return on performance!
  • The core question: Is scaling RL training just eliciting more from the base model or actually teaching new abilities?

This results in a two-sided scaling landscape for training language models, spanning both pretraining and post-training. The third axis of scaling is inference (where no weight updates occur).

Cursor Composer 1.5: RL scaling

DeepSeek-R1-Zero: RL scaling

Olmo 3.1: extending the RL run

One of the few “fully open” large-scale RL runs to date.

  • Training a general, 32B reasoning model.
  • Full RL training took about 28 days on 224 GPUs.
  • Performance improved very consistently across the run; in fact, scores were still climbing when we had to stop it!

Where this leaves us

Post-training and RLHF are changing faster than maybe ever before.

  • Language models are becoming “tool-use native”: the product is now tools, harnesses (how you tell the model to use those tools), and much more than just weights
  • RLHF and human preferences haven’t gone away, but are evolving far more slowly and out of the central gaze of the industry
  • Building language models and doing research is changing rapidly with coding agents

This talk is ~lecture 1 of a larger course

Introductions (this talk)
  1. Introduction
  2. Key Related Works
  3. Training Overview
Core Training Pipeline
  1. Instruction Tuning
  2. Reward Models
  3. Reinforcement Learning
  4. Reasoning
  5. Direct Alignment
  6. Rejection Sampling
Data & Preferences
  1. What are Preferences
  2. Preference Data
  3. Synthetic Data & CAI
Practical Considerations
  1. Tool Use
  2. Over-optimization
  3. Regularization
  4. Evaluation
  5. Product & Character
Appendices
  • A. Definitions
  • B. Style & Information
  • C. Practical Issues

Full lecture slides coming to rlhfbook.com/course and YouTube @natolambert!

Thank you

Sorry I could not make it in person!

Contact: nathan@natolambert.com

Newsletter: interconnects.ai

rlhfbook.com

References (1/4)

Anthropic. “Claude Code.” 2025. [link]
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., et al.. “Constitutional ai: Harmlessness from ai feedback.” arXiv preprint arXiv:2212.08073, 2022.
Bender, E., Gebru, T., McMillan-Major, A., and Shmitchell, S.. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?.” Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021. [link]
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., et al.. “Language Models are Few-Shot Learners.” Advances in Neural Information Processing Systems, 2020. [link]
Christiano, P., Leike, J., Brown, T., Martic, M., Legg, S., et al.. “Deep Reinforcement Learning from Human Preferences.” Advances in Neural Information Processing Systems, 2017.
Devlin, J., Chang, M., Lee, K., and Toutanova, K.. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805, 2018. [link]
Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., et al.. “The llama 3 herd of models.” arXiv preprint arXiv:2407.21783, 2024.
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., et al.. “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.” arXiv preprint arXiv:2501.12948, 2025.

References (2/4)

Kaplan, J., McCandlish, S., Henighan, T., Brown, T., Chess, B., et al. “Scaling Laws for Neural Language Models.” arXiv preprint arXiv:2001.08361, 2020. [link]
Lambert, N., Morrison, J., Pyatkin, V., Huang, S., Ivison, H., et al. “Tülu 3: Pushing Frontiers in Open Language Model Post-Training.” arXiv preprint arXiv:2411.15124, 2024.
Muennighoff, N., Soldaini, L., Groeneveld, D., Lo, K., Morrison, J., et al. “OLMoE: Open Mixture-of-Experts Language Models.” International Conference on Learning Representations, 2025. [link]
Olmo Team, Ettinger, A., Bertsch, A., Kuehl, B., Graham, D., et al. “Olmo 3.” 2025. [link]
OpenAI. “ChatGPT: Optimizing Language Models for Dialogue.” OpenAI Blog, 2022. [link]
OpenAI. “GPT-4 Technical Report.” arXiv preprint arXiv:2303.08774, 2023. [link]
OpenAI. “Introducing OpenAI o1-preview.” OpenAI Blog, 2024. [link]
OpenAI. “Introducing OpenAI o3 and o4-mini.” OpenAI Blog, 2025. [link]
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., et al. “Training Language Models to Follow Instructions with Human Feedback.” Advances in Neural Information Processing Systems, 2022.
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., et al. “Deep Contextualized Word Representations.” Proceedings of NAACL-HLT, 2018. [link]

References (3/4)

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. “Improving Language Understanding by Generative Pre-Training.” 2018. [link]
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., et al. “Language Models are Unsupervised Multitask Learners.” 2019. [link]
Rafailov, R., Sharma, A., Mitchell, E., Manning, C., Ermon, S., et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” Advances in Neural Information Processing Systems, 2023.
Sutton, R., and Barto, A. “Reinforcement Learning: An Introduction.” MIT Press, 2018. [link]
Ai2 Team. “OLMoE, meet iOS.” Ai2 Blog, 2025. [link]
OpenAI. “New Tools for Building Agents.” OpenAI Blog, 2025. [link]
Cursor Team. “Introducing Composer 1.5.” Cursor Blog, 2026. [link]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 2017. [link]
Vergara-Browne, T., Patil, D., Titov, I., Reddy, S., Pimentel, T., et al. “Operationalising the Superficial Alignment Hypothesis via Task Complexity.” arXiv preprint arXiv:2602.15829, 2026. [link]

References (4/4)

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., et al. “LIMA: Less Is More for Alignment.” Advances in Neural Information Processing Systems, 2023. [link]
Ziegler, D., Stiennon, N., Wu, J., Brown, T., Radford, A., et al. “Fine-tuning Language Models from Human Preferences.” arXiv preprint arXiv:1909.08593, 2019.