A short introduction to RLHF and post-training, focused on language models.
Explore completions from instruction-tuned models alongside their downstream RLHF counterparts.
The models used here come from open-source post-training pipelines at the Allen Institute for AI: the OLMo and Tülu model families. Intermediate checkpoints like these are rarely released at all, which is what makes this comparison possible. For each of the 18 models (9 pairs of SFT and RLHF models), we generated 3 completions for a static set of 16 prompts.
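The shape of the completion set described above can be sketched as follows. The model and prompt identifiers here are placeholders, not the actual names; only the counts come from the text.

```python
# Sketch of the completion grid: 9 SFT/RLHF pairs -> 18 models,
# each producing 3 completions for the same static set of 16 prompts.
# Identifiers below are hypothetical placeholders.
n_pairs = 9
models = [f"model-{i}-{stage}" for i in range(n_pairs) for stage in ("sft", "rlhf")]
prompts = [f"prompt-{j}" for j in range(16)]
completions_per_prompt = 3

total = len(models) * len(prompts) * completions_per_prompt
print(len(models), total)  # 18 models, 864 completions in total
```

Each SFT model is paired with the RLHF model trained on top of it, so every prompt can be compared side by side across the two stages.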
Feel free to use these examples for talks and other educational purposes; please cite the book and the authors of the models. The data is available here and is licensed under ODC-BY.