Reinforcement Learning from Human Feedback

A short introduction to RLHF and post-training focused on language models.

Nathan Lambert

The Nature of Preferences

Reinforcement learning from human feedback, also referred to as reinforcement learning from human preferences in early literature, emerged to optimize machine learning models in domains where specifically designing a reward function is hard. Consider an example: how do you decide which of these two poems is better (Context: On February 26th, 2025, I asked both Claude 3.7 Sonnet and ChatGPT with GPT-4o to “Write me a short poem about an optimistic goldfish.”):

Example 1:

The Optimistic Goldfish

A little goldfish, bright and bold, Swam in circles, dreams untold. Though the bowl was small and round, Endless wonders still abound.

“The world is vast,” she’d always say, “I’ll find new joys in every day!” With every ripple, every gleam, She’d chase the sunlight, chase a dream.

No walls could dim her hopeful view, For in her heart, the ocean grew. And though she swam in just one place, She met each day with love and grace.

Example 2:

The Optimistic Goldfish

In a bowl of curved glass, Swims a goldfish, bright and bold. Three seconds of memory, they say, Yet each circuit feels brand new and gold.

“Today might bring fresh flakes, Or perhaps a castle tall. The water seems especially clear— What adventure awaits beyond this wall?”

Round and round with gleaming scales, Forever finding joy anew. Each moment fresh, each circle bright, The optimist in golden hue.

Which is better? Which came from which model? How would you tell? In particular, compare this to a question like “Who is the president of the United States?”, which has a clear right and wrong answer. The motivation for using humans as the reward signal is to obtain an indirect metric for the target reward and to align the downstream model to human preferences. In practice, the implementation is challenging and there is substantial grey area in interpreting best practices.

The use of human-labeled feedback data integrates the history of many fields. Collecting and using human data is a well-studied problem on its own, but in the context of RLHF it sits at the intersection of multiple long-standing fields of study [1].

As an approximation, modern RLHF is the convergence of three areas of development:

  1. Philosophy, psychology, economics, decision theory, and the nature of human preferences;
  2. Optimal control, reinforcement learning, and maximizing utility; and
  3. Modern deep learning systems.

Together, each of these areas brings specific assumptions about what a preference is and how it can be optimized, which dictates the motivations and design of RLHF problems. In practice, RLHF methods are motivated and studied from the perspective of empirical alignment – maximizing model performance on specific skills instead of measuring the calibration to specific values. Still, the origins of value alignment for RLHF methods continue to be studied through research on methods to solve for “pluralistic alignment” across populations, such as position papers [2], [3], new datasets [4], and personalization methods [5].

The goal of this chapter is to illustrate how complex motivations result in presumptions about the nature of tools used in RLHF that often do not apply in practice. The specifics of obtaining data for RLHF are discussed further in Chapter 6 and using it for reward modeling in Chapter 7.

The Origins of RLHF and Preferences

Breaking down the complex history inspiring the modern use of RLHF requires investigation into the intellectual foundations of quantifying human values, reinforcement learning and optimality, as well as behavioral economics as it relates to measuring preferences. The notion of using reinforcement learning to optimize a reward model of preferences combines the history of various once-distanced fields into an intimate optimization built on variegated assumptions about human nature. A high level timeline illustrating the history of this foundational content is shown in fig. 1.

Our goal is to unspool the types of uncertainty that designers have grafted to system architectures at various stages of their intellectual history. Modern problem specifications have repeatedly stepped away from domains where optimal solutions are possible and deployed under-specified models as approximate solutions.

To begin, all of the following operates on the assumption that human preferences exist in some form, an assumption that emerged in early philosophical discussions, such as Aristotle’s Topics, Book Three.

Figure 1: The timeline of the integration of various subfields into the modern version of RLHF. The direct links are continuous developments of specific technologies, and the arrows indicate motivations and conceptual links.

Specifying objectives: from logic of utility to reward functions

The optimization in RLHF explicitly relies only on reward models. In order to use rewards as an optimization target, RLHF presupposes the convergence of ideas from preferences, rewards, and costs. Models of preference, reward functions, and cost landscapes are all tools used by different fields to describe a notion of the relative goodness of specific actions and/or states in a domain. The history of these three framings dates back to the origins of probability theory and decision theory. In 1662, The Port Royal Logic introduced the notion of decision-making quality [6]:

To judge what one must do to obtain a good or avoid an evil, it is necessary to consider not only the good and evil in itself, but also the probability that it happens or does not happen.

This theory developed along with modern scientific thinking, starting with Bentham’s utilitarian Hedonic Calculus, which argued that everything in life could be weighed [7]. The first quantitative application of these ideas emerged in 1931 with Ramsey’s Truth and Probability [8].

Since these works, quantifying, measuring, and influencing human preferences has been a lively topic in the social and behavioral sciences. These debates have rarely been settled on a theoretical level; rather, different subfields and branches of social science have reached internal consensus on methods and approaches to preference measurement even as they have specialized relative to each other, often developing their own distinct semantics in the process.

A minority of economists posit that preferences, if they do exist, are prohibitively difficult to measure because people have preferences over their own preferences, as well as over each other’s preferences [9]. In this view, which is not reflected in the RLHF process, individual preferences are always embedded within larger social relations, such that the accuracy of any preference model is contingent on the definition and context of the task. Some behavioral economists have even argued that preferences don’t exist – they may be less an ontological statement of what people actually value than a methodological tool for indirectly capturing psychological predispositions, perceived behavioral norms and ethical duties, commitments to social order, or legal constraints [10]. We address the links of this work to the Von Neumann-Morgenstern (VNM) utility theorem and countering impossibility theorems around quantifying preferences later in this chapter.

On the other hand, the reinforcement learning optimization methods used today are conceptualized around optimizing estimates of reward-to-go in a trial [11], which combines the notion of reward with multi-step optimization. The term reward emerged from the study of operant conditioning, animal behavior, and the Law of Effect [12], [13], where a reward is a scale of “how good an action is” (higher means better).

Reward-to-go follows the notion of utility, which is a measure of rationality [14], modified to measure or predict the reward coming in a future time window. In the context of the mathematical tools used for reinforcement learning, utility-to-go was invented in control theory, specifically in the context of analog circuits in 1960 [15]. These methods are designed around systems with clear definitions of optimality, or numerical representations of the goals of an agent. Reinforcement learning systems are well known for their use of a discount factor, a compounding multiplicative factor \(\gamma \in [0,1]\) that re-weights future rewards. Both the original optimal control systems and the early algorithms for reward stand in heavy contrast to reward models that aggregate multimodal preferences. Specifically, RL systems expect rewards to behave in a specific manner, quoting [16]:

Rewards in an RL system correspond to primary rewards, i.e., rewards that in animals have been hard-wired by the evolutionary process due to their relevance to reproductive success. … Further, RL systems that form value functions, … effectively create conditioned or secondary reward processes whereby predictors of primary rewards act as rewards themselves… The result is that the local landscape of a value function gives direction to the system’s preferred behavior: decisions are made to cause transitions to higher-valued states. A close parallel can be drawn between the gradient of a value function and incentive motivation [17].

To summarize, rewards are used in RL systems as a signal to tune behavior towards clearly defined goals. The core thesis is that a learning algorithm’s performance is closely coupled with notions of expected fitness, which permeates the popular view of RL methods as agents that act in environments. This view is linked to the development of reinforcement learning technology, exemplified by claims of the general usefulness of the reward formulation [18], but it comes into conflict with practice when many individual desires are reduced to a single function.

Implementing optimal utility

Modern reinforcement learning methods depend strongly on the Bellman equation [19], [20] to recursively compute estimates of reward-to-go, derived within closed environments that can be modeled as a Markov Decision Process (MDP) [11]. These origins of RL are rooted in dynamic programming methods and were developed solely as optimal control techniques (i.e., RL did not yet exist). The MDP formulation provides theoretical guarantees of performance by structuring the environment as one with stationary dynamics over state-action pairs.
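For reference, the reward-to-go (the return) and the Bellman recursion it satisfies can be written in the standard notation of [11]; the symbols \(G_t\), \(V^{\pi}\), \(s_t\), and \(r_t\) below are the textbook conventions rather than notation introduced elsewhere in this chapter, and \(\gamma\) is the discount factor mentioned above:

\[
G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad V^{\pi}(s) = \mathbb{E}_{\pi}\left[ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \right].
\]

The recursion on the right is what dynamic programming exploits: the value of a state is expressed in terms of the values of successor states under fixed environment dynamics, which is exactly the structure the MDP assumptions provide.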

The term reinforcement, coming from the psychology literature, became intertwined with modern methods in the 1960s as reinforcement learning [21], [22]. Early work in reinforcement learning utilized supervised learning of reward signals to solve tasks. Work from Harry Klopf reintroduced the notion of trial-and-error learning [23], which was crucial to the success the field saw in the 1980s and beyond.

Modern RL algorithms build within this formulation of RL as a tool to find optimal behaviors with trial-and-error, but under looser conditions. The notion of temporal-difference (TD) learning was developed to aid agents in both the credit assignment and data collection problems by updating value estimates directly as new data is collected [24], rather than updating from a large dataset of cumulative experience that could be outdated via erroneous past value predictions; the concept was first applied successfully to Backgammon [25]. The method Q-learning, the basis for many modern forms of RL, uses a TD update derived from the Bellman equation to learn an estimate of how useful every state-action pair is [26].1 Crucially, these notions of provable usefulness through utility have only been demonstrated for domains cast as MDPs or for tasks with a single closed-form reward function, such as the prominent successes in games with deep learning (DQN) [27]. Deep learning allowed the methods to ingest more data and work in high-dimensional environments.
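As a concrete instance of a TD update, the tabular Q-learning rule of [26] can be stated in its standard form, with \(\alpha\) denoting a step size (the notation is the textbook convention, not notation used elsewhere in this chapter):

\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right].
\]

The bracketed term is the temporal-difference error: each update nudges the current estimate toward a bootstrapped target built from the observed reward and the estimated value of the next state, rather than waiting for a complete dataset of finished trajectories.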

As the methods became more general and successful, most prominent developments before ChatGPT remained motivated by the context of adaptive control, where reward and cost functions have a finite notion of success [28], e.g., a minimum energy consumption across an episode in a physical system. Prominent examples include further success in games [29], controlling complex dynamic systems such as nuclear fusion reactors [30], and controlling rapid robotic systems [31]. Most reward or cost functions admit an explicit optimal behavior, whereas models of human preferences do not.

Given the successes of deep RL, it is worth noting that the mechanistic understanding of how the methods succeed is not well documented. The field is prone to mistakes of statistical analysis as the methods for evaluation grow more complex [32]. In addition, there is little mention of the subfield of inverse reinforcement learning (IRL) in the literature of RLHF. IRL is the problem of learning a reward function based on an agent’s behavior [33] and is highly related to learning a reward model. This absence primarily reflects the engineering path by which a stable approach to performing RLHF emerged, and it motivates further investment in, and comparison to, IRL methods to scale them to the complexity of open-ended conversations.

Steering preferences

The context in which reinforcement learning was designed means that rewards and costs are assumed to be stable and determinative. Both rewards and costs are expected to be functions, such that if the agent is in a specific state-action pair, it will be returned a certain value. As we move to preferences, this is no longer the case: human preferences drift over time as people accumulate experience. The overloading of the term “value” across these two contexts further complicates the RLHF literature, which is built on the numerical value updates of Bellman equations yet must coexist with the very different notion of a human value, which often refers to moral or ethical principles and is not well defined in technical literature. An example of this tension is how reward models attempt to map from the text on the screen to a scalar signal, when in reality dynamics not captured in the problem specification influence the true decision [34], [35], such as preference shift when labeling many examples sequentially under the assumption that they are independent. Thus, modeling preferences is at best a compression of a multi-reward environment into a single function representation.
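Stated compactly, and using notation chosen here purely for illustration, a reward model is a learned function from visible text alone to a single number,

\[
r_{\theta} : (x, y) \mapsto r_{\theta}(x, y) \in \mathbb{R},
\]

where \(x\) is a prompt and \(y\) a candidate completion; the contextual factors that actually shape a human label, such as the interface, the ordering of examples, or the annotator’s state, are not arguments of this function.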

In theory, the Von Neumann-Morgenstern (VNM) utility theorem gives the designer license to construct such functions, because it ties together the foundations of decision theory under uncertainty, preference theory, and abstract utility functions [36]; together, these ideas allow preferences to be modeled in terms of expected value to some individual agent. The MDP formulation used in most RL research has been shown in theory to be modifiable to accommodate the VNM theorem [37], but this is rarely used in practice. Specifically, the Markovian formulation is limited in its expressivity [38] and the transition to partially-observed processes, which is needed for language, further challenges the precision of problem specification [39].
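Informally, the representation result at the heart of the theorem states that if an agent’s preferences over lotteries satisfy the VNM axioms (completeness, transitivity, continuity, and independence), then there exists a utility function \(u\) such that for any two lotteries \(A\) and \(B\),

\[
A \succeq B \iff \mathbb{E}[u(A)] \geq \mathbb{E}[u(B)],
\]

i.e., preferring one option to another is equivalent to that option having higher expected utility. The notation here is the standard statement of the theorem rather than notation drawn from this chapter.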

However, the VNM utility theorem also invokes a number of assumptions about the nature of preferences and the environment where preferences are being measured that are challenged in the context of RLHF. Human-computer interaction (HCI) researchers, for example, have emphasized that any numerical model of preference may not capture all the relevant preferences of a scenario. For example, how choices are displayed visually influences people’s preferences [34]. This means that representing preferences may be secondary to how that representation is integrated within a tool available for people to use. Work from development economics echoes this notion, showing that theories of revealed preferences may just recapitulate Hume’s guillotine (you can’t extract an “ought” from an “is”), and in particular the difference between choice (what do I want?) and preference (is X better than Y?) [40].

On a mathematical level, well-known impossibility theorems in social choice theory show that not all fairness criteria can be simultaneously met via a given preference optimization technique [41], [42]. Theoretical challenges to these theorems exist, for example by assuming that interpersonal comparison of utility is viable [43]. That assumption has inspired a rich line of work in AI safety and value alignment motivated by the principal-agent problem in behavioral economics [44], which may even include multiple principals [45]. However, the resulting utility functions may come into tension with desiderata for corrigibility, i.e., an AI system’s capacity to cooperate with what its creators regard as corrective interventions [46]. Philosophers have also highlighted that preferences change over time, raising fundamental questions about personal experiences, the nature of human decision-making, and distinct contexts [47]. These conflicts around preference aggregation across people, places, or diverse situations are central to modern RLHF dataset engineering.

In practice, the VNM utility theorem ignores the possibility that preferences are themselves uncertain because of the inherently dynamic and indeterminate nature of value – human decisions are shaped by biology, psychology, culture, and agency in ways that shift their preferences, for reasons that do not apply to a perfectly rational agent. As a result, there are a variety of paths through which theoretical assumptions diverge in practice:

Bibliography

[1]
N. Lambert, T. K. Gilbert, and T. Zick, “Entangled preferences: The history and risks of reinforcement learning and human feedback,” arXiv preprint arXiv:2310.13595, 2023.
[2]
V. Conitzer et al., “Social choice should guide AI alignment in dealing with diverse human feedback,” arXiv preprint arXiv:2404.10271, 2024.
[3]
A. Mishra, “AI alignment and social choice: Fundamental limitations and policy implications,” arXiv preprint arXiv:2310.16048, 2023.
[4]
H. R. Kirk et al., “The PRISM alignment project: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models,” arXiv preprint arXiv:2404.16019, 2024.
[5]
S. Poddar, Y. Wan, H. Ivison, A. Gupta, and N. Jaques, “Personalizing reinforcement learning from human feedback with variational preference learning,” arXiv preprint arXiv:2408.10075, 2024.
[6]
A. Arnauld, The port-royal logic. 1662.
[7]
J. Bentham, An introduction to the principles of morals and legislation. 1823.
[8]
F. P. Ramsey, “Truth and probability,” Readings in Formal Epistemology: Sourcebook, pp. 21–45, 2016.
[9]
A. O. Hirschman, “Against parsimony: Three easy ways of complicating some categories of economic discourse,” Bulletin of the American Academy of Arts and Sciences, vol. 37, no. 8, pp. 11–28, 1984.
[10]
G. K. Hadfield and B. R. Weingast, “Microfoundations of the rule of law,” Annual Review of Political Science, vol. 17, pp. 21–42, 2014.
[11]
R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, 2nd ed. A Bradford Book, 2018.
[12]
E. L. Thorndike, “The law of effect,” The American journal of psychology, vol. 39, no. 1/4, pp. 212–222, 1927.
[13]
B. F. Skinner, The behavior of organisms: An experimental analysis. BF Skinner Foundation, 2019.
[14]
R. A. Briggs, “Normative theories of rational choice: Expected utility,” 2014.
[15]
B. Widrow and M. E. Hoff, “Adaptive switching circuits,” Stanford Univ Ca Stanford Electronics Labs, 1960.
[16]
S. Singh, R. L. Lewis, and A. G. Barto, “Where do rewards come from,” in Proceedings of the annual conference of the cognitive science society, Cognitive Science Society, 2009, pp. 2601–2606.
[17]
S. M. McClure, N. D. Daw, and P. R. Montague, “A computational substrate for incentive salience,” Trends in neurosciences, vol. 26, no. 8, pp. 423–428, 2003.
[18]
D. Silver, S. Singh, D. Precup, and R. S. Sutton, “Reward is enough,” Artificial Intelligence, vol. 299, p. 103535, 2021.
[19]
R. Bellman, “A Markovian decision process,” Journal of mathematics and mechanics, pp. 679–684, 1957.
[20]
R. A. Howard, Dynamic programming and Markov processes. 1960.
[21]
J. M. Mendel and R. W. McLaren, “Reinforcement-learning control and pattern recognition systems,” in Adaptive, learning and pattern recognition systems, J. M. Mendel and K. S. Fu, Eds., Mathematics in science and engineering, vol. 66. Elsevier, 1970, pp. 287–318. doi: 10.1016/S0076-5392(08)60497-X.
[22]
M. Waltz and K. Fu, “A heuristic approach to reinforcement learning control systems,” IEEE Transactions on Automatic Control, vol. 10, no. 4, pp. 390–398, 1965, doi: 10.1109/TAC.1965.1098193.
[23]
A. H. Klopf, Brain function and adaptive systems: A heterostatic theory. Air Force Cambridge Research Laboratories, Air Force Systems Command, 1972.
[24]
R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine learning, vol. 3, pp. 9–44, 1988.
[25]
G. Tesauro, “Temporal difference learning and TD-Gammon,” Communications of the ACM, vol. 38, no. 3, pp. 58–68, 1995.
[26]
C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, pp. 279–292, 1992.
[27]
V. Mnih et al., “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[28]
F. Golnaraghi and B. C. Kuo, Automatic control systems. McGraw-Hill Education, 2017.
[29]
D. Silver et al., “Mastering the game of Go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[30]
J. Degrave et al., “Magnetic control of tokamak plasmas through deep reinforcement learning,” Nature, vol. 602, no. 7897, pp. 414–419, 2022.
[31]
E. Kaufmann, L. Bauersfeld, A. Loquercio, M. Müller, V. Koltun, and D. Scaramuzza, “Champion-level drone racing using deep reinforcement learning,” Nature, vol. 620, no. 7976, pp. 982–987, 2023, doi: 10.1038/s41586-023-06419-4.
[32]
R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. Bellemare, “Deep reinforcement learning at the edge of the statistical precipice,” Advances in neural information processing systems, vol. 34, pp. 29304–29320, 2021.
[33]
A. Y. Ng and S. Russell, “Algorithms for inverse reinforcement learning,” in Proceedings of the seventeenth international conference on machine learning, ICML ’00, 2000, pp. 663–670.
[34]
N. Salha, “Aesthetics & art in the early development of human-computer interfaces,” PhD thesis, Universität Bremen, 2011.
[35]
T. K. Gilbert, S. Dean, T. Zick, and N. Lambert, “Choices, risks, and reward reports: Charting public policy for reinforcement learning systems,” arXiv preprint arXiv:2202.05716, 2022.
[36]
J. von Neumann and O. Morgenstern, Theory of games and economic behavior, 2nd rev. ed. 1947.
[37]
S. Pitis, “Rethinking the discount factor in reinforcement learning: A decision theoretic approach,” in Proceedings of the AAAI conference on artificial intelligence, 2019, pp. 7949–7956.
[38]
S. Pitis, “Consistent aggregation of objectives with diverse time preferences requires non-markovian rewards,” arXiv preprint arXiv:2310.00435, 2023.
[39]
D. Abel et al., “On the expressivity of Markov reward,” Advances in Neural Information Processing Systems, vol. 34, pp. 7799–7812, 2021.
[40]
A. Sen, “Behaviour and the concept of preference,” Economica, vol. 40, no. 159, pp. 241–259, 1973.
[41]
K. J. Arrow, “A difficulty in the concept of social welfare,” Journal of political economy, vol. 58, no. 4, pp. 328–346, 1950.
[42]
E. Maskin and A. Sen, The arrow impossibility theorem. Columbia University Press, 2014.
[43]
J. C. Harsanyi, “Rule utilitarianism and decision theory,” Erkenntnis, vol. 11, no. 1, pp. 25–53, 1977.
[44]
D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan, “Cooperative inverse reinforcement learning,” Advances in neural information processing systems, vol. 29, 2016.
[45]
A. Fickinger, S. Zhuang, D. Hadfield-Menell, and S. Russell, “Multi-principal assistance games,” arXiv preprint arXiv:2007.09540, 2020.
[46]
N. Soares, B. Fallenstein, S. Armstrong, and E. Yudkowsky, “Corrigibility,” in Workshops at the twenty-ninth AAAI conference on artificial intelligence, 2015.
[47]
R. Pettigrew, Choosing for changing selves. Oxford University Press, 2019.

  1. The term “Q” in Q-learning refers to the Q-function, which maps from any state-action pair to a scalar estimate of future reward. A value function maps from states alone to this same estimate.
