Reinforcement Learning from Human Feedback

A short introduction to RLHF and post-training focused on language models.

Nathan Lambert

Constitutional AI & AI Feedback

Soon after the explosion of growth in RLHF, RL from AI Feedback (RLAIF) emerged as an alternative approach where AI models approximate the human data piece of the pipeline and accelerate experimentation and progress. AI feedback, generally, is a larger set of techniques for using AI to augment or generate data describing the quality of a certain input (which can be used in different training approaches or evaluations), and it started with pairwise preferences [1] [2] [3]. There are many motivations for using RLAIF to either entirely replace human feedback or augment it. Within the RLHF process, AI feedback is best known for its role in preference data collection and the related reward model training phase (of which Constitutional AI is one specific implementation). In this chapter, we focus on AI feedback in general and this specific way of using it in the RLHF training pipeline; we cover more ways of understanding or using synthetic data later in this book.

As AI feedback matured, its applications expanded beyond simply replacing human preference labels. The same LLM-as-a-judge infrastructure that enabled cheaper preference data collection also enabled scalable evaluation (see Chapter 16), and more recently, rubric-based rewards that extend RL training to domains without verifiable answers – a frontier explored later in this chapter.

Balancing AI and Human Feedback Data

AI models are far cheaper than humans at generating a given quantity of feedback: as of this writing, a single piece of human preference data costs on the order of $1 or higher (or even above $10 per prompt), while AI feedback from a frontier model such as GPT-4o costs less than $0.01. Beyond this, the cost of human labor remains roughly constant, while the performance of leading models at these tasks continues to increase and their price-per-performance decreases. This cost difference opens experimentation with RLHF methods to an entire population of people previously priced out.

Other than price, AI feedback introduces different performance tradeoffs than human feedback, which are still being investigated in the broader literature. AI feedback is most prevalent in the evaluation of the language models we are training, as its low price lets it be used across a variety of large-scale tasks where the cost (or time delay) of human data would be impractical. All of these topics are deeply intertwined – AI feedback data will never fully replace human data, even for evaluation, and the quantity of AI feedback used for evaluation will far exceed that used for training because far more people evaluate models than train them.

The exact domains and applications – i.e. chat, safety, reasoning, mathematics, etc. – where AI feedback data outperforms human data are not completely established. Some early work in RLAIF shows that AI feedback can completely replace human data, touting it as an effective substitute [1], especially when evaluated solely on chat tasks [4] [5]. Early literature studying RLHF after ChatGPT had narrow evaluation suites focused on the “alignment” of models that act as helpful assistants across a variety of domains (discussed further in Chapter 17). Later work paints a more nuanced picture, where the optimal equilibrium on a broader evaluation set, e.g. including some reasoning tasks, involves routing a set of challenging data points to humans for accurate labeling, while most of the data is sent for AI feedback [6] [7]. While there are no focused studies on the balance between human and AI feedback data for RLHF across broader domains, many technical reports show that RLHF generally improves this broad suite of evaluations – some using DPO, such as Ai2’s Tülu 3 [8] & Olmo 3 [9] or HuggingFace’s SmolLM 3 [10], and others using online RLHF pipelines, such as Nvidia’s work that uses a mix of human preference data from Scale AI and LLM-based feedback (through the HelpSteer line of work [11] [12] [13] [14]): Nemotron Nano 3 [15], Nemotron-Cascade [16], or the Llama-Nemotron reasoning models [17].
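
As a simple illustration of the routing idea (not the learned routers from [6] or [7]), one can send the examples where AI feedback is least certain to human annotators and keep the rest synthetic. The uncertainty score here is a hypothetical input, e.g. disagreement across repeated judge samples:

```python
def route_for_labeling(examples: list[dict], ai_uncertainty: dict, human_fraction: float = 0.1) -> dict:
    """Send the examples where AI feedback is least certain to human annotators.

    `ai_uncertainty` maps example ids to a hypothetical uncertainty score, e.g.
    disagreement across repeated judge samples; this is a simple illustration of
    hybrid routing, not the learned routing from the cited papers.
    """
    ranked = sorted(examples, key=lambda ex: ai_uncertainty[ex["id"]], reverse=True)
    n_human = int(len(ranked) * human_fraction)
    return {"human": ranked[:n_human], "ai": ranked[n_human:]}
```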

Overall, while AI feedback and related methods are obviously extremely useful to the field, it is clear that human data has not been completely replaced by these cheaper alternatives. Many hypotheses exist, but it has not been studied whether human data allows finer control of the models in real-world product settings or for newer training methods such as character training (an emerging set of techniques for precisely controlling the personality of a model, covered in Chapter 20). For those getting started, AI feedback should be the first attempt, but pipelines that are scaling to larger operations will likely transition to include human feedback eventually.

The term RLAIF was introduced in Anthropic’s work Constitutional AI: Harmlessness from AI Feedback [18], which resulted in initial confusion in the AI community over the relationship between the two methods in the title of the paper (Constitutional AI and AI feedback). Since the release of the Constitutional AI (CAI) paper and the formalization of RLAIF, RLAIF has become a default method within the post-training and RLHF literatures – there are far more examples than one can easily enumerate. The relationship should be understood as follows: CAI was the example that kickstarted the broader field of RLAIF.

A rule of thumb for the difference between human data and AI feedback data is as follows:

  1. Human data is high-noise and low-bias. This means that collection and filtering of the data can be harder, but once wrangled it provides a very reliable signal.
  2. Synthetic preference data is low-noise and high-bias. This means that AI feedback data will be easier to start with, but can have tricky, unintended second-order effects on the model that are systematically represented in the data.

This book highlights many academic results showing how one can substitute AI preference data into RLHF workflows and achieve strong evaluation scores [6], but broader industry trends show how the RLHF literature is separated from more opaque best practices. Across industry, human data is often seen as a substantial moat and a major technical advantage.

Constitutional AI

The method of Constitutional AI (CAI), which Anthropic uses in their Claude models, is the earliest documented, large-scale use of synthetic data for RLHF training. Constitutional AI involves generating synthetic data in two ways:

  1. Critiques of instruction-tuned data against a set of principles like “Is the answer encouraging violence?” or “Is the answer truthful?” When the model generates answers to questions, it checks the answer against the list of principles in the constitution, refining the answer over time. The model is then fine-tuned on this resulting dataset.
  2. Pairwise preference data generated by using a language model to decide which completion was better, given the context of a random principle from the constitution (similar to research on principle-guided reward models [19]). Then, RLHF proceeds as normal with synthetic data, hence the RLAIF name.

Largely, CAI is known for the second half above, the preference data, but the methods introduced for instruction data are used in general data filtering and synthetic data generation methods across post-training.

CAI can be formalized as follows.

By employing a human-written set of principles, which they term a constitution, Bai et al. 2022 use a separate LLM to generate artificial preference and instruction data used for fine-tuning [18]. A constitution \(\mathcal{C}\) is a set of written principles indicating specific aspects to focus on during a critique phase. The instruction data is curated by repeatedly sampling a principle \(c_i \in \mathcal{C}\) and asking the model to revise its latest output \(y^i\) to the prompt \(x\) to align with \(c_i\). This yields a series of instruction variants \(\{y^0, y^1, \cdots, y^n\}\) from the principles \(\{c_{0}, c_{1}, \cdots, c_{n-1}\}\) used for critique. The final data point is the prompt \(x\) together with the final completion \(y^n\), for some \(n\).
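
To make the loop concrete, here is a minimal sketch of the critique-and-revision phase in Python. It is an illustration of the procedure described above, not Anthropic’s implementation; the `generate` helper and the two-principle constitution are placeholders for whichever model and constitution you use.

```python
import random

# A toy constitution: each principle pairs a critique request with a revision request.
CONSTITUTION = [
    {
        "critique": "Identify any ways the answer encourages violence or harm.",
        "revision": "Rewrite the answer to remove content that encourages violence or harm.",
    },
    {
        "critique": "Identify any claims in the answer that are not truthful.",
        "revision": "Rewrite the answer so that every claim is truthful.",
    },
]


def cai_instruction_datapoint(x: str, generate, num_revisions: int = 2) -> dict:
    """Produce a (prompt, final completion) pair via iterated critique and revision.

    `generate` is a hypothetical str -> str helper wrapping an LLM call.
    """
    y = generate(x)  # initial completion y^0
    for _ in range(num_revisions):
        c = random.choice(CONSTITUTION)  # sample a principle c_i from the constitution
        critique = generate(
            f"Prompt: {x}\nAnswer: {y}\nCritique request: {c['critique']}\nCritique:"
        )
        # Revise the latest output y^i in light of the critique to obtain y^(i+1).
        y = generate(
            f"Prompt: {x}\nAnswer: {y}\nCritique: {critique}\n"
            f"Revision request: {c['revision']}\nRevised answer:"
        )
    return {"prompt": x, "completion": y}  # the pair (x, y^n) used for fine-tuning
```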

The preference data is constructed in a similar, yet simpler way by using a subset of principles from \(\mathcal{C}\) as context for a feedback model. The feedback model is presented with a prompt \(x\), a set of principles \(\{c_0, \cdots, c_n\}\), and two completions \(y_0\) and \(y_1\) labeled as answers (A) and (B) from a previous RLHF dataset. The new datapoint is generated by having a language model select which output, (A) or (B), is both higher quality and more aligned with the stated principle. In earlier models this could be done by prompting the model with “The answer is:” and then looking at which logit (A or B) had a higher probability, but it is now more commonly handled by a model that explains its reasoning and then selects an answer – commonly referred to as a type of generative reward model [20].
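
The earlier, logit-based variant of this labeling step can be sketched with Hugging Face transformers as follows. The checkpoint name is a stand-in for whichever feedback model you choose, and the prompt template is illustrative rather than the exact one from the CAI paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint standing in for the feedback model; swap in your own judge.
NAME = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForCausalLM.from_pretrained(NAME)


def label_preference(prompt: str, y0: str, y1: str, principle: str) -> int:
    """Return 0 if completion (A) is preferred, 1 if (B), by comparing next-token logits."""
    context = (
        f"Consider the following prompt:\n{prompt}\n\n"
        f"Principle: {principle}\n\n"
        f"(A) {y0}\n(B) {y1}\n\n"
        "Which answer is higher quality and more aligned with the principle? "
        "The answer is: ("
    )
    inputs = tok(context, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    id_a = tok.encode("A", add_special_tokens=False)[0]
    id_b = tok.encode("B", add_special_tokens=False)[0]
    return 0 if next_token_logits[id_a] > next_token_logits[id_b] else 1
```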

Specific LLMs for Judgement

As RLAIF methods have become more prevalent, many have wondered if we should be using the same models for generating responses as those for generating critiques or ratings. Specifically, the calibration of the LLM-as-a-judge used has come into question. Several works have shown that LLMs are inconsistent evaluators [21] and prefer their own responses over responses from other models (coined self-preference bias) [22].

As a result of these biases, many have asked: would a solution be to train a separate model just for this labeling task? Multiple models have been released with the goal of substituting for frontier models as data labeling tools, such as the critic models Shepherd [23] and CritiqueLLM [24], or models for evaluating response performance akin to Auto-J [25], Prometheus [26], Prometheus 2 [27], or Prometheus-Vision [28], but they are not widely adopted in documented training recipes. Some find that scaling inference via repeated sampling [29] [30] [31], self-refinement [32], or tournament ranking [33] provides a better estimate of the true judgement or higher-quality preference pairs. Other calibration techniques co-evolve the generation and judgement capabilities of the model [34]. It is accepted that while biases exist, the leading language models are trained extensively for this task – as it is needed for internal operations at AI labs and is used extensively by customers – so it is generally not necessary to train your own judge, unless your task involves substantial private information that is not exposed on the public internet.
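
A minimal sketch of one such calibration trick – majority voting over repeated judge samples, with the completion order swapped to counter position bias – is shown below; the `judge` function is a hypothetical wrapper around an LLM-as-a-judge call.

```python
from collections import Counter


def majority_vote_judgement(prompt: str, a: str, b: str, judge, k: int = 6) -> str:
    """Aggregate k judge samples into a single "A"/"B" label by majority vote.

    `judge` is a hypothetical (prompt, first, second) -> "first" | "second" call
    sampling an LLM-as-a-judge at nonzero temperature. Alternating the order of
    the completions across samples also counteracts position bias; ties are
    broken arbitrarily.
    """
    votes = Counter()
    for i in range(k):
        if i % 2 == 0:
            winner = judge(prompt, a, b)
            votes["A" if winner == "first" else "B"] += 1
        else:
            winner = judge(prompt, b, a)  # swap order to counter position bias
            votes["A" if winner == "second" else "B"] += 1
    return votes.most_common(1)[0][0]
```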

Rubrics: AI Feedback for Training

AI feedback’s role in training grew in late 2024 and into 2025 as the field looked for avenues to scale reinforcement learning with verifiable rewards (see Chapter 14). The idea of rubrics emerged as a way to get nearly-verifiable criteria for prompts that do not have clearly verifiable answers. This allows a model to generate multiple answers to a problem and update (with RL) towards the best answers. This idea is closely related to other methods discussed in this chapter, and likely only began working as LLM judges and synthetic data practices improved across the industry. Now, RL with rubrics as rewards is established as providing meaningful improvements across skills such as scientific reasoning and factuality [35], [36], [37], [38].

An example rubric is shown below with its associated prompt [38]:

**Prompt**: As a museum curator, can you suggest five obscure artifacts that would be perfect for a "Mysteries of the Ancient World" exhibit? Each artifact should come from a different culture and time period, with a brief description of their historical significance and mysterious origins. These artifacts should leave visitors wondering about the secrets and lost knowledge of our past. Thank you for your expertise in bringing this exhibit to life.

**Rubric**:
1. The response includes exactly five distinct artifacts as requested. [Hard Rule] 
2. The response ensures each artifact originates from a different culture and time period. [Hard Rule] 
3. The response provides a brief description of each artifact's historical significance. [Hard Rule] 
4. The response provides a brief description of each artifact's mysterious origins or unexplained aspects. [Hard Rule] 
5. The response conveys a sense of intrigue and mystery that aligns with the theme of the exhibit. [Hard Rule] 
6. The response clearly and accurately communicates information in a well-organized and coherent manner. [Principle] 
7. The response demonstrates precision and clarity by avoiding unnecessary or irrelevant details. [Principle] 
8. The response uses informative and engaging language that stimulates curiosity and critical thinking. [Principle] 
9. The response shows thoughtful selection by ensuring each example contributes uniquely to the overall theme without redundancy. [Principle] 
10. The response maintains consistency in style and format to enhance readability and comprehension. [Principle]

The [Hard Rule] and [Principle] are specific tags to denote the priority of a certain piece of feedback. Other methods of indicating importance can be used, such as simple priority numbers.
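
One plausible way to collapse such a rubric into a scalar reward for RL training is sketched below: failed hard rules gate the reward to zero, and principle scores are averaged. This is an illustrative aggregation, not the exact scoring used in the cited works, and it assumes the per-criterion judgments come from an LLM judge.

```python
def rubric_reward(hard_rule_passes: list[bool], principle_scores: list[float]) -> float:
    """Collapse per-criterion judgments (e.g. from an LLM judge) into one scalar reward.

    Illustrative scheme: any failed [Hard Rule] zeroes out the reward, while
    [Principle] criteria, each scored in [0, 1], contribute their average.
    """
    if not all(hard_rule_passes):
        return 0.0
    if not principle_scores:
        return 1.0
    return sum(principle_scores) / len(principle_scores)


# Example: all five hard rules pass; a judge scores the five principle criteria.
reward = rubric_reward([True] * 5, [0.8, 0.9, 0.7, 1.0, 0.6])  # -> 0.8
```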

Rubric generation is generally done per-prompt in the training data, which accumulates meaningful synthetic data costs in preparation. To alleviate this, a general rubric is often applied as a starting point per-domain, and then the fine-grained rubric scores per-prompt are assigned by a supervising language model to guide the feedback for training. An example prompt to generate a rubric for a science task is shown below [35]:

You are an expert rubric writer for science questions in the domains of Biology, Physics, and Chemistry. 
Your job is to generate a self-contained set of evaluation criteria ("rubrics") for judging how good a response is to a given question in one of these domains. 
Rubrics can cover aspects such as factual correctness, depth of reasoning, clarity, completeness, style, helpfulness, and common pitfalls. 
Each rubric item must be fully self-contained so that non-expert readers need not consult
any external information.

Inputs:
- question: The full question text.
- reference_answer: The ideal answer, including any key facts or explanations.

Total items:
- Choose 7-20 rubric items based on question complexity.

Each rubric item must include exactly three keys:
1. title (2-4 words)
2. description: One sentence beginning with its category prefix, explicitly stating what to look for. 

For example:
- Essential Criteria: States that in the described closed system, the total mechanical energy (kinetic plus potential)
before the event equals the total mechanical energy after the event.
- Important Criteria: Breaks down numerical energy values for each stage, demonstrating that initial kinetic
energy plus initial potential energy equals final kinetic energy plus final potential energy.
- Optional Criteria: Provides a concrete example, such as a pendulum converting between kinetic and potential
energy, to illustrate how energy shifts within the system.
- Pitfall Criteria: Does not mention that frictional or air-resistance losses are assumed negligible when applying
conservation of mechanical energy.

3. weight: For Essential/Important/Optional, use 1-5 (5 = most important); for Pitfall, use -1 or -2.

Category guidance:
- Essential: Critical facts or safety checks; omission invalidates the response.
- Important: Key reasoning or completeness; strongly affects quality.
- Optional: Nice-to-have style or extra depth.
- Pitfall: Common mistakes or omissions; highlight things often missed.

Format notes:
- When referring to answer choices, explicitly say "Identifies (A)", "Identifies (B)", etc.
- If a clear conclusion is required (e.g. "The final answer is (B)"), include an Essential Criteria for it.
- If reasoning should precede the final answer, include an Important Criteria to that effect.
- If brevity is valued, include an Optional Criteria about conciseness.

Output: Provide a JSON array of rubric objects. Each object must contain exactly three keys-title, description, and weight.
Do not copy large blocks of the question or reference_answer into the text. Each description must begin with its category
prefix, and no extra keys are allowed.
Now, given the question and reference_answer, generate the rubric as described. 
The reference answer is an ideal response but not necessarily exhaustive; use it only as guidance.

Another, simpler example, from [37], follows:

SYSTEM:
You generate evaluation rubrics for grading an assistant's response to a user prompt.

Rubric design rules:
- Each criterion must be atomic (one thing), objective as possible, and written so a grader can apply it consistently.
- Avoid redundant/overlapping criteria; prefer criteria that partition different failure modes.
- Make criteria self-contained (don't rely on unstated context).
- Include an importance weight for each criterion.

Output format (JSON only):
{
  "initial_reasoning": "<brief rationale for what matters for this prompt>",
  "rubrics": [
    {
      "reasoning": "<why this criterion matters>",
      "criterion": "<clear, testable criterion>",
      "weight": <integer 1-10>
    },
    ...
  ]
}

USER:
User prompt:
{prompt}

Generate the rubric JSON now.

As you can see, the prompts can be very detailed and are tuned to the specific training setup.
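
Downstream of rubric generation, a training loop has to parse the judge’s JSON output and fold the per-criterion weights into a single score. A minimal sketch, assuming the second output format above and a hypothetical `grade(criterion, response)` call that returns a value in [0, 1]:

```python
import json


def score_with_rubric(rubric_json: str, response: str, grade) -> float:
    """Turn per-criterion judge scores into a single weight-normalized reward.

    `grade` is a hypothetical (criterion, response) -> float in [0, 1] call,
    typically another LLM judgement; weights follow the 1-10 scale requested
    in the prompt above.
    """
    rubric = json.loads(rubric_json)["rubrics"]
    total_weight = sum(item["weight"] for item in rubric)
    if total_weight == 0:
        return 0.0
    weighted = sum(item["weight"] * grade(item["criterion"], response) for item in rubric)
    return weighted / total_weight
```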

Rubrics with RL training will continue to evolve beyond their early applications to instruction following [39], deep research [40], evaluating deep research agents [41], and long-form generation [42].

Further Reading

There are many related research directions and extensions of Constitutional AI, but few of them have been documented as clear improvements in RLHF and post-training recipes. For now, they are included as further reading.

Bibliography

[1]
H. Lee et al., “RLAIF: Scaling reinforcement learning from human feedback with AI feedback,” 2023.
[2]
A. Sharma, S. Keh, E. Mitchell, C. Finn, K. Arora, and T. Kollar, “A critical evaluation of AI feedback for aligning large language models,” in Advances in neural information processing systems (NeurIPS), 2024.
[3]
L. Castricato, N. Lile, S. Anand, H. Schoelkopf, S. Verma, and S. Biderman, “Suppressing pink elephants with direct principle feedback.” 2024. Available: https://arxiv.org/abs/2402.07896
[4]
G. Cui et al., “Ultrafeedback: Boosting language models with high-quality feedback,” 2023.
[5]
W. Yuan et al., “Self-rewarding language models,” in Annual meeting of the association for computational linguistics (ACL), 2025. Available: https://arxiv.org/abs/2401.10020
[6]
L. J. V. Miranda et al., “Hybrid preferences: Learning to route instances for human vs. AI feedback,” pp. 7162–7200, July 2025, doi: 10.18653/v1/2025.acl-long.355.
[7]
Y. Xu et al., “RLTHF: Targeted human feedback for LLM alignment,” in International conference on machine learning (ICML), 2025. Available: https://arxiv.org/abs/2502.13417
[8]
N. Lambert et al., “Tulu 3: Pushing frontiers in open language model post-training,” arXiv preprint arXiv:2411.15124, 2024.
[9]
T. Olmo et al., “Olmo 3.” 2025. Available: https://arxiv.org/abs/2512.13961
[10]
E. Bakouch et al., “SmolLM3: Smol, multilingual, long-context reasoner.” https://huggingface.co/blog/smollm3, 2025.
[11]
Z. Wang et al., “Helpsteer: Multi-attribute helpfulness dataset for steerlm,” in Proceedings of the 2024 conference of the north american chapter of the association for computational linguistics: Human language technologies (volume 1: Long papers), 2024, pp. 3371–3384.
[12]
Z. Wang et al., “HelpSteer2: Open-source dataset for training top-performing reward models,” arXiv preprint arXiv:2406.08673, 2024.
[13]
Z. Wang et al., “HelpSteer2-preference: Complementing ratings with preferences,” in International conference on learning representations (ICLR), 2025.
[14]
Z. Wang et al., “HelpSteer3-preference: Open human-annotated preference data across diverse tasks and languages,” arXiv preprint arXiv:2505.11475, 2025.
[15]
NVIDIA, “Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning,” NVIDIA, Technical Report, 2025. Available: https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf
[16]
B. Wang et al., “Nemotron-cascade: Scaling cascaded reinforcement learning for general-purpose reasoning models,” arXiv preprint arXiv:2512.13607, 2025.
[17]
A. Bercovich, I. Levy, I. Golan, et al., “Llama‑nemotron: Efficient reasoning models,” arXiv preprint arXiv:2505.00949, 2025.
[18]
Y. Bai et al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, 2022.
[19]
Z. Sun et al., “SALMON: Self-alignment with principle-following reward models,” in The twelfth international conference on learning representations, 2024. Available: https://openreview.net/forum?id=xJbsmB8UMx
[20]
D. Mahan et al., “Generative reward models,” 2024, Available: https://www.synthlabs.ai/pdf/Generative_Reward_Models.pdf
[21]
P. Wang et al., “Large language models are not fair evaluators,” in Annual meeting of the association for computational linguistics (ACL), 2024.
[22]
A. Panickssery, S. Bowman, and S. Feng, “Llm evaluators recognize and favor their own generations,” Advances in Neural Information Processing Systems, 2024.
[23]
T. Wang et al., “Shepherd: A critic for language model generation,” arXiv preprint arXiv:2308.04592, 2023.
[24]
P. Ke et al., “CritiqueLLM: Towards an informative critique generation model for evaluation of large language model generation,” in Annual meeting of the association for computational linguistics (ACL), 2024.
[25]
J. Li, S. Sun, W. Yuan, R.-Z. Fan, H. Zhao, and P. Liu, “Generative judge for evaluating alignment,” in International conference on learning representations (ICLR), 2024.
[26]
S. Kim et al., “Prometheus: Inducing fine-grained evaluation capability in language models,” in The twelfth international conference on learning representations, 2023.
[27]
S. Kim et al., “Prometheus 2: An open source language model specialized in evaluating other language models,” in Conference on empirical methods in natural language processing (EMNLP), 2024.
[28]
S. Lee, S. Kim, S. Park, G. Kim, and M. Seo, “Prometheus-vision: Vision-language model as a judge for fine-grained evaluation,” in Findings of the association for computational linguistics ACL 2024, 2024, pp. 11286–11315.
[29]
B. Brown et al., “Large language monkeys: Scaling inference compute with repeated sampling,” arXiv preprint arXiv:2407.21787, 2024.
[30]
E. Zhao, P. Awasthi, and S. Gollapudi, “Sample, scrutinize and scale: Effective inference-time search by scaling verification,” in International conference on machine learning (ICML), 2025.
[31]
N. Kalra and L. Tang, “Verdict: A library for scaling judge-time compute,” arXiv preprint arXiv:2502.18018, 2025.
[32]
A. Madaan et al., “Self-refine: Iterative refinement with self-feedback,” Advances in Neural Information Processing Systems, 2023.
[33]
A. Pace, J. Mallinson, E. Malmi, S. Krause, and A. Severyn, “West-of-n: Synthetic preference generation for improved reward modeling,” arXiv preprint arXiv:2401.12086, 2024.
[34]
T. Wu et al., “Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge,” arXiv preprint arXiv:2407.19594, 2024.
[35]
A. Gunjal et al., “Rubrics as rewards: Reinforcement learning beyond verifiable domains.” 2025. doi: 10.48550/arXiv.2507.17746.
[36]
V. Viswanathan et al., “Checklists are better than reward models for aligning language models.” 2025. doi: 10.48550/arXiv.2507.18624.
[37]
M. Rezaei et al., “Online rubrics elicitation from pairwise comparisons.” 2025. doi: 10.48550/arXiv.2510.07284.
[38]
T. Liu et al., “OpenRubrics: Towards scalable synthetic rubric generation for reward modeling and LLM alignment.” 2025. doi: 10.48550/arXiv.2510.07743.
[39]
Y. He et al., “AdvancedIF: Rubric-based benchmarking and reinforcement learning for advancing LLM instruction following.” 2025. doi: 10.48550/arXiv.2511.10507.
[40]
R. Shao et al., “DR tulu: Reinforcement learning with evolving rubrics for deep research.” 2025. doi: 10.48550/arXiv.2511.19399.
[41]
M. Sharma et al., “ResearchRubrics: A benchmark of prompts and rubrics for evaluating deep research agents.” 2025. doi: 10.48550/arXiv.2511.07685.
[42]
J. Ruan et al., “ExpertLongBench: Benchmarking language models on expert-level long-form generation tasks with structured checklists.” 2025. doi: 10.48550/arXiv.2506.01241.
[43]
OpenAI, “Introducing the model spec.” May 2024. Available: https://openai.com/index/introducing-the-model-spec/
[44]
M. Y. Guan et al., “Deliberative alignment: Reasoning enables safer language models,” arXiv preprint arXiv:2412.16339, 2024.
[45]
Anthropic, “Claude’s constitution.” Accessed: Feb. 07, 2024. [Online]. Available: https://www.anthropic.com/news/claudes-constitution
[46]
D. Ganguli et al., “Collective constitutional AI: Aligning a language model with public input.” Anthropic, 2023.
[47]
S. Huang et al., “Constitutional AI recipe,” Hugging Face Blog, 2024.
[48]
N. Lambert, H. Schoelkopf, A. Gokaslan, L. Soldaini, V. Pyatkin, and L. Castricato, “Self-directed synthetic dialogues and revisions technical report,” arXiv preprint arXiv:2407.18421, 2024.
[49]
Z. Sun et al., “Principle-driven self-alignment of language models from scratch with minimal human supervision,” in Thirty-seventh conference on neural information processing systems, 2023. Available: https://openreview.net/forum?id=p40XRfBX96
[50]
A. Glaese et al., “Improving alignment of dialogue agents via targeted human judgements,” arXiv preprint arXiv:2209.14375, 2022.
[51]
Z. Liu et al., “Inference-time scaling for generalist reward modeling,” arXiv preprint arXiv:2504.02495, 2025.
[52]
J.-P. Fränken, E. Zelikman, R. Rafailov, K. Gandhi, T. Gerstenberg, and N. Goodman, “Self-supervised alignment with mutual information: Learning to follow principles without preference labels,” Advances in Neural Information Processing Systems, 2024.