Synthetic Data & Distillation
Reinforcement learning from human feedback is deeply rooted in the idea of keeping a human influence on the models we are building. When the first models were trained successfully with RLHF, human data was the only viable way to improve the models in this way.
Humans were the only way to create responses of high enough quality for training. Humans were also the only way to collect reliable, specific feedback data to train reward models.
As AI models got better, this assumption rapidly broke down. The possibility of synthetic data, which is far cheaper and easier to iterate on, enabled the shift from RLHF being the center of attention to a broader “post-training” process shaping the models. This chapter provides a cursory overview of how and why synthetic data is replacing or expanding many pieces of the RLHF pipeline.
One common criticism of synthetic data is model collapse – the idea that repeatedly training on a model’s own generations can progressively narrow the effective training distribution [1]. As diversity drops, rare facts and styles are underrepresented, and small mistakes can be amplified across iterations, leading to worse generalization. In practice, these failures are most associated with self-training on unfiltered, repetitive, single-model outputs; mixing in real/human data, using diverse teachers, deduplication, and strong quality filters largely avoids the collapse regime. For today’s frontier training pipelines, evidence suggests synthetic data can, and should, be used at scale without the catastrophic regressions implied by the strongest versions of the collapse story [2] [3].
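To make these mitigations concrete, a minimal sketch of a filtering step is below. The helper names, threshold, and mixing ratio are illustrative assumptions rather than a recipe from any particular pipeline; production systems use near-deduplication, reward models, and more careful data mixing.

```python
import hashlib

def dedupe_and_filter(synthetic, human, judge_score, min_score=7.0, human_fraction=0.3):
    """Toy mitigation for collapse-style failures: deduplicate synthetic samples,
    keep only those a judge scores highly, and mix human-written data back in.
    `judge_score` is any callable returning a quality score for an example."""
    seen, kept = set(), []
    for example in synthetic:
        # Exact-match dedup via hashing; real pipelines use near-dedup (e.g. MinHash).
        key = hashlib.sha256(example["completion"].encode()).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        if judge_score(example) >= min_score:
            kept.append(example)
    # Mix a fraction of human data back in to preserve distributional diversity.
    n_human = int(len(kept) * human_fraction)
    return kept + human[:n_human]
```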
The leading models need synthetic data to reach the best performance. Synthetic data in modern post-training encompasses many pieces of training – language models are used to generate new training prompts from seed examples [4], modify existing prompts, generate completions to prompts [5], provide AI feedback to create preference data [6], filter completions [7], and much more. Synthetic data is key to post-training.
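As one example of these roles, the sketch below generates new training prompts from a handful of seed prompts, in the spirit of Self-Instruct-style pipelines [4]. The model name, prompt template, and use of the OpenAI client are assumptions for illustration; any sufficiently capable chat API would work.

```python
import random
from openai import OpenAI  # illustrative choice of API client

client = OpenAI()

def generate_prompts(seed_prompts, n_new=10, model="gpt-4o-mini"):
    """Self-Instruct-style prompt expansion: show the model a few seed
    prompts and ask it to write a new, diverse instruction."""
    new_prompts = []
    for _ in range(n_new):
        sample = random.sample(seed_prompts, k=min(3, len(seed_prompts)))
        examples = "\n".join(f"- {p}" for p in sample)
        instruction = (
            "Here are some example instructions for a language model:\n"
            f"{examples}\n"
            "Write one new instruction that is different in topic and style. "
            "Return only the instruction."
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": instruction}],
        )
        new_prompts.append(response.choices[0].message.content.strip())
    return new_prompts
```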
The ability for synthetic data to be impactful to this extent emerged with GPT-4 class models. Earlier language models, such as Llama 2 and GPT-3.5-Turbo, were not reliable enough to generate data or supervise data pipelines. Within one to two years, language models became far better than typical human annotators at writing answers to most training prompts. In the transition from GPT-3.5 to GPT-4 class models, the ability to perform LLM-as-a-judge tasks also emerged: GPT-4 or better models are far more robust and consistent in generating feedback or scores for a given piece of content.
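A minimal LLM-as-a-judge call looks roughly like the following sketch; the rubric, scale, and model name are illustrative assumptions rather than a prescribed setup, and real systems add position-swapping, tie-breaking, and parsing safeguards.

```python
from openai import OpenAI  # illustrative; any capable chat model can serve as a judge

client = OpenAI()

JUDGE_TEMPLATE = (
    "You are grading an AI response.\n"
    "Prompt: {prompt}\n"
    "Response: {completion}\n"
    "Rate the response from 1 (poor) to 10 (excellent) for helpfulness "
    "and correctness. Reply with only the integer score."
)

def judge(prompt, completion, model="gpt-4o"):
    """Ask a strong model to score a completion; returns an int in [1, 10]."""
    message = JUDGE_TEMPLATE.format(prompt=prompt, completion=completion)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}],
    )
    # Real pipelines guard against malformed outputs before casting to int.
    return int(response.choices[0].message.content.strip())
```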
In the years since ChatGPT’s release at the end of 2022, we’ve seen numerous impactful synthetic datasets. Some examples include: Stanford Alpaca, one of the first chat-style fine-tuning datasets, in 2023; UltraFeedback [6], the first prominent synthetic preference dataset, which kickstarted the DPO revolution; the skill-focused (e.g. math, code, instruction-following) synthetic datasets in Tülu 3 [8]; and OpenThoughts 3 and many other synthetic reasoning datasets in 2025 for training thinking models [9]. Most of the canonical references for getting started with industry-grade post-training today involve datasets like Tülu 3 or OpenThoughts 3, while quickstart guides often begin with smaller, simpler datasets like Alpaca because they train far faster.
A large change is also related to dataset size: fine-tuning datasets have grown both in the number of prompts (Alpaca has 52K samples, while OpenThoughts 3 and Tülu 3 have 1M+) and in the length of responses. Longer responses and more prompts result in the Alpaca dataset being on the order of 10M training tokens, while Tülu 3 is roughly 50X larger at about 500M, and OpenThoughts 3 is bigger still, on the order of 10B tokens.
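Rough, back-of-the-envelope arithmetic (sample counts and token totals approximated from the figures above) shows that the growth is driven as much by response length as by prompt count:

```python
# Illustrative arithmetic only; counts are approximate orders of magnitude.
datasets = {
    "Alpaca": {"samples": 52_000, "tokens": 10_000_000},
    "Tulu 3": {"samples": 1_000_000, "tokens": 500_000_000},
    "OpenThoughts 3": {"samples": 1_200_000, "tokens": 10_000_000_000},
}

for name, d in datasets.items():
    # Average tokens per example rises from ~200 to ~500 to several thousand.
    print(f"{name}: ~{d['tokens'] // d['samples']} tokens per example")
```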
Throughout this transition, synthetic data has not replaced human data uniformly across the pipeline. For instruction data (SFT), synthetic generation has largely won: distillation from stronger models now produces higher quality completions than most human writers can provide at scale (with some exceptions on the hardest frontier reasoning problems). For preference data in RLHF, the picture is more mixed: academic work shows synthetic preference data performs comparably, yet frontier labs still treat human preference data as a competitive moat. For evaluation, the split takes a different flavor: LLM-as-a-judge scales the scoring of model outputs cost-effectively, but the underlying benchmarks and ground-truth labels still require human creation. The pattern is that synthetic data dominates where models exceed human reliability, while humans remain essential at capability frontiers, for establishing ground truth, and for guiding training.
The term distillation has come to dominate discussion of the role of synthetic data in language models. Distillation as a term comes from the technical definition of teacher-student knowledge distillation in the deep learning literature [10].
Distillation colloquially refers to using the outputs from a stronger model to train a smaller model. In post-training, this general notion of distillation takes two common forms:
- As a data engine used across wide swaths of the post-training process: completions for instructions, preference data (e.g. via Constitutional AI), or verification for RL.
- To transfer specific skills from a stronger model to a weaker model, most often for targeted capabilities such as mathematical reasoning or coding.
The first strategy has grown in popularity as language models evolved to be more reliable than humans at writing answers to a variety of tasks. GPT-4 class models expanded the scope of this to distilling from stronger models for complex tasks such as math and code (as mentioned above). Here, distillation motivates having a model suite, where a laboratory will often train a large internal model, such as Claude Opus or Gemini Ultra, that is not released publicly and is instead used to make its other models stronger. With open models, common practice is to distill training data from closed API models into smaller, openly available weights [11]. Within this, curating high-quality prompts and filtering responses from the teacher model is crucial to maximize performance.
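A minimal sketch of this data-engine pattern is below: sample completions from a stronger teacher behind an API and keep only those that pass a quality filter (for instance, the judge sketch above, or a simple length or format check). The client, model name, and chat-format output are assumptions for illustration, not a description of any lab’s pipeline.

```python
from openai import OpenAI  # teacher model behind an API; the model name is illustrative

client = OpenAI()

def distill_sft_data(prompts, quality_filter, teacher="gpt-4o"):
    """Build an SFT dataset by sampling completions from a stronger teacher
    model and keeping only those that pass a caller-supplied quality filter."""
    dataset = []
    for prompt in prompts:
        response = client.chat.completions.create(
            model=teacher,
            messages=[{"role": "user", "content": prompt}],
        )
        completion = response.choices[0].message.content
        if quality_filter(prompt, completion):
            # Store in a standard chat format for downstream SFT trainers.
            dataset.append({"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": completion},
            ]})
    return dataset
```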
Transferring specific skills into smaller language models uses the same principles of distillation – get the best data possible for training. Here, many papers have studied using limited datasets from stronger models to improve alignment [12], mathematical reasoning [13] [14], and test-time scaling [15].