Reinforcement Learning from Human Feedback

A short introduction to RLHF and post-training focused on language models.

Nathan Lambert


Reasoning Training & Inference-Time Scaling

Reasoning models and inference-time scaling enabled a massive step in language model performance at the end of 2024, through 2025, and beyond. Inference-time scaling is the underlying property of machine learning systems that language models trained to think extensively before answering exploit so well: spending more compute at inference improves performance on downstream tasks. These models, trained with a large amount of reinforcement learning with verifiable rewards (RLVR) [1], still utilize large amounts of RLHF. In this chapter we trace the path that led the AI community to a transformed appreciation for RL’s potential in language models, review the fundamentals of RLVR, highlight key works, and point to the debates that will define the area in the next few years.

To start, at the 2016 edition of the Neural Information Processing Systems (NeurIPS) conference, Yann LeCun first introduced his now-famous cake metaphor for where learning happens in modern machine learning systems:

If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning (RL).

This analogy is now largely complete with modern language models and recent changes to the post-training stack. RLHF was the precursor to this shift, and RL for reasoning models, primarily on math, code, and science topics, was its confirmation. In this analogy:

  - The bulk of the cake is the self-supervised (unsupervised) pretraining that gives the model its broad knowledge and capabilities,
  - The icing is the supervised and preference-based post-training that shapes the model into a useful assistant, and
  - The cherry on top is the comparatively small amount of RL training that unlocks reasoning behavior.

This small but crucial amount of reasoning training emerged with thinking models, which combine the post-training techniques discussed in this book for aligning preferences with RL training on verifiable domains to dramatically increase capabilities such as reasoning, coding, and mathematical problem solving.

The training method for these models, Reinforcement Learning with Verifiable Rewards (RLVR) [1], proceeds very similarly to RLHF, but it makes the reward model optional, replacing it with a scoring function that returns a positive reward when the answer is correct and 0 otherwise.

For example, consider how different it is to score responses for RLHF versus RLVR. In RLHF, a reward model must evaluate subjective qualities:

Prompt: Explain the concept of opportunity cost in economics.

Response: Opportunity cost is the value of the next best alternative you give up when making a decision. For example, if you spend an hour studying instead of working, the opportunity cost is the wages you could have earned…

Scoring this response requires judging clarity, accuracy, completeness, and helpfulness – all qualities that require learned preferences and lack a definitive correct answer.

In contrast, RLVR uses verification functions that return definitive scores. For mathematics:

Prompt: What is the sum of all prime numbers less than 20?

Response: The prime numbers less than 20 are 2, 3, 5, 7, 11, 13, 17, and 19. Adding them: 2 + 3 = 5, then 5 + 5 = 10, then 10 + 7 = 17, then 17 + 11 = 28, then 28 + 13 = 41, then 41 + 17 = 58, and finally 58 + 19 = 77. The answer is \(\boxed{77}\).

Verification: extracted_answer == 77 → Reward = 1

The \boxed{} notation is a convention adopted from mathematical typesetting that makes answer extraction straightforward – a simple regular expression can pull the final answer from the response, regardless of how the model arrived at it. Note that other answer extraction methods exist, such as just using the phrase “The answer is:” as also shown above, special tokens like <answer>, or delimiters like ####.
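To make this concrete, below is a minimal sketch of such a verification function for the math example, assuming answers are wrapped in \boxed{}. The function names and the exact regular expression are illustrative assumptions rather than any particular library’s implementation.

```python
import re

def extract_boxed_answer(response: str) -> str | None:
    # Grab the contents of the last \boxed{...} in the response, if any.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def math_reward(response: str, ground_truth: str) -> float:
    # Binary verifiable reward: 1.0 if the extracted answer matches the
    # ground truth exactly, 0.0 otherwise (no learned reward model needed).
    answer = extract_boxed_answer(response)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

# Example with the response above:
print(math_reward("... The answer is \\boxed{77}.", "77"))  # 1.0
```

Real verifiers typically add answer normalization (stripping whitespace, comparing symbolic math expressions) so that equivalent answers are not penalized.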

For code generation, verification often takes the form of unit tests:

Prompt: Write a Python function fib(n) that returns the nth Fibonacci number, where fib(0) = 0 and fib(1) = 1.

Response:

    def fib(n):
        if n < 2:
            return n
        return fib(n - 1) + fib(n - 2)

Verification (unit tests):

    assert fib(0) == 0    # base case
    assert fib(1) == 1    # base case
    assert fib(10) == 55  # larger value

(All tests pass → Reward = 1)

Unit tests are the natural verification function for code: they execute the model’s solution against known input-output pairs. A common form of scoring is simple gating: if all assertions pass, the reward is 1; if any fail, the reward is 0. Other setups use partial credit proportional to the number of tests passed. For both of these examples, no learned reward model is needed and most setups go without one (because the models are robust to over-optimization in these domains), but a learned reward model can still be combined with the verifiable reward in a linear combination.
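As an illustration of this gating and the partial-credit variant, here is a minimal sketch of a unit-test reward function. The function names are assumptions for this example, and executing model-written code with exec is for illustration only; real systems run untrusted code in sandboxed environments.

```python
def code_reward(solution_code: str, tests: list[str], partial_credit: bool = False) -> float:
    # Execute the model's solution, then score it against assert-based unit tests.
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # illustration only; sandbox untrusted code in practice
    except Exception:
        return 0.0  # solution does not even run

    passed = 0
    for test in tests:
        try:
            exec(test, namespace)
            passed += 1
        except Exception:
            continue  # failed assertion or runtime error

    if partial_credit:
        return passed / len(tests)                # fraction of tests passed
    return 1.0 if passed == len(tests) else 0.0   # simple gating

solution = "def fib(n):\n    if n < 2:\n        return n\n    return fib(n - 1) + fib(n - 2)"
tests = ["assert fib(0) == 0", "assert fib(1) == 1", "assert fib(10) == 55"]
print(code_reward(solution, tests))  # 1.0
```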

The ideas behind RLVR are not new to the RL literature, where taking gradient updates based on whether the answer is correct is almost the textbook definition of reinforcement learning. The innovations in applying this to language models largely concern how to do so while maintaining the strong, general capabilities of the model being fine-tuned. Within that, there are many related ideas in the language modeling literature where the model learns from feedback on the correctness of its answers.

In the work I was a part of that coined RL with Verifiable Rewards (RLVR) [1], the method was originally going to be named RL with Ground Truth rewards (RLGT). However, RLVR is subtly different from learning solely from ground truth answers. In domains like mathematics, a single ground truth answer is available to verify solutions, as we saw above. In other domains, such as code generation or precise instruction following, answers can be verified with a checking function (e.g., a unit test), even when there are multiple correct solutions rather than a single ground truth answer. Core to progress on RLVR is having a variety and depth of these verifiable problems, even if the exact solution isn’t known a priori.
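For instance, a checking function for precise instruction following might verify a formatting constraint such as “respond in at most three sentences, each under twenty words.” Many different responses satisfy the constraint, so there is no single ground truth, yet correctness is still verifiable. The sketch below is a hypothetical example of such a checker; the constraint values and function name are assumptions.

```python
import re

def instruction_following_reward(response: str,
                                 max_sentences: int = 3,
                                 max_words_per_sentence: int = 20) -> float:
    # Verify a constraint rather than a single ground-truth answer: any response
    # with at most `max_sentences` sentences, each shorter than
    # `max_words_per_sentence` words, receives full reward.
    sentences = [s.strip() for s in re.split(r"[.!?]+", response) if s.strip()]
    if len(sentences) > max_sentences:
        return 0.0
    if any(len(s.split()) >= max_words_per_sentence for s in sentences):
        return 0.0
    return 1.0
```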

Figure 1: RLVR in the form of an RL feedback loop. Instead of a reward model, we use a verification function.

The first models to successfully deploy this type of training were OpenAI’s o1 [2] and the open-weight model DeepSeek R1 [3]. Soon after, the entire AI industry prioritized this training process and model style. The core change here is more of a reallocation of the stages of training and the priority of different behaviors rather than this type of RL setup being entirely new. Reasoning models brought an era where scaling RL training is expected.

As for the type of behavior these models exhibit, consider the following example comparing DeepSeek V3 0325 with the company’s reasoning model, DeepSeek R1, on the query “Write me a short poem about a goldfish.” DeepSeek V3 is very much to the point:

Goldfish Dreams

In a bowl of glass so clear and round, A tiny goldfish swims without a sound. With scales that gleam like sunlight’s kiss, He drifts along in watery bliss.

A flick of fin, a flash so bright, Chasing bubbles in the light. Does he dream of oceans wide? Or is his little world enough—his pride?

Meanwhile, the experience with DeepSeek R1 is different. It first generates numerous thinking tokens to craft a response before answering the user. The answer appears as:

<thinking> Okay, the user wants a short poem about a goldfish. Let me start by brainstorming some goldfish-related imagery. Goldfish are often in bowls, right? So maybe start with the setting. Words like “gilded bowl” or “crystal globe” could work.

Next, think about the movement of the goldfish. They swim smoothly, so words like “dart”, “glide”, or “swirl” come to mind. Maybe describe their fins as “silken” to emphasize delicacy.

Then, consider the environment around the goldfish. Sunlight through water creates shimmering effects. Words like “sunbeams dance” or “ripples spin” might capture that. Including elements like castles or pebbles adds a playful touch.

Themes? Maybe the contrast between the goldfish’s small world and its perception of freedom. The bowl as a universe, tiny but complete. Ending with a sense of mystery or silent observation could give depth. Need to keep it concise, maybe four stanzas with a rhyme scheme. Let me check the rhythm and flow. Avoid clichés but keep it accessible. Alright, let’s piece it together. </thinking>

Gilded Dreams in Liquid Glass

A gilded bowl, a crystal globe, Where tiny tides in silence flow— A flash of orange, a whispered robe, You dart through sunlit indigo.

Your world is small, yet boundless too— Each pebble, castle, weed unfurled. The surface bends the sky to blue, And ripples spin your liquid world.

You glide through myths we’ll never know, A silent monarch, veiled in mist. Do dreams drift where the bubbles go? Or memories in amethyst?

We watch you trace your endless sphere— A universe within a tear.

The tokens in the <thinking> tags are the model’s reasoning. For more complex problems the reasoning stage can take thousands of tokens before producing an answer. So, long-context language models were a prerequisite to advanced reasoning behavior, but that is not the focus of this chapter.

The core intuition for how this training works is that for a given model, we repeat the following cycle:

  1. Sample multiple answers to multiple questions,
  2. Take gradient steps towards the answers that are correct, and
  3. Repeat, revisiting the same data.

Remarkably, this extremely simple approach (when done with a careful distribution of data and stable training infrastructure) helps the models learn by revisiting the same questions again and again. Even more remarkable is that the improvements on these training questions generalize to questions and (some) domains the models have never seen!

This simple approach allows the models to lightly search over behavior space and the RL algorithm increases the likelihood of behaviors that are correlated with correct answers.
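The cycle above can be summarized in a compact sketch of an RLVR training loop. This is conceptual pseudocode rather than any specific library’s trainer: `policy.generate`, `verify`, and `policy_gradient_step` are placeholder functions passed in by the caller, and production implementations add group-normalized advantages (e.g., GRPO), KL penalties to a reference model, and careful batching.

```python
def rlvr_training_loop(policy, prompts, answers, verify, policy_gradient_step,
                       num_iterations=1000, samples_per_prompt=8):
    # Repeatedly sample, verify, and reinforce completions that reach correct answers.
    for _ in range(num_iterations):
        for prompt, ground_truth in zip(prompts, answers):
            # 1. Sample multiple answers to the same question.
            completions = [policy.generate(prompt) for _ in range(samples_per_prompt)]

            # 2. Score each completion with the verification function (1 if correct, else 0).
            rewards = [verify(completion, ground_truth) for completion in completions]

            # 3. Take gradient steps toward the completions that are correct.
            policy_gradient_step(policy, prompt, completions, rewards)
        # 4. Repeat, revisiting the same data many times.
    return policy
```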

The Origins of New Reasoning Models

Here we detail the high-level trends that led to the explosion of reasoning models in 2025.

Why Does RL Work Now?

Despite many, many takes that “RL doesn’t work yet” [4] and papers detailing deep reproducibility issues with RL [5], the field overcame these problems to find high-impact applications. Some are covered in this book, such as ChatGPT’s RLHF and DeepSeek R1’s RLVR, but many others exist, including improving chip design [6], mastering video gameplay [7], self-driving [8], and more. The takeoff of RL-focused training on language models reflects progress on many fundamental issues for the research area, from more stable algorithms and maturing open-source training infrastructure [9] [10] [11] to stronger base models to build on.

Multiple resources point to RL training for reasoning only being viable on leading models coming out from about 2024 onwards, indicating that a certain level of underlying capability was needed in the models before reasoning training was possible.

RL Training vs. Inference-time Scaling

Training with reinforcement learning to elicit reasoning behaviors and performance on verifiable domains is closely linked to the idea of inference-time scaling. Inference-time scaling, also called test-time scaling, is the general class of methods that use more computational power at inference time in order to perform better at downstream tasks. Methods for inference-time scaling were studied before the releases of OpenAI’s o1 and DeepSeek R1, which both massively popularized investment in RL training specifically. Examples include value-guided sampling [12] and repeated random sampling with answer extraction [13]. Beyond chain-of-thought reasoning to solve problems, inference-time scaling can also improve other pieces of the training stack, such as reward models that deliberate deeply over candidate responses [14] [15].
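As a simple example of repeated sampling with answer extraction, the sketch below spends extra inference compute by drawing many completions and returning the most common final answer (often called self-consistency or majority voting). The `generate` argument is a placeholder for any stochastic sampling call, and `extract_boxed_answer` refers to the helper sketched earlier in this chapter.

```python
from collections import Counter

def majority_vote_answer(prompt: str, generate, num_samples: int = 16) -> str | None:
    # Inference-time scaling without any training: sample many completions,
    # extract each final answer, and return the most frequent one.
    answers = []
    for _ in range(num_samples):
        completion = generate(prompt)               # any stochastic sampling call
        answer = extract_boxed_answer(completion)   # helper from the earlier sketch
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```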

RL training is currently the most direct path to exploiting inference-time scaling, but in the long term we will have more methods for eliciting the inference-time tradeoffs needed for the best performance. Training models heavily with RL often causes them to generate more tokens per response in a way that is strongly correlated with improved downstream performance (while this increase in sequence length is the default, research also exists on improving performance without relying on this form of inference-time scaling). This is a substantial shift from the length bias seen in early RLHF systems [16], where human preference training had the side effect of increasing average response length for marginal gains in preference rankings.

Beyond the core RL-trained models, many methods are being explored to continue pushing the limits of reasoning and inference-time compute. These are largely out of the scope of this book due to their rapidly evolving nature, but they include distilling reasoning behavior from a larger RL-trained model into a smaller model via instruction tuning [17], composing more inference calls [18], and more. What is important here is the correlation between downstream performance and an increase in the number of tokens generated; otherwise the extra tokens are just wasted energy.

The Future (Beyond Reasoning) of RLVR

In many domains, these new flavors of RLVR are much more aligned with the goals of developers because they focus on performance rather than behavior. Standard fine-tuning APIs generally use a parameter-efficient fine-tuning method such as LoRA with supervised fine-tuning on instructions. Developers pass in prompts and completions, and the model’s parameters are updated to match the completions, which increases the prevalence of features from the provided data in the model’s generations.

RLVR, in contrast, is focused on matching answers. Given queries and correct answers, RLVR helps the model learn to reach those correct answers. While standard instruction tuning is done with one or two epochs of loss updates over the data, RLVR training revisits the same few data points for hundreds or thousands of episodes to give the model time to learn new behaviors. This can be viewed as reinforcing positive behaviors that appear only sparingly in the original model into robust behaviors after RLVR training.

The scope of RL training for language models continues to grow. The biggest takeaway from o1 and R1 on a fundamental scientific level was that we have even more ways to train language models toward potentially valuable behaviors. The more doors that are open to researchers and engineers, the more optimism we should have about AI’s general trajectory.

Understanding Reasoning Training Methods

The investment in reasoning has instigated a major evolution in how models are trained to follow human instructions. These recipes still use the common pieces discussed in earlier chapters (such as the overview of DeepSeek R1’s recipe in Chapter 4), including instruction fine-tuning, reinforcement learning from human feedback, and reinforcement learning with verifiable rewards (RLVR). The core change is using far more RLVR and applying the other training techniques in different orders: typically, the core training step for a reasoning model is either a large-scale RL run or a large-scale instruction tuning run on the outputs of another model that has undergone a substantial amount of RLVR training (referred to as distillation).

Reasoning Research Before OpenAI’s o1 and DeepSeek R1

Before the takeoff of reasoning models, substantial effort went into understanding how to train language models to be better at verifiable domains. The main difference in the works below is that their methodologies did not scale to the same degree as those used in DeepSeek R1 and subsequent models, or they resulted in models that sacrificed overall performance in exchange for higher mathematics or coding abilities. The underlying ideas and motivations are included to paint a broader picture of how reasoning models emerged within the landscape.

Some of the earliest efforts to train language models on verifiable domains include the Self-Taught Reasoner (STaR) line of work [19] [20] and TRICE [21], which used ground-truth reward signals to encourage chain-of-thought reasoning in models throughout 2022 and 2023. STaR effectively approximates the policy gradient algorithm, but in practice it filters samples differently and uses a cross-entropy measure instead of a log-probability; Quiet-STaR expands on this with ideas closely related to recent reasoning models by having the model generate thinking tokens before attempting to answer the verifiable question (which helps training performance). TRICE [21] also improves reasoning by generating traces and then optimizing with a custom Markov chain Monte Carlo inspired expectation-maximization algorithm. VinePPO [22] followed these with a setup closer to modern reasoning models: a PPO-based algorithm with binary rewards for math question correctness, training on GSM8K and MATH. Other work before OpenAI’s o1 and DeepSeek R1 used code execution as a feedback signal for training [23], [24] or verification for theorem proving (called Reinforcement Learning from Verifier Feedback, RLVF, there) [25]. Tülu 3 expanded upon these methods by using a simple PPO trainer to reward completions with correct answers, most importantly while maintaining the model’s overall performance on a broad suite of evaluations. The binary rewards of Tülu 3 and modern reasoning training techniques can be contrasted with the iterative approach of STaR and the log-likelihood rewards of Quiet-STaR.

Early Reasoning Models

Below is a summary of the foundational reasoning research reports following DeepSeek R1, some of which are accompanied by open data and model weights.

Table 1: A summary of the notable reasoning model technical reports in 2025, the first year of substantial inference-time scaling with RLHF.
| Date | Name | TLDR | Open weights | Open data |
|------|------|------|--------------|-----------|
| 2025-01-22 | DeepSeek R1 [3] | RL-based upgrade to DeepSeek, big gains on math & code reasoning | Yes | No |
| 2025-01-22 | Kimi 1.5 [26] | Scales PPO/GRPO on Chinese/English data; strong AIME maths | No | No |
| 2025-03-31 | Open-Reasoner-Zero [27] | Fully open replication of base model RL | Yes | Yes |
| 2025-04-10 | Seed-Thinking 1.5 [28] | ByteDance RL pipeline with dynamic CoT gating | Yes | No |
| 2025-04-30 | Phi-4 Reasoning [29] | 14B model; careful SFT→RL; excels at STEM reasoning | Yes | No |
| 2025-05-02 | Llama-Nemotron [30] | Multi-size “reasoning-toggle” models | Yes | Yes |
| 2025-05-12 | INTELLECT-2 [31] | First publicly documented globally-decentralized RL training run | Yes | Yes |
| 2025-05-12 | Xiaomi MiMo [32] | End-to-end reasoning pipeline from pre- to post-training | Yes | No |
| 2025-05-14 | Qwen 3 [33] | Similar to R1 recipe applied to new models | Yes | No |
| 2025-05-21 | Hunyuan-TurboS [34] | Mamba-Transformer MoE, adaptive long/short CoT | No | No |
| 2025-05-28 | Skywork OR-1 [35] | RL recipe avoiding entropy collapse; beats DeepSeek on AIME | Yes | Yes |
| 2025-06-04 | Xiaomi MiMo VL [36] | Adapting reasoning pipeline end-to-end to include multi-modal tasks | Yes | No |
| 2025-06-04 | OpenThoughts [37] | Public 1.2M-example instruction dataset distilled from QwQ-32B | Yes | Yes |
| 2025-06-10 | Magistral [38] | Pure RL on Mistral 3; multilingual CoT; small model open-sourced | Yes | No |
| 2025-06-16 | MiniMax-M1 [39] | Open-weight 456B MoE hybrid/Lightning Attention reasoning model; 1M context; RL w/ CISPO; releases 40K/80K thinking-budget checkpoints | Yes | No |
| 2025-07-10 | Kimi K2 [40] | 1T MoE (32B active) with MuonClip (QK-clip) for stability; 15.5T-token pretrain without loss spikes; multi-stage post-train with agentic data synthesis + joint RL; releases base + post-trained checkpoints | Yes | No |
| 2025-07-28 | GLM-4.5 [41] | Open-weight 355B-A32B MoE “ARC” model with thinking/non-thinking modes; 23T-token multi-stage training + post-train w/ expert iteration and RL; releases GLM-4.5 + GLM-4.5-Air (MIT) | Yes | No |
| 2025-08-20 | Nemotron Nano 2 [42] | Hybrid Mamba-Transformer for long “thinking traces”; FP8 pretraining at 20T tokens then compression/distillation; explicitly releases multiple checkpoints plus “majority” of pre/post-training datasets | Yes | Yes (most) |
| 2025-09-09 | K2-Think [43] | Parameter-efficient math reasoning system: a 32B open-weights model with test-time scaling recipe; positioned as fully open incl. training data/code (per release materials) | Yes | Yes |
| 2025-09-23 | LongCat-Flash-Thinking [44] | 560B MoE reasoning model; report is explicit about a staged recipe from long-CoT cold start to large-scale RL; open-source release | Yes | No |
| 2025-10-21 | Ring-1T [45] | Trillion-scale “thinking model” with RL scaling focus; report frames bottlenecks/solutions for scaling RL at 1T and releases an open model | Yes | No |
| 2025-11-20 | OLMo 3 Think [46] | Fully open “model flow” release: reports the entire lifecycle (stages, checkpoints, and data points) and positions OLMo 3 Think 32B as a flagship open thinking model | Yes | Yes |
| 2025-12-02 | DeepSeek V3.2 [47] | Open-weight MoE frontier push with a report that foregrounds attention efficiency changes, RL framework upgrades, and data synthesis for agentic/reasoning performance | Yes | No |
| 2025-12-05 | K2-V2 [48] | 70B dense “360-open” model trained from scratch; 3-effort SFT-only post-training for controllable thinking | Yes | Yes |
| 2025-12-15 | Nemotron 3 Nano [49] | 30B-A3B MoE hybrid Mamba-Transformer; pretrain on 25T tokens and includes SFT + large-scale RL; explicitly states it ships weights + recipe/code + most training data | Yes | Yes (most) |
| 2025-12-16 | MiMo-V2-Flash [50] | 309B MoE (15B active) optimized for speed: hybrid SWA/GA attention (5:1, 128-token window) + lightweight MTP; FP8 pretrain on 27T tokens; post-train with MOPD + large-scale agentic RL for reasoning/coding | Yes | No |

Common Practices in Training Reasoning Models

In this section we detail common methods used to sequence training stages and modify data to maximize performance when training a reasoning model.

Note that a paper may have used a listed technique without mentioning it while its peers do, so these examples are a subset of known implementations and should be treated as a reference, not a final proclamation of an optimal recipe.

In complement to the common techniques, there are also many common findings on how reasoning training can create useful models without sacrificing ancillary capabilities.

Looking Ahead

The landscape of reasoning models is evolving faster than any area of AI research in recent memory. By the time this chapter is published, the table of reasoning models above will be incomplete and some of the common practices listed may have been superseded by new techniques.

Several efforts are underway to systematically understand what makes reasoning training work. OLMo 3 Think [46] represents the most comprehensive open documentation of a reasoning model’s full training lifecycle, providing checkpoints and data at each stage for the research community to study, and concluding with a nearly four-week training run on 220 GPUs. Similarly, work on understanding the scaling properties of RL for reasoning [57] is beginning to formalize relationships between compute, data, and performance that were previously only intuited by practitioners.

What remains clear is that reinforcement learning has graduated from the “cherry on top” of the cake metaphor to a load-bearing component of frontier model training. The minor techniques in this chapter around the idea of RLVR – difficulty filtering, format rewards, and the rest – are not the final answers, but they represent the field’s current best understanding of how to elicit reasoning from language models. The next generation of methods will likely look different, but they will build on the foundations established here.

Bibliography

[1]
N. Lambert et al., “Tulu 3: Pushing frontiers in open language model post-training,” arXiv preprint arXiv:2411.15124, 2024.
[2]
OpenAI, “Introducing OpenAI o1-preview.” Sept. 2024. Available: https://openai.com/index/introducing-openai-o1-preview/
[3]
D. Guo et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025.
[4]
A. Irpan, “Deep reinforcement learning doesn’t work yet.” 2018. Available: https://www.alexirpan.com/2018/02/14/rl-hard.html
[5]
P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Proceedings of the AAAI conference on artificial intelligence, 2018. Available: https://ojs.aaai.org/index.php/AAAI/article/view/11694
[6]
A. Mirhoseini et al., “Chip placement with deep reinforcement learning,” in Design, automation and test in europe (DATE), 2023.
[7]
J. Schrittwieser et al., “Mastering atari, go, chess and shogi by planning with a learned model,” Nature, vol. 588, no. 7839, pp. 604–609, 2020.
[8]
M. Cusumano-Towner et al., “Robust autonomy emerges from self-play,” in International conference on machine learning (ICML), 2025.
[9]
L. von Werra et al., “TRL: Transformer reinforcement learning,” GitHub repository. https://github.com/huggingface/trl; GitHub, 2020.
[10]
G. Sheng et al., “HybridFlow: A flexible and efficient RLHF framework,” in European conference on computer systems (EuroSys), 2025.
[11]
J. Hu et al., “OpenRLHF: An easy-to-use, scalable and high-performance RLHF framework,” arXiv preprint arXiv:2405.11143, 2024.
[12]
J. Liu, A. Cohen, R. Pasunuru, Y. Choi, H. Hajishirzi, and A. Celikyilmaz, “Don’t throw away your value model! Generating more preferable text with value-guided monte-carlo tree search decoding,” arXiv preprint arXiv:2309.15028, 2023.
[13]
B. Brown et al., “Large language monkeys: Scaling inference compute with repeated sampling,” arXiv preprint arXiv:2407.21787, 2024.
[14]
Z. Ankner, M. Paul, B. Cui, J. D. Chang, and P. Ammanabrolu, “Critique-out-loud reward models,” arXiv preprint arXiv:2408.11791, 2024.
[15]
Z. Liu et al., “Inference-time scaling for generalist reward modeling,” arXiv preprint arXiv:2504.02495, 2025.
[16]
P. Singhal, T. Goyal, J. Xu, and G. Durrett, “A long way to go: Investigating length correlations in rlhf,” arXiv preprint arXiv:2310.03716, 2023.
[17]
N. Muennighoff et al., “s1: Simple test-time scaling,” arXiv preprint arXiv:2501.19393, 2025.
[18]
L. Chen et al., “Are more llm calls all you need? Towards scaling laws of compound inference systems,” arXiv preprint arXiv:2403.02419, 2024.
[19]
E. Zelikman, Y. Wu, J. Mu, and N. Goodman, “STaR: Bootstrapping reasoning with reasoning,” in Advances in neural information processing systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, Eds., 2022. Available: https://openreview.net/forum?id=_3ELRdg2sgI
[20]
E. Zelikman, G. Harik, Y. Shao, V. Jayasiri, N. Haber, and N. D. Goodman, “Quiet-STaR: Language models can teach themselves to think before speaking,” COLM, vol. abs/2403.09629, 2024.
[21]
M. D. Hoffman et al., “Training chain-of-thought via latent-variable inference,” in Thirty-seventh conference on neural information processing systems, 2023. Available: https://openreview.net/forum?id=a147pIS2Co
[22]
A. Kazemnejad et al., “VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment.” 2024. Available: https://arxiv.org/abs/2410.01679
[23]
J. Gehring, K. Zheng, J. Copet, V. Mella, T. Cohen, and G. Synnaeve, “RLEF: Grounding code LLMs in execution feedback with reinforcement learning,” in International conference on machine learning (ICML), 2025. Available: https://arxiv.org/abs/2410.02089
[24]
S. Xu et al., “Is DPO superior to PPO for LLM alignment? A comprehensive study,” in ICML, 2024. Available: https://openreview.net/forum?id=6XH8R7YrSk
[25]
N. Amit, S. Goldwasser, O. Paradise, and G. Rothblum, “Models that prove their own correctness,” Electron. Colloquium Comput. Complex., 2024.
[26]
K. Team et al., “Kimi k1.5: Scaling reinforcement learning with llms,” arXiv preprint arXiv:2501.12599, 2025.
[27]
J. Hu, Y. Zhang, Q. Han, D. Jiang, X. Zhang, and H. Shum, “Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model,” arXiv preprint arXiv:2503.24290, 2025.
[28]
B. Seed et al., “Seed-thinking-v1.5: Advancing superb reasoning models with reinforcement learning,” arXiv preprint arXiv:2504.13914, 2025.
[29]
M. Abdin, S. Agarwal, A. Awadallah, et al., “Phi-4-reasoning technical report,” arXiv preprint arXiv:2504.21318, 2025.
[30]
A. Bercovich, I. Levy, I. Golan, et al., “Llama‑nemotron: Efficient reasoning models,” arXiv preprint arXiv:2505.00949, 2025.
[31]
P. I. Team et al., “INTELLECT-2: A reasoning model trained through globally decentralized reinforcement learning.” 2025. Available: https://arxiv.org/abs/2505.07291
[32]
B. Xia et al., “MiMo: Unlocking the reasoning potential of language model–from pretraining to posttraining,” arXiv preprint arXiv:2505.07608, 2025.
[33]
A. Yang et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.
[34]
A. Liu, B. Zhou, C. Xu, et al., “Hunyuan‑TurboS: Advancing large language models through mamba‑transformer synergy and adaptive chain‑of‑thought,” arXiv preprint arXiv:2505.15431, 2025.
[35]
J. He, J. Liu, C. Y. Liu, et al., “Skywork open reasoner 1 technical report,” arXiv preprint arXiv:2505.22312, 2025.
[36]
C. Team et al., “MiMo-VL technical report.” 2025. Available: https://arxiv.org/abs/2506.03569
[37]
E. Guha, R. Marten, S. Keh, et al., “OpenThoughts: Data recipes for reasoning models,” arXiv preprint arXiv:2506.04178, 2025.
[38]
Mistral AI, “Magistral: Scaling reinforcement learning for reasoning in large language models,” Mistral AI, 2025. Available: https://mistral.ai/static/research/magistral.pdf
[39]
MiniMax, “MiniMax-M1: Scaling test-time compute efficiently with lightning attention.” 2025. doi: 10.48550/arXiv.2506.13585.
[40]
K. Team et al., “Kimi K2: Open agentic intelligence.” 2025. Available: https://arxiv.org/abs/2507.20534
[41]
A. Zeng et al., “GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models.” 2025. doi: 10.48550/arXiv.2508.06471.
[42]
NVIDIA, “NVIDIA nemotron nano 2: An accurate and efficient hybrid mamba-transformer reasoning model.” 2025. Available: https://arxiv.org/abs/2508.14444
[43]
Z. Cheng et al., “K2-think: A parameter-efficient reasoning system.” 2025. Available: https://arxiv.org/abs/2509.07604
[44]
M. L. Team, “Introducing LongCat-flash-thinking: A technical report.” 2025. Available: https://arxiv.org/abs/2509.18883
[45]
L. Team et al., “Every step evolves: Scaling reinforcement learning for trillion-scale thinking model.” 2025. Available: https://arxiv.org/abs/2510.18855
[46]
T. Olmo et al., “Olmo 3.” 2025. Available: https://arxiv.org/abs/2512.13961
[47]
DeepSeek-AI, “DeepSeek-V3.2: Pushing the frontier of open large language models.” 2025. Available: https://arxiv.org/abs/2512.02556
[48]
Z. Liu et al., “K2-V2: A 360-open, reasoning-enhanced LLM,” arXiv preprint arXiv:2512.06201, 2025.
[49]
NVIDIA, “Nemotron 3 nano: Open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning,” NVIDIA, Technical Report, 2025. Available: https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf
[50]
L.-C. Xiaomi, “MiMo-V2-flash technical report.” 2025. Available: https://github.com/XiaomiMiMo/MiMo-V2-Flash/blob/main/paper.pdf
[51]
Z. Wang et al., “RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning.” 2025. Available: https://arxiv.org/abs/2504.20073
[52]
Q. Yu et al., “DAPO: An open-source LLM reinforcement learning system at scale.” 2025.
[53]
R. Shao et al., “Spurious rewards: Rethinking training signals in RLVR.” https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking-Training-Signals-in-RLVR-1f4df34dac1880948858f95aeb88872f, 2025.
[54]
Z. Liu et al., “Understanding R1-zero-like training: A critical perspective,” arXiv preprint arXiv:2503.20783, Mar. 2025, Available: https://arxiv.org/abs/2503.20783
[55]
Anthropic, “Claude 4.” May 2025. Available: https://www.anthropic.com/news/claude-4
[56]
P. Aggarwal and S. Welleck, “L1: Controlling how long a reasoning model thinks with reinforcement learning,” arXiv preprint arXiv:2503.04697, 2025.
[57]
D. Khatri et al., “The art of scaling reinforcement learning compute for llms,” arXiv preprint arXiv:2510.13786, 2025.