Reinforcement Learning from Human Feedback

A short introduction to RLHF and post-training focused on language models.

Nathan Lambert


Tool Use & Function Calling

Language models using tools is a natural way to expand their capabilities, especially for high-precision tasks where external tools contain the needed information or for agents that must interact with complex web systems. Tool use is the general category covering these strategies: an AI model uses an external tool by outputting special tokens that trigger a call to a certain endpoint. Tools can range from highly specific functions, such as one that returns the weather at a given location, to code interpreters or search engines that act as fundamental building blocks of complex behaviors.

The exact origin of the term “tool use” is not clear, but the idea far predates the post-ChatGPT world where RLHF proliferated. Early examples circa 2015 attempted to build such systems before modern language models, such as Neural Programmer-Interpreters (NPI) [1], “a recurrent and compositional neural network that learns to represent and execute programs.” As language models became more popular, many subfields integrated external capabilities to boost performance. To obtain information beyond what is stored in the weights, many used retrieval-augmented generation [2] or web browsing [3]. Soon after, others were exploring language models integrated with programs [4] or tools [5].

As the field matured, these models gained more complex abilities on top of vast improvements to the underlying language modeling. For example, Toolformer could use “a calculator, a Q&A system, two different search engines, a translation system, and a calendar” [6]. Soon after, Gorilla was trained to use 1,645 APIs (from PyTorch Hub, TensorFlow Hub v2, and HuggingFace), and its evaluation suite APIBench became a foundation of the popular Berkeley Function Calling Leaderboard [7]. Since these early models, the diversity of actions called has grown substantially.

Tool-use models are now deeply intertwined with regular language model interactions. Model Context Protocol (MCP) emerged as a common format for connecting language models to external data sources (or tools) [8]. With stronger models and better formats, tool-use language models are used in many situations, including productivity copilots within popular applications such as Microsoft Office or Google Workspace, scientific domains [9], medical domains [10], coding agents [11] such as Claude Code or Cursor, integrations with databases, and many other autonomous workflows.
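As a rough illustration of the format, MCP messages are JSON-RPC 2.0 requests: a client first lists a server's tools, then invokes one by name. The sketch below shows the two request shapes as Python dictionaries; the get_weather tool and its arguments are hypothetical placeholders, not part of any specific server.

# MCP requests are JSON-RPC 2.0 messages, shown here as Python dicts.
# First, ask the server which tools it exposes.
list_tools_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Then invoke one tool by name. "get_weather" and its arguments are
# hypothetical placeholders for whatever the server actually exposes.
call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "get_weather",
        "arguments": {"location": "San Francisco"},
    },
}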

Interweaving Tool Calls in Generation

Function-calling agents are presented data in much the same format as other post-training stages. The addition is content in the system prompt instructing the model which tools it has available. An example data point, with the system prompt and available tools formatted in JSON, is shown below:

<system>
You are a function-calling AI model. You are provided with function signatures within <functions></functions> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
</system>

<functions>
[
  {
    "name": "get_id",
    "description": "Fetches the ID of a movie based on the given search query from the RapidAPI similar movies service.",
    "parameters": {
      "q": {
        "description": "The search string for the movie title.",
        "type": "str",
        "default": "titanic"
      }
    }
  },
  {
    "name": "search_torrents",
    "description": "Search for torrents based on given keywords using the RapidAPI service.",
    "parameters": {
      "keywords": {
        "description": "Keywords to search for torrents.",
        "type": "str",
        "default": "Meg 2 The Trench"
      },
      "quantity": {
        "description": "Number of torrent results to return. Maximum value is 40.",
        "type": "int",
        "default": 40
      },
      "page": {
        "description": "Page number for paginated results. Defaults to 1.",
        "type": "int",
        "default": 1
      }
    }
  },
  {
    "name": "basic_info",
    "description": "Fetches detailed information about a cast member such as name, profession, birth and death year, bio, poster, and best titles.",
    "parameters": {
      "peopleid": {
        "description": "The ID of the cast member whose details are to be fetched.",
        "type": "str",
        "default": "nm0000375"
      }
    }
  }
]
</functions>

<user>
...
</user>

While the language model is generating, following the above example, it might emit the tokens search_torrents(keywords="Star Wars") to search for Star Wars. This call is often wrapped in special formatting tokens, and the next tokens inserted into the sequence contain the tool's output. With this, models can learn to accomplish more challenging tasks than a standalone model could.
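A minimal sketch of this generate-execute-insert loop is below. The <function_call>/<output> tags and the model.generate and execute_tool interfaces are assumptions for illustration; production systems use model-specific special tokens and sandboxed executors.

import re

# Hypothetical tag format; real models use dedicated special tokens.
CALL_TAG = re.compile(r"<function_call>(.*?)</function_call>", re.DOTALL)

def run_with_tools(model, execute_tool, prompt, max_rounds=8):
    """Alternate between model generation and tool execution.

    `model.generate` and `execute_tool` are assumed interfaces: the first
    returns newly generated text, the second runs a parsed call such as
    search_torrents(keywords="Star Wars") and returns its result as a string.
    """
    sequence = prompt
    for _ in range(max_rounds):
        generation = model.generate(sequence)
        sequence += generation
        match = CALL_TAG.search(generation)
        if match is None:
            return sequence  # no tool call emitted: this is the final answer
        # Execute the call and insert its output into the context so the
        # next generation step can condition on it.
        result = execute_tool(match.group(1))
        sequence += f"\n<output>{result}</output>\n"
    return sequence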

A popular form of tool use is code execution, which lets the model get precise answers to complex logic or mathematics problems. For example, code execution can occur within the thinking tokens of a reasoning model. As with function calling, there is first a tag for the code to execute (generated by the model) and then a separate tag for its output.

<|user|>
What is the 50th number in a fibonacci sequence?</s>
<|assistant|>
<think>
Okay, I will compute the 50th Fibonacci number with a simple loop, then return the result.

<code>
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

fib(50)
</code>

<output>
12586269025
</output>
</think>
<answer>
The 50th Fibonacci number is 12,586,269,025.
</answer>

Multi-step Tool Reasoning

OpenAI’s o3 model represented a substantial step-change in how multi-step tool use can be integrated with language models. This behavior builds on much earlier research trends in the community. For example, ReAct [12] showcased how actions and reasoning can be interleaved in a single model generation:

In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with and gather additional information from external sources such as knowledge bases or environments.

With the solidification of tool-use capabilities and the take-off of reasoning models, multi-turn tool-use has grown into an exciting area of research [13].
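Concretely, a ReAct-style generation alternates free-form reasoning (“thoughts”) with tool calls (“actions”) and their returned results (“observations”) until the model commits to an answer. A schematic trace, reusing the movie tools from the earlier example, might look like the following (the IDs and observations are illustrative):

Thought: I need the movie's ID before I can look up its cast.
Action: get_id(q="Titanic")
Observation: {"id": "tt0120338"}
Thought: With the movie found, I can fetch details for the lead actor.
Action: basic_info(peopleid="nm0000138")
Observation: {"name": "Leonardo DiCaprio", "profession": "actor", ...}
Thought: I now have enough information to respond.
Answer: ...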

Implementation

There are multiple formatting and masking decisions to make when implementing a tool-use model. Consider the following training example, where the available functions are stored in their own field and the assistant's function calls are kept separate from its text content:

messages = [
    {
        "content": "You are a function calling AI model. You are provided with function signatures within <functions></functions> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.",
        "function_calls": None,
        "functions": "[{\"name\": \"live_giveaways_by_type\", \"description\": \"Retrieve live giveaways from the GamerPower API based on the specified type.\", \"parameters\": {\"type\": {\"description\": \"The type of giveaways to retrieve (e.g., game, loot, beta).\", \"type\": \"str\", \"default\": \"game\"}}}]",
        "role": "system"
    },
    {
        "content": "Where can I find live giveaways for beta access and games?",
        "function_calls": None,
        "functions": None,
        "role": "user"
    },
    {
        "content": None,
        "function_calls": "live_giveaways_by_type(type='beta')\nlive_giveaways_by_type(type='game')",
        "functions": None,
        "role": "assistant"
    }
]
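A central masking decision during training is which tokens contribute to the loss. One common choice, sketched below under the standard causal-LM convention, is to supervise only the assistant's tokens (including its function calls) and mask out the system prompt, user turns, and any tool outputs, since the model does not generate those. The per-token roles annotation is an assumed preprocessing output.

IGNORE_INDEX = -100  # standard PyTorch convention for tokens excluded from the loss

def build_labels(token_ids, roles):
    """Mask out everything the model did not itself generate.

    `roles` is an assumed per-token annotation aligned with `token_ids`,
    marking each token as system / user / assistant / tool_output.
    """
    labels = []
    for token, role in zip(token_ids, roles):
        if role == "assistant":  # model-generated text and function calls
            labels.append(token)
        else:                    # prompt, user turns, and tool outputs
            labels.append(IGNORE_INDEX)
    return labels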

Bibliography

[1] S. Reed and N. de Freitas, “Neural programmer-interpreters,” arXiv preprint arXiv:1511.06279, 2015.
[2] P. Lewis et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
[3] R. Nakano et al., “WebGPT: Browser-assisted question-answering with human feedback,” arXiv preprint arXiv:2112.09332, 2021.
[4] L. Gao et al., “PAL: Program-aided language models,” in International Conference on Machine Learning, PMLR, 2023, pp. 10764–10799.
[5] A. Parisi, Y. Zhao, and N. Fiedel, “TALM: Tool augmented language models,” arXiv preprint arXiv:2205.12255, 2022.
[6] T. Schick et al., “Toolformer: Language models can teach themselves to use tools,” 2023. Available: https://arxiv.org/abs/2302.04761
[7] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive APIs,” arXiv preprint arXiv:2305.15334, 2023.
[8] Anthropic, “Model Context Protocol (MCP).” https://modelcontextprotocol.io/, 2024.
[9] A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller, “ChemCrow: Augmenting large-language models with chemistry tools,” arXiv preprint arXiv:2304.05376, 2023.
[10] B. Li et al., “MMedAgent: Learning to use medical tools with multi-modal agent,” arXiv preprint arXiv:2407.02483, 2024.
[11] K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin, “CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges,” arXiv preprint arXiv:2401.07339, 2024.
[12] S. Yao et al., “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023.
[13] Z. Wang et al., “RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning,” 2025. Available: https://arxiv.org/abs/2504.20073