Reinforcement Learning from Human Feedback

A short introduction to RLHF and post-training focused on language models.

Nathan Lambert


Tool Use & Function Calling

Language models using tools is a natural way to expand their capabilities, especially for high-precision tasks where external tools contain the needed information or for agents that must interact with complex web systems. Tool use is the general category covering these strategies: an AI model uses an external tool by outputting special tokens that trigger a call to a certain endpoint. Tools can range from highly specific functions, such as one that returns the weather at a given location, to code interpreters or search engines that act as fundamental building blocks of complex behaviors.

The exact origin of the term “tool use” is not clear, but the idea far predates the post-ChatGPT world where RLHF proliferated. Early examples circa 2015 attempted to build such systems before modern language models, such as Neural Programmer-Interpreters (NPI) [1], “a recurrent and compositional neural network that learns to represent and execute programs.” As language models became more popular, many subfields integrated external capabilities to boost performance. To obtain information beyond what is stored in the weights, many used retrieval-augmented generation [2] or web browsing [3]. Soon after, others were exploring language models integrated with programs [4] or tools [5].

As the field matured, these models gained more complex abilities on top of vast improvements to the underlying language modeling. For example, Toolformer could use “a calculator, a Q&A system, two different search engines, a translation system, and a calendar” [6]. Soon after, Gorilla was trained to use 1,645 APIs (from PyTorch Hub, TensorFlow Hub v2, and HuggingFace), and its evaluation suite APIBench became a foundation of the popular Berkeley Function Calling Leaderboard [7]. Since these early models, the diversity of actions called has grown substantially.

Tool-use models are now deeply intertwined with regular language model interactions. Model Context Protocol (MCP) emerged as a common format for connecting language models to external data sources (or tools) [8]. With stronger models and better formats, tool-use language models are used in many situations, including productivity copilots within popular applications such as Microsoft Office or Google Workspace, scientific domains [9], medical domains [10], coding agents [11] such as Claude Code or Cursor, integrations with databases, and many other autonomous workflows.
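As a rough illustration of the format, MCP messages are JSON-RPC 2.0 requests: a client first lists a server's tools, then invokes one by name. The sketch below shows the two request shapes as Python dictionaries; the get_weather tool and its arguments are hypothetical placeholders, not part of any specific server.

# MCP requests are JSON-RPC 2.0 messages, shown here as Python dicts.
# First, ask the server which tools it exposes.
list_tools_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Then invoke one tool by name. "get_weather" and its arguments are
# hypothetical placeholders for whatever the server actually exposes.
call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "get_weather",
        "arguments": {"location": "San Francisco"},
    },
}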

Interweaving Tool Calls in Generation

Function-calling agents are presented data in much the same format as other post-training stages. The addition is content in the system prompt instructing the model which tools it has available. An example data point, with the system prompt and available tools formatted in JSON, is shown below:

<system>
You are a function-calling AI model. You are provided with function signatures within <functions></functions> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.
</system>

<functions>
[
  {
    "name": "get_id",
    "description": "Fetches the ID of a movie based on the given search query from the RapidAPI similar movies service.",
    "parameters": {
      "q": {
        "description": "The search string for the movie title.",
        "type": "str",
        "default": "titanic"
      }
    }
  },
  {
    "name": "search_torrents",
    "description": "Search for torrents based on given keywords using the RapidAPI service.",
    "parameters": {
      "keywords": {
        "description": "Keywords to search for torrents.",
        "type": "str",
        "default": "Meg 2 The Trench"
      },
      "quantity": {
        "description": "Number of torrent results to return. Maximum value is 40.",
        "type": "int",
        "default": 40
      },
      "page": {
        "description": "Page number for paginated results. Defaults to 1.",
        "type": "int",
        "default": 1
      }
    }
  },
  {
    "name": "basic_info",
    "description": "Fetches detailed information about a cast member such as name, profession, birth and death year, bio, poster, and best titles.",
    "parameters": {
      "peopleid": {
        "description": "The ID of the cast member whose details are to be fetched.",
        "type": "str",
        "default": "nm0000375"
      }
    }
  }
]
</functions>

<user>
...
</user>

While the language model is generating, following the above example, it might emit the tokens search_torrents(keywords="Star Wars") to search for Star Wars. This call is often wrapped in special formatting tokens, and the next tokens inserted into the sequence contain the tool's output. With this, models can learn to accomplish more challenging tasks than a standalone model could.
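A minimal sketch of this generate-execute-insert loop is below. The <function_call>/<output> tags and the model.generate and execute_tool interfaces are assumptions for illustration; production systems use model-specific special tokens and sandboxed executors.

import re

# Hypothetical tag format; real models use dedicated special tokens.
CALL_TAG = re.compile(r"<function_call>(.*?)</function_call>", re.DOTALL)

def run_with_tools(model, execute_tool, prompt, max_rounds=8):
    """Alternate between model generation and tool execution.

    `model.generate` and `execute_tool` are assumed interfaces: the first
    returns newly generated text, the second runs a parsed call such as
    search_torrents(keywords="Star Wars") and returns its result as a string.
    """
    sequence = prompt
    for _ in range(max_rounds):
        generation = model.generate(sequence)
        sequence += generation
        match = CALL_TAG.search(generation)
        if match is None:
            return sequence  # no tool call emitted: this is the final answer
        # Execute the call and insert its output into the context so the
        # next generation step can condition on it.
        result = execute_tool(match.group(1))
        sequence += f"\n<output>{result}</output>\n"
    return sequence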

A popular form of tool use is code execution, which lets the model get precise answers to complex logic or mathematics problems. For example, code execution can occur within the thinking tokens of a reasoning model. As with function calling, there is first a tag for the code to execute (generated by the model) and then a separate tag for its output.

<|user|>
What is the 50th number in a fibonacci sequence?</s>
<|assistant|>
<think>
Okay, I will compute the 50th Fibonacci number with a simple loop, then return the result.

<code>
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

fib(50)
</code>

<output>
12586269025
</output>
</think>
<answer>
The 50th Fibonacci number is 12,586,269,025.
</answer>

Multi-step Tool Reasoning

OpenAI’s o3 model represented a substantial step-change in how multi-step tool use can be integrated with language models. This behavior builds on much earlier research trends in the community. For example, ReAct [12] showcased how actions and reasoning can be interleaved in a single model generation:

In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with and gather additional information from external sources such as knowledge bases or environments.

With the solidification of tool-use capabilities and the take-off of reasoning models, multi-turn tool-use has grown into an exciting area of research [13].
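Concretely, a ReAct-style generation alternates free-form reasoning (“thoughts”) with tool calls (“actions”) and their returned results (“observations”) until the model commits to an answer. A schematic trace, reusing the movie tools from the earlier example, might look like the following (the IDs and observations are illustrative):

Thought: I need the movie's ID before I can look up its cast.
Action: get_id(q="Titanic")
Observation: {"id": "tt0120338"}
Thought: With the movie found, I can fetch details for the lead actor.
Action: basic_info(peopleid="nm0000138")
Observation: {"name": "Leonardo DiCaprio", "profession": "actor", ...}
Thought: I now have enough information to respond.
Answer: ...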

Implementation

There are multiple formatting and masking decisions to make when implementing a tool-use model. Consider the following training example, where the available functions are stored in their own field and the assistant's function calls are kept separate from its text content:

messages = [
    {
        "content": "You are a function calling AI model. You are provided with function signatures within <functions></functions> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions.",
        "function_calls": None,
        "functions": "[{\"name\": \"live_giveaways_by_type\", \"description\": \"Retrieve live giveaways from the GamerPower API based on the specified type.\", \"parameters\": {\"type\": {\"description\": \"The type of giveaways to retrieve (e.g., game, loot, beta).\", \"type\": \"str\", \"default\": \"game\"}}}]",
        "role": "system"
    },
    {
        "content": "Where can I find live giveaways for beta access and games?",
        "function_calls": None,
        "functions": None,
        "role": "user"
    },
    {
        "content": None,
        "function_calls": "live_giveaways_by_type(type='beta')\nlive_giveaways_by_type(type='game')",
        "functions": None,
        "role": "assistant"
    }
]
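A central masking decision during training is which tokens contribute to the loss. One common choice, sketched below under the standard causal-LM convention, is to supervise only the assistant's tokens (including its function calls) and mask out the system prompt, user turns, and any tool outputs, since the model does not generate those. The per-token roles annotation is an assumed preprocessing output.

IGNORE_INDEX = -100  # standard PyTorch convention for tokens excluded from the loss

def build_labels(token_ids, roles):
    """Mask out everything the model did not itself generate.

    `roles` is an assumed per-token annotation aligned with `token_ids`,
    marking each token as system / user / assistant / tool_output.
    """
    labels = []
    for token, role in zip(token_ids, roles):
        if role == "assistant":  # model-generated text and function calls
            labels.append(token)
        else:                    # prompt, user turns, and tool outputs
            labels.append(IGNORE_INDEX)
    return labels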

Bibliography

[1] S. Reed and N. de Freitas, “Neural programmer-interpreters,” arXiv preprint arXiv:1511.06279, 2015.
[2] P. Lewis et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
[3] R. Nakano et al., “WebGPT: Browser-assisted question-answering with human feedback,” arXiv preprint arXiv:2112.09332, 2021.
[4] L. Gao et al., “PAL: Program-aided language models,” in International Conference on Machine Learning, PMLR, 2023, pp. 10764–10799.
[5] A. Parisi, Y. Zhao, and N. Fiedel, “TALM: Tool augmented language models,” arXiv preprint arXiv:2205.12255, 2022.
[6] T. Schick et al., “Toolformer: Language models can teach themselves to use tools,” 2023. Available: https://arxiv.org/abs/2302.04761
[7] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive APIs,” arXiv preprint arXiv:2305.15334, 2023.
[8] Anthropic, “Model Context Protocol (MCP).” https://modelcontextprotocol.io/, 2024.
[9] A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller, “ChemCrow: Augmenting large-language models with chemistry tools,” arXiv preprint arXiv:2304.05376, 2023.
[10] B. Li et al., “MMedAgent: Learning to use medical tools with multi-modal agent,” arXiv preprint arXiv:2407.02483, 2024.
[11] K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin, “CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges,” arXiv preprint arXiv:2401.07339, 2024.
[12] S. Yao et al., “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023.
[13] Z. Wang et al., “RAGEN: Understanding self-evolution in LLM agents via multi-turn reinforcement learning,” 2025. Available: https://arxiv.org/abs/2504.20073