Function Calling And Tool Orchestration

OpenAI API · OpenAI

Foundation Model & API — OpenAI (GPT)

Evaluates OpenAI's Function Calling & Tool Orchestration across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

About OpenAI

OpenAI builds the GPT model family and the OpenAI API — Responses and Chat Completions, function calling, Structured Outputs, embeddings, fine-tuning, the Batch API, moderation, the Realtime API, and the Agents SDK — used by developers to build AI products at scale.

Employees

~3,000

Industry

Foundation Model

Headquarters

San Francisco, CA

Website

openai.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check	Category	Severity
01	Workflow requires the next turn to call some tool (no plain text); operator wants to force a tool call.	Set tool_choice='required' so the model must call one of the provided tools. To force a specific tool, use tool_choice={type:'function',function:{name:...}}. Handle the returned tool_calls[].	Pass / Fail	AI Platform	high
02	Assistant turn returns two tool_calls (get_weather SF, get_weather NYC), each with a distinct id; agent executes both.	Return one role=tool message per tool_call_id, matched exactly by id, before the next model turn. Every tool_call must be answered.	Pass / Fail	AI Platform	critical
03	Downstream executor can only run one tool at a time and needs at most one tool call per turn.	Set parallel_tool_calls=false so the model emits a single tool call per assistant turn; pair with one tool message and continue.	Pass / Fail	AI Platform	high
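
The three sample cases can be sketched in a few lines of Python. This is a minimal illustration, not Corsac's reference solution: it only builds request arguments and tool messages as plain dictionaries in the Chat Completions JSON shape, and makes no network call. The helper names (forced_tool_request, answer_tool_calls), the get_weather stub, and its schema are assumptions made for the example.

```python
import json

# Hypothetical local tool, used only for illustration.
def get_weather(city):
    return {"city": city, "forecast": "placeholder"}

# Tool schema in the Chat Completions "tools" format (illustrative).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def forced_tool_request(model, messages, tool_name=None, single_call=False):
    """Build request arguments that require a tool call (cases 01 and 03)."""
    if tool_name is None:
        # The model must call one of the provided tools.
        tool_choice = "required"
    else:
        # The model must call this specific tool.
        tool_choice = {"type": "function", "function": {"name": tool_name}}
    kwargs = {
        "model": model,
        "messages": messages,
        "tools": TOOLS,
        "tool_choice": tool_choice,
    }
    if single_call:
        # At most one tool call per assistant turn.
        kwargs["parallel_tool_calls"] = False
    return kwargs

def answer_tool_calls(tool_calls, registry):
    """Return one role=tool message per tool call, matched by id (case 02)."""
    replies = []
    for call in tool_calls:
        fn = registry.get(call["function"]["name"])
        if fn is None:
            # Unknown tools still get a reply so no tool_call goes unanswered.
            result = {"error": "unknown tool"}
        else:
            result = fn(**json.loads(call["function"]["arguments"]))
        replies.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return replies
```

With the official Python SDK, the dictionary from forced_tool_request could be passed as client.chat.completions.create(**kwargs). The SDK returns tool calls as objects rather than dictionaries, so answer_tool_calls would need attribute access (call.id, call.function.name) in that setting.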
How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
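
One possible reading of the pass rule, as a small sketch: the page does not state the scoring scale or how scores are aggregated, so this assumes the judge returns one numeric score per rubric criterion for a case. The function name and defaults are illustrative.

```python
from statistics import mean

def case_passes(criterion_scores, mean_threshold=4.0, floor=3):
    """Pass when the mean score meets the threshold and no criterion is below the floor."""
    if not criterion_scores:
        return False  # nothing was graded, so treat the case as not passing
    return mean(criterion_scores) >= mean_threshold and min(criterion_scores) >= floor

print(case_passes([5, 4, 3]))  # True: mean is 4.0 and the lowest score is 3
print(case_passes([5, 5, 2]))  # False: mean is 4.0 but one criterion is below 3
print(case_passes([4, 4, 3]))  # False: mean is below 4.0
```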

Rubric criteria

OpenAI
AI Platform
Function Calling And Tool Orchestration

Recommended for

OpenAI API
OpenAI customers

Works with

OpenAI

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

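What does the Function Calling And Tool Orchestration eval for the OpenAI API test?

It evaluates OpenAI's function calling and tool orchestration across 9 scenario-based test cases, for example forcing a tool call with tool_choice, answering every tool call by its tool_call_id, and limiting the model to one tool call per turn with parallel_tool_calls=false.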

How is the Function Calling And Tool Orchestration eval scored?

An LLM judge grades each case against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Function Calling And Tool Orchestration pack for the OpenAI API contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Function Calling And Tool Orchestration pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
