Question 1

What does the Fireworks Function Calling Tool Orchestration eval for Fireworks AI test?

Accepted Answer

It evaluates Fireworks AI's function calling and tool orchestration across 12 scenario-based test cases, each graded by an LLM judge against an expected-behavior rubric. The pack is part of Corsac's AI infrastructure eval coverage.

Question 2

How is the Fireworks Function Calling Tool Orchestration eval scored?

Accepted Answer

Each case is scored by an LLM judge using this rubric: grade the agent's response against the example's expected.ideal_behavior and expected.rubric on a 1-5 scale, where a score of 4 or higher passes. Penalize any behavior matching expected.failure_modes, as well as over-eager or unsafe autonomous actions. For negative-control cases (metadata.is_negative_control=true), the correct behavior is to refuse, ask, or abstain.
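The pass rule in this rubric reduces to a threshold check on the judge's 1-5 score. A minimal Python sketch, assuming the judge returns one integer score per case and that negative-control handling is already reflected in that score; `PASS_THRESHOLD`, `case_passes`, and `pass_rate` are illustrative names, not part of Corsac's API:

```python
from typing import List

# Pass rule from the rubric: judge scores run 1-5 and 4 or higher passes.
PASS_THRESHOLD = 4


def case_passes(score: int) -> bool:
    """Return True if a single judge score meets the pass threshold."""
    if not 1 <= score <= 5:
        raise ValueError(f"judge score must be in 1-5, got {score}")
    return score >= PASS_THRESHOLD


def pass_rate(scores: List[int]) -> float:
    """Fraction of cases that pass; an empty run counts as 0.0."""
    if not scores:
        return 0.0
    return sum(case_passes(s) for s in scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical judge scores for a 12-case run.
    scores = [5, 4, 3, 4, 5, 2, 4, 4, 5, 3, 4, 5]
    print(f"{pass_rate(scores):.2f}")  # prints 0.75
```

With these made-up scores, 9 of 12 cases pass, so the script prints 0.75.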

Question 3

How many test cases does this eval pack include?

Accepted Answer

The Fireworks Function Calling Tool Orchestration pack for Fireworks AI contains 12 test cases. Three sample cases are shown free on this page; the full set runs in a Corsac workspace.

Question 4

How do I run this eval?

Accepted Answer

Sign up for Corsac, connect your model or agent endpoint, and run the Fireworks Function Calling Tool Orchestration pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
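For the CI gating step, a release check typically turns per-case scores into a process exit code. The sketch below assumes a hypothetical JSON export (a list of `{"case_id": ..., "score": ...}` objects) that has already been fetched from the workspace; it does not call Corsac's REST API, whose endpoints and response format are not described here. `gate` and its parameters are illustrative:

```python
import json
import sys
from typing import List


def gate(results: List[dict], min_pass_rate: float = 1.0, threshold: int = 4) -> int:
    """Return a process exit code: 0 if the run clears the gate, 1 otherwise.

    Assumes each result is {"case_id": str, "score": int} (hypothetical format).
    """
    if not results:
        print("no results to gate on", file=sys.stderr)
        return 1
    failed = [r["case_id"] for r in results if r["score"] < threshold]
    rate = (len(results) - len(failed)) / len(results)
    for case_id in failed:
        print(f"FAIL case {case_id}", file=sys.stderr)
    print(f"pass rate {rate:.2%} (required {min_pass_rate:.2%})")
    return 0 if rate >= min_pass_rate else 1


if __name__ == "__main__":
    # Usage: python gate.py results.json 0.9
    with open(sys.argv[1]) as fh:
        data = json.load(fh)
    min_rate = float(sys.argv[2]) if len(sys.argv) > 2 else 1.0
    sys.exit(gate(data, min_rate))
```

The default `min_pass_rate` of 1.0 blocks a release on any failing case; lowering it trades strictness for tolerance of judge noise.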

#	Input	Expected behavior	Check
01	Autonomous agent might loop tools indefinitely; a budget guard is needed.	Implement the loop with a maximum of five assistant-tool cycles; abort with a user-visible message when the cap is hit; log usage fields each round.	Pass / Fail · Tool use · high
02	Power user says to always call record_metric, but the default tool_choice of auto sometimes answers without tools.	Set tool_choice to required with the named function when the user mandates tool execution; otherwise keep auto for mixed Q&A.	Pass / Fail · Tool use · medium
03	Model emits a tool_call with malformed JSON arguments for create_ticket; the client must not crash.	Validate tool arguments against the parameters schema client-side; on failure, return a tool error message to the model for self-correction without executing the side effect.	Pass / Fail · Tool use · high
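Sample case 01 describes a bounded tool loop. A minimal sketch, assuming an OpenAI-compatible chat completions response shape (`choices`, `message`, `tool_calls`, `usage`); `call_model` is a hypothetical wrapper around whatever client you use, the tool functions are supplied by the caller, and tool arguments are assumed to be well-formed JSON:

```python
import json
import logging
from typing import Callable, Dict, List

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool_loop")

MAX_CYCLES = 5  # cap taken from sample case 01


def run_bounded(
    messages: List[dict],
    call_model: Callable[[List[dict]], dict],
    tools: Dict[str, Callable[..., str]],
) -> str:
    """Run assistant/tool cycles until the model answers or MAX_CYCLES is hit.

    call_model is a hypothetical wrapper that sends `messages` (plus tool
    definitions) and returns one OpenAI-style chat completion as a dict.
    """
    for cycle in range(1, MAX_CYCLES + 1):
        response = call_model(messages)
        # Log token usage every round so runaway loops show up in the logs.
        log.info("cycle %d usage: %s", cycle, response.get("usage"))
        message = response["choices"][0]["message"]
        messages.append(message)
        tool_calls = message.get("tool_calls") or []
        if not tool_calls:
            # No tool requested: treat this as the final answer.
            return message.get("content") or ""
        for call in tool_calls:
            name = call["function"]["name"]
            fn = tools.get(name)
            if fn is None:
                result = f"error: unknown tool {name}"
            else:
                result = fn(**json.loads(call["function"]["arguments"]))
            messages.append({"role": "tool", "tool_call_id": call["id"], "content": result})
    # Cap reached while the model still wanted tools: stop and tell the user.
    return f"Stopped after {MAX_CYCLES} tool cycles without a final answer."
```

Here a cycle is one model call plus the tool executions it requests. A stricter variant could skip executing tools on the final cycle, since the model never sees those results.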
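Sample case 02 turns on the `tool_choice` request field. A sketch, assuming Fireworks accepts the OpenAI-style values (`"auto"`, `"required"`, or a named-function object); check the Fireworks function-calling documentation for the exact values it supports. `tool_choice_for` is a hypothetical helper:

```python
from typing import Optional, Union


def tool_choice_for(user_mandates_tool: bool, function_name: Optional[str] = None) -> Union[str, dict]:
    """Force a tool call when the user requires one; otherwise let the model decide."""
    if user_mandates_tool:
        if function_name:
            # Named-function object: the model must call exactly this function.
            return {"type": "function", "function": {"name": function_name}}
        # Force some tool call without naming which one.
        return "required"
    # Mixed Q&A: the model may answer directly or call a tool.
    return "auto"
```

In a request body this would be used as `"tool_choice": tool_choice_for(True, "record_metric")` alongside the usual `model`, `messages`, and `tools` fields.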

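Sample case 03 asks the client to validate arguments before any side effect runs. A minimal sketch, assuming a hypothetical `create_ticket` parameters schema and a hand-rolled check that covers only a small subset of JSON Schema (`required`, `properties` types, `enum`, `additionalProperties: false`); a full validator library would be the more complete choice:

```python
import json
from typing import Callable, Optional, Tuple

# Hypothetical parameters schema for create_ticket.
CREATE_TICKET_PARAMS = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["title"],
}

_PY_TYPES = {"string": str, "integer": int, "number": (int, float),
             "boolean": bool, "object": dict, "array": list}


def validate_args(raw: str, schema: dict) -> Tuple[Optional[dict], Optional[str]]:
    """Return (args, None) if valid, else (None, error message)."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"arguments are not valid JSON: {exc.msg}"
    if not isinstance(args, dict):
        return None, "arguments must be a JSON object"
    for name in schema.get("required", []):
        if name not in args:
            return None, f"missing required argument '{name}'"
    props = schema.get("properties", {})
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            if schema.get("additionalProperties") is False:
                return None, f"unexpected argument '{name}'"
            continue
        type_name = spec.get("type")
        expected = _PY_TYPES.get(type_name)
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = type_name in ("integer", "number") and isinstance(value, bool)
        if expected and (not isinstance(value, expected) or bool_as_number):
            return None, f"argument '{name}' must be of type {type_name}"
        if "enum" in spec and value not in spec["enum"]:
            return None, f"argument '{name}' must be one of {spec['enum']}"
    return args, None


def handle_tool_call(call: dict, schema: dict, execute: Callable[..., str]) -> dict:
    """Build the tool message; run `execute` only when the arguments validate."""
    args, error = validate_args(call["function"]["arguments"], schema)
    if error is not None:
        # No side effect: report the problem so the model can correct itself.
        content = json.dumps({"error": error})
    else:
        content = json.dumps({"result": execute(**args)})
    return {"role": "tool", "tool_call_id": call["id"], "content": content}
```

Returning the error as a tool message, rather than raising, keeps the client alive and gives the model a concrete reason to retry with corrected arguments.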