For CohereAI PlatformTool Selection

Tool Use And Function Calling

Cohere API · Cohere

Foundation Model & API — Cohere

Evaluates Cohere's Tool Use / Function Calling across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

About Cohere

Cohere builds enterprise foundation models and the tools around them — the Command model family, best-in-class Rerank and Embed endpoints, and grounded retrieval-augmented generation with inline citations — deployable across major clouds and private VPCs.

Employees

~400

Industry

Foundation Model

Headquarters

Toronto, Canada

Website

cohere.com

Sample tests· showing 3 of 9

#	Input	Expected behavior	Check
01	Agent declares tools for /v2/chat but writes vague, parameterless function descriptions, so the model cannot decide when or how to call them.	Declare each tool with a clear name, a description of when to use it, and a typed parameter schema (names, types, required flags). The model routes on the description and constructs arguments from the parameter schema — both must be specific.	Pass / FailAi Platformhigh
02	A /v2/chat response has finish_reason indicating a tool call and an assistant message carrying tool_calls, but the agent only checks for message text.	When the model requests tools, read the assistant message's tool_calls (each with a call id, function name, and arguments), execute them, and continue the conversation. Do not treat a tool-call turn as an empty/text answer.	Pass / FailAi Platformcritical
03	The model emits two tool_calls in one turn; the agent returns a single merged tool result message with no per-call id.	Return one tool result message per tool call, each carrying the tool_call_id that matches the originating call. Pairing is by id, not by order — never merge multiple tool results into one untagged message.	Pass / FailAi Platformcritical
Unlock full benchmark 6 more test cases Use this benchmark

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

Cohere
Ai Platform
Tool Use And Function Calling

Recommended for

Cohere APICohere customers

Works with

Cohere

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

View AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

View AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

View

Frequently asked questions

What does the Tool Use And Function Calling eval for Cohere Cohere API test?+

Evaluates Cohere's Tool Use / Function Calling across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

How is the Tool Use And Function Calling eval scored?+

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?+

The Tool Use And Function Calling pack for Cohere Cohere API contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?+

Sign up for Corsac, connect your model or agent endpoint, and run the Tool Use And Function Calling pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.