
Tools And Actions


Agent Tooling & Integrations — Composio

Evaluates Composio's Tools & Actions across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Agent Tooling & Integrations eval coverage.

About Composio

Composio is a tool-integration layer for AI agents — 250+ managed tool integrations (Gmail, GitHub, Slack, and more) with built-in OAuth/auth, per-end-user entities for multi-tenant isolation, triggers and webhooks, framework adapters (OpenAI, Anthropic, LangChain, LlamaIndex, CrewAI), custom tools and schema processors, and an MCP server that exposes tools to MCP clients.

Employees: ~40
Industry: Agent Tooling
Headquarters: San Francisco, CA
Website: composio.dev

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Agent needs only GitHub actions. It fetches the entire Composio tool catalog and dumps every action into the model's tools[] array, then filters in the prompt.	Fetch tools filtered server-side by app/toolkit (e.g. apps=['github']) so only GitHub actions are returned, keeping the model's tool schema small and on-task. Do not load the full 250+-integration catalog into one request's tools[].	Pass / Fail · AI Platform · high
02	Operator wants only 'send message' style actions across Slack and Gmail for a notification agent, not the full surface of those apps.	Use Composio's tag / use-case filtering to narrow actions to the relevant subset (e.g. messaging/send actions) rather than enabling every Slack and Gmail action. Scope the toolset to the task to reduce mis-selection.	Pass / Fail · AI Platform · medium
03	The GITHUB_CREATE_ISSUE action declares a required 'repo' and 'title' in its input schema. The model emits a tool call with title set but repo missing.	Validate the tool-call arguments against the action's declared input schema before calling execute. On a missing required field, return a tool result with an error describing the missing 'repo' so the model can correct on the next turn. Do not execute with an incomplete payload.	Pass / Fail · AI Platform · high
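
The expected behavior in case 03 (validate first, execute only on a complete payload, otherwise hand the error back to the model) can be sketched without any SDK. This is a minimal illustration, not Composio's implementation: the reduced schema, `missing_required`, `run_tool_call`, and the shape of the returned dict are all hypothetical names chosen for the example.

```python
from typing import Any, Callable

# Hypothetical, reduced input schema for GITHUB_CREATE_ISSUE. Only the two
# required fields named in case 03 are modeled here.
CREATE_ISSUE_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "repo": {"type": "string"},
        "title": {"type": "string"},
    },
    "required": ["repo", "title"],
}


def missing_required(schema: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Return the required field names that are absent or None in args."""
    return [name for name in schema.get("required", []) if args.get(name) is None]


def run_tool_call(
    schema: dict[str, Any],
    args: dict[str, Any],
    execute: Callable[[dict[str, Any]], Any],
) -> dict[str, Any]:
    """Validate args against the schema; call execute() only if complete."""
    missing = missing_required(schema, args)
    if missing:
        # This dict is what would be sent back to the model as the tool
        # result, so it can supply the missing field on the next turn.
        return {
            "successful": False,
            "error": "Missing required field(s): " + ", ".join(missing),
        }
    return {"successful": True, "data": execute(args)}
```

Calling `run_tool_call(CREATE_ISSUE_SCHEMA, {"title": "Bug"}, execute)` returns an error result naming `repo` and never invokes `execute`. A production check would more likely run a full JSON Schema validator over the action's declared schema; the required-field check here covers only the failure mode described in case 03.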

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
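
Read literally, the pass rule combines two conditions: the mean of the per-criterion scores must be at least 4.0, and no single criterion may score below 3. A minimal sketch of that check follows, assuming each criterion receives one numeric score from the judge. The function name, the criterion names, and the implied 1 to 5 scale are assumptions for illustration and are not stated on this page.

```python
from statistics import mean


def case_passes(
    scores: dict[str, float],
    min_mean: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Return True when the mean score is >= min_mean and no score is < floor."""
    if not scores:
        # No criteria were graded, so there is nothing to pass on.
        return False
    values = list(scores.values())
    return mean(values) >= min_mean and min(values) >= floor
```

The floor matters because a high mean can hide one failing criterion: scores of 5, 5, and 2 average 4.0 but fail, while 5, 4, and 3 also average 4.0 and pass.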

Rubric criteria


Recommended for

Composio customers

Works with

Composio


Frequently asked questions

What does the Tools And Actions eval for Composio test?

Evaluates Composio's Tools & Actions across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Agent Tooling & Integrations eval coverage.

How is the Tools And Actions eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Tools And Actions pack for Composio contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Tools And Actions pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
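
Gating a release on the pack comes down to turning per-case results into a process exit code. The sketch below assumes the results have already been fetched and arrive as JSON of the form `{"cases": [{"id": ..., "passed": ...}]}`. That payload shape and the `gate` function are hypothetical; this page does not document the REST API's endpoints or response format, so no API call is shown.

```python
import json
import sys


def gate(results_json: str) -> int:
    """Return 0 if every case passed, 1 otherwise (including an empty run)."""
    # Assumed payload shape: {"cases": [{"id": "01", "passed": true}, ...]}
    cases = json.loads(results_json)["cases"]
    failed = [case["id"] for case in cases if not case["passed"]]
    for case_id in failed:
        # Report each failing case on stderr so it shows up in the CI log.
        print(f"FAILED case {case_id}", file=sys.stderr)
    return 1 if failed or not cases else 0
```

A CI step would pass the fetched payload to `gate` and hand the return value to `sys.exit`, so that any failing case fails the job. Treating an empty run as a failure is a deliberate choice here: a pack that returned no cases should not be able to pass a release silently.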
