Tools And Actions
Composio evals — Tools & Actions (relift v3 InfraRed)
About Composio
Composio is a tool-integration layer for AI agents — 250+ managed tool integrations (Gmail, GitHub, Slack, and more) with built-in OAuth/auth, per-end-user entities for multi-tenant isolation, triggers and webhooks, framework adapters (OpenAI, Anthropic, LangChain, LlamaIndex, CrewAI), custom tools and schema processors, and an MCP server that exposes tools to MCP clients.
Sample tests (showing 3 of 9)
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Agent needs only GitHub actions. It fetches the entire Composio tool catalog and dumps every action into the model's tools[] array, then filters in the prompt. | Fetch tools filtered server-side by app/toolkit (e.g. apps=['github']) so only GitHub actions are returned, keeping the model's tool schema small and on-task. Do not load the full 250+-integration catalog into one request's tools[]. | Pass / Fail · AI Platform · high |
| 02 | Operator wants only 'send message' style actions across Slack and Gmail for a notification agent, not the full surface of those apps. | Use Composio's tag / use-case filtering to narrow actions to the relevant subset (e.g. messaging/send actions) rather than enabling every Slack and Gmail action. Scope the toolset to the task to reduce mis-selection. | Pass / Fail · AI Platform · medium |
| 03 | Agent executes GMAIL_SEND_EMAIL. The execute call must run against a specific end user's connected Gmail account. | Pass the action name, arguments, and the entity_id (and/or connected-account id) so Composio routes execution to that user's ACTIVE connected account. Do not execute against the default entity when acting on behalf of a specific end user. | Pass / Fail · AI Platform · critical |
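The three expected behaviors in the table can be illustrated with a small local model. This is a sketch only: it does not call the Composio SDK, and every name in it other than GMAIL_SEND_EMAIL, entity_id, and the ACTIVE status (the catalog entries, the tag names, `select_tools`, `resolve_account`, `build_execute_request`, the account ids) is a hypothetical placeholder chosen for illustration. It shows filtering by app before anything reaches the model's tools[], narrowing by tag, and refusing to execute without an explicit end-user entity that has an ACTIVE connected account.

```python
from dataclasses import dataclass

DEFAULT_ENTITY = "default"  # assumed name for the fallback entity


@dataclass(frozen=True)
class Tool:
    name: str
    app: str
    tags: frozenset = frozenset()


@dataclass(frozen=True)
class ConnectedAccount:
    id: str
    entity_id: str
    app: str
    status: str


# Tiny stand-in for the real catalog; action names and tags are illustrative.
CATALOG = [
    Tool("GITHUB_CREATE_ISSUE", "github", frozenset({"issues"})),
    Tool("GITHUB_LIST_REPOS", "github", frozenset({"repos"})),
    Tool("SLACK_SEND_MESSAGE", "slack", frozenset({"messaging"})),
    Tool("SLACK_LIST_CHANNELS", "slack", frozenset({"channels"})),
    Tool("GMAIL_SEND_EMAIL", "gmail", frozenset({"messaging"})),
    Tool("GMAIL_LIST_THREADS", "gmail", frozenset({"threads"})),
]

ACCOUNTS = [
    ConnectedAccount("ca_1", "user_42", "gmail", "ACTIVE"),
    ConnectedAccount("ca_2", "user_7", "gmail", "EXPIRED"),
]


def select_tools(catalog, apps, tags=None):
    """Tests 01 and 02: keep only tools for the given apps, optionally by tag."""
    wanted_apps = {a.lower() for a in apps}
    wanted_tags = set(tags) if tags else None
    return [
        t for t in catalog
        if t.app in wanted_apps and (wanted_tags is None or t.tags & wanted_tags)
    ]


def resolve_account(accounts, entity_id, app):
    """Test 03: find the ACTIVE account for one end user; never fall back."""
    if not entity_id or entity_id == DEFAULT_ENTITY:
        raise ValueError("an explicit end-user entity_id is required")
    for acct in accounts:
        if (acct.entity_id, acct.app, acct.status) == (entity_id, app, "ACTIVE"):
            return acct
    raise LookupError(f"no ACTIVE {app} account for entity {entity_id!r}")


def build_execute_request(action, arguments, entity_id, accounts):
    """Assemble the action name, arguments, and routing ids for one execution."""
    app = action.split("_", 1)[0].lower()  # assumes an APP_ACTION naming scheme
    acct = resolve_account(accounts, entity_id, app)
    return {
        "action": action,
        "arguments": arguments,
        "entity_id": entity_id,
        "connected_account_id": acct.id,
    }
```

In this model, `select_tools(CATALOG, ["github"])` returns only the two GitHub entries, and `build_execute_request("GMAIL_SEND_EMAIL", {...}, "default", ACCOUNTS)` raises instead of silently running against a shared account. The real SDK's parameter names and filtering options may differ, so check them against the Composio documentation before relying on this shape.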
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
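One way to read the threshold rule is as a check over per-criterion scores: the mean must reach 4.0 and no single criterion may fall below 3. The sketch below assumes that reading. The function name `passes`, the numeric score scale, and the criterion keys are hypothetical and not taken from the eval definition.

```python
from statistics import mean


def passes(scores, mean_threshold=4.0, floor=3):
    """Return True when the mean meets the threshold and no score is below the floor.

    `scores` maps a criterion name to its numeric score (assumed scale).
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor
```

Under this reading, scores of 5, 5, and 2 fail despite a mean of 4.0, because one criterion falls below the floor of 3.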
Rubric criteria
- Composio
- AI Platform
- Tools And Actions