CrewAI evals — Tools (built-in + custom) (relift v3 InfraRed)

About CrewAI

CrewAI is a multi-agent orchestration framework for production agent workflows. It provides role-playing Agents, Tasks, Crews (sequential, hierarchical, or consensual processes), and Flows (declarative @start/@listen/@router state graphs). The commercial CrewAI Enterprise tier adds UI Studio, deployment, secrets/RBAC, observability, and an on-prem option.

Employees: ~50
Industry: Agent Framework
Headquarters: San Francisco, CA
Website: crewai.com

Sample tests · showing 3 of 9

# | Input | Expected behavior | Check
01

Custom Tool subclass defines name='get_invoice', args_schema=InvoiceArgs(invoice_id: str, format: Literal['pdf','csv']). Agent calls with format='xml'.

CrewAI validates the agent's tool-call args against args_schema (Pydantic). On ValidationError, the framework surfaces a tool-feedback message back to the agent (not a raw stack trace). The agent corrects on the next iteration. Do not execute the tool body with malformed args.

Pass / Fail · AI Platform · critical
02

Tool body raises ToolException('upstream 503 — retry later'). Agent receives it.

CrewAI catches ToolException and feeds the message back to the agent as the tool result. The agent then decides whether to retry, switch tools, or report failure. Use ToolException to surface recoverable errors; reserve plain exceptions for framework-level bugs that should abort the run.

Pass / Fail · AI Platform · high
03

Two tools: SerperDevTool and BraveSearchTool. Their descriptions both say 'web search'. Agent picks arbitrarily.

Tool descriptions are the agent's routing surface. Disambiguate explicitly: e.g., SerperDev='Google search results via Serper API — best for general queries', Brave='Brave search results — preferred for privacy-sensitive queries with no result personalization'. Verify the agent's choice traces matc…

Pass / Fail · AI Platform · medium
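
The behavior expected in tests 01 and 02 can be sketched without the framework. The snippet below is a minimal illustration, not CrewAI's implementation: `call_tool` is a hypothetical wrapper, `ToolException` is defined locally as a stand-in rather than imported from any library, and the feedback wording is an assumption. It only shows the ordering the tests check for: validate first, never run the tool body on malformed args, and turn recoverable errors into text the agent can act on.

```python
from typing import Callable, Literal

from pydantic import BaseModel, ValidationError


class ToolException(Exception):
    """Hypothetical stand-in for a recoverable tool error (defined locally)."""


class InvoiceArgs(BaseModel):
    invoice_id: str
    format: Literal["pdf", "csv"]


def call_tool(body: Callable[[InvoiceArgs], str], raw_args: dict) -> str:
    """Validate raw_args, run the tool body, and return the text the agent would see."""
    try:
        args = InvoiceArgs(**raw_args)
    except ValidationError as exc:
        # The body is never reached with malformed args; name the bad fields instead.
        fields = ", ".join(str(err["loc"][0]) for err in exc.errors())
        return f"Tool input invalid (fields: {fields}). Fix the arguments and retry."
    try:
        return body(args)
    except ToolException as exc:
        # Recoverable error: hand the message back as the tool result.
        return f"Tool error: {exc}"
    # Any other exception propagates to the caller and aborts the run.


calls: list[str] = []  # records which invoice ids actually reached the tool body


def get_invoice(args: InvoiceArgs) -> str:
    calls.append(args.invoice_id)
    if args.invoice_id == "INV-503":
        raise ToolException("upstream 503 - retry later")
    return f"{args.invoice_id}.{args.format}"
```

Calling `call_tool(get_invoice, {"invoice_id": "INV-1", "format": "xml"})` returns a feedback string and leaves `calls` empty, which is the property test 01 grades.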

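For test 03, one way to catch overlapping tool descriptions before an agent ever sees them is a simple lint over the description strings. The sketch below is an assumption-laden illustration: `ambiguous_pairs`, the Jaccard word-overlap measure, and the 0.8 threshold are all hypothetical choices, not part of CrewAI. The tool names and description texts are taken from the test case.

```python
import re
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercased word tokens of a description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ambiguous_pairs(descriptions: dict[str, str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return tool-name pairs whose descriptions share most of their words."""
    flagged = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(sorted(descriptions.items()), 2):
        words_a, words_b = tokens(desc_a), tokens(desc_b)
        union = words_a | words_b
        # Jaccard similarity: shared words divided by all words used by either description.
        similarity = len(words_a & words_b) / len(union) if union else 1.0
        if similarity >= threshold:
            flagged.append((name_a, name_b))
    return flagged


vague = {"SerperDevTool": "web search", "BraveSearchTool": "web search"}
clear = {
    "SerperDevTool": "Google search results via Serper API, best for general queries",
    "BraveSearchTool": "Brave search results, preferred for privacy-sensitive queries "
                       "with no result personalization",
}
```

Here `ambiguous_pairs(vague)` flags the pair, while `ambiguous_pairs(clear)` returns an empty list. A word-overlap check only catches near-duplicate wording; it does not replace inspecting the agent's actual tool-choice traces.
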
How this eval is graded

Grade each response against expected.ideal_behavior and expected.rubric. A pass requires a mean criterion score >= 4.0 with no single criterion below 3.
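
One reading of this threshold rule, as a minimal sketch: the mean is taken across rubric criteria, and the floor applies to each criterion individually. The function name `passes` and the criterion keys in the usage note are hypothetical, and the score scale is not stated on this page, so the code assumes nothing about it beyond the two thresholds.

```python
from statistics import mean


def passes(scores: dict[str, float], min_mean: float = 4.0, floor: float = 3.0) -> bool:
    """True when the mean score meets min_mean and no criterion falls below floor."""
    if not scores:
        return False  # nothing graded, so nothing to pass
    values = list(scores.values())
    return mean(values) >= min_mean and min(values) >= floor
```

For example, scores of 5, 4, and 3 pass (mean 4.0, lowest 3), while 5, 5, and 2 fail on the floor even though the mean is still 4.0.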

Rubric criteria

  • CrewAI
  • AI Platform
  • Tools (built-in and custom)

Recommended for

CrewAI · CrewAI customers
