Agents Roles And Goals

Multi-agent Framework — CrewAI

Evaluates CrewAI's Agents (Roles & Goals) across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Multi-agent Framework eval coverage.

About CrewAI

CrewAI is a multi-agent orchestration framework for production agent workflows. Its building blocks are role-playing Agents, Tasks, Crews (sequential or hierarchical processes, with a consensual process listed as planned), and Flows (state graphs declared with the @start, @listen, and @router decorators). A commercial CrewAI Enterprise tier adds UI Studio, deployment, secrets/RBAC, observability, and an on-prem option.

Employees

~50

Industry

Agent Framework

Headquarters

San Francisco, CA

Website

crewai.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check	Category	Severity
01	Operator defines Agent(role='Senior SQL Analyst', goal='write blog posts about cooking', backstory='30 years in marine biology'). The three fields are mutually incoherent and the crew kicks off.	role, goal, and backstory are concatenated into the agent's system prompt and must reinforce each other, so the operator should align them before kickoff. Detect at construction (lint role↔goal coherence) or fail loudly when the agent's task output drifts off-role. Do not silently accept the mismatch.	Pass / Fail	AI Platform	high
02	Agent is in a tool-call → tool-error → tool-call loop. max_iter is unset so the default applies.	Set max_iter to a finite operator-chosen value per agent. When the cap is hit, CrewAI terminates the agent's reasoning loop and surfaces a bounded result (or an error). Default value is [REQUIRES-VERIFICATION]; do not assume a specific number, and treat unset as 'set it explicitly to bound cost'.	Pass / Fail	AI Platform	critical
03	Agent A has allow_delegation=True and is asked a question outside its expertise. Crew has agents B and C with relevant roles.	When allow_delegation=True, CrewAI exposes Delegate Work / Ask Question tools to the agent so it can route to B or C by role name. Verify the delegated request shows up as a sub-task with the target agent's role bound, and that the parent task's output incorporates the delegate's answer.	Pass / Fail	AI Platform	high
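
Cases 01 and 02 both ask for problems to be caught before kickoff. A minimal sketch of such a pre-kickoff lint follows, assuming agent definitions are available as plain dicts. The function name lint_agent_spec, the dict layout, and the keyword-overlap heuristic are all hypothetical illustrations, not part of CrewAI; keyword overlap is only a crude stand-in for a real role↔goal coherence check, which would likely need a semantic comparison.

```python
import re

# Tiny stopword list so filler words do not count as shared keywords.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "about", "on", "with"}


def _keywords(text):
    """Return lowercase word tokens from text, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def lint_agent_spec(spec):
    """Return a list of problems found in one agent spec dict (hypothetical layout)."""
    problems = []

    # Case 02: never rely on the framework default; require an explicit cap.
    max_iter = spec.get("max_iter")
    if max_iter is None:
        problems.append("max_iter is unset: set it explicitly to bound cost")
    elif isinstance(max_iter, bool) or not isinstance(max_iter, int) or max_iter <= 0:
        problems.append("max_iter must be a positive integer")

    # Case 01: crude coherence proxy, flagging a role and goal with no shared keywords.
    if not _keywords(spec.get("role", "")) & _keywords(spec.get("goal", "")):
        problems.append("role and goal share no keywords: review coherence")

    return problems


# The incoherent agent from case 01, with max_iter left unset as in case 02.
incoherent = {
    "role": "Senior SQL Analyst",
    "goal": "write blog posts about cooking",
    "backstory": "30 years in marine biology",
}

# An aligned spec with an explicit, operator-chosen cap (8 is an arbitrary example).
aligned = {
    "role": "Senior SQL Analyst",
    "goal": "write and review SQL queries for reporting",
    "backstory": "10 years of analytics engineering",
    "max_iter": 8,
}

for name, spec in (("incoherent", incoherent), ("aligned", aligned)):
    print(name, lint_agent_spec(spec))
```

A check like this could run in CI before any crew is constructed, so a mismatch fails loudly instead of being silently accepted.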

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. A case passes when the mean score across rubric criteria is >= 4.0 and no single criterion scores below 3.
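
The pass rule can be sketched as a small function. This is an illustrative reading of the thresholds stated above, assuming the judge returns one numeric score per rubric criterion and that the mean is taken across those criteria for a single case. The function name case_passes and its parameters are hypothetical, not Corsac's API.

```python
from statistics import mean


def case_passes(criterion_scores, mean_threshold=4.0, floor=3):
    """Return True when the mean score meets the threshold and no score is below the floor.

    criterion_scores: one numeric judge score per rubric criterion (assumed shape).
    """
    if not criterion_scores:
        raise ValueError("at least one criterion score is required")
    return mean(criterion_scores) >= mean_threshold and min(criterion_scores) >= floor


print(case_passes([5, 4, 4]))  # high mean, every criterion at or above the floor
print(case_passes([5, 5, 2]))  # mean reaches 4.0 but one criterion is below 3
print(case_passes([4, 4, 3]))  # no criterion below 3 but mean is under 4.0
```

Requiring both conditions means a single weak criterion cannot be averaged away by strong scores elsewhere.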

Rubric criteria

CrewAI
AI Platform
Agents Roles And Goals

Recommended for

CrewAI customers

Works with

CrewAI

Frequently asked questions

What does the Agents Roles And Goals eval for CrewAI test?

Evaluates CrewAI's Agents (Roles & Goals) across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Multi-agent Framework eval coverage.

How is the Agents Roles And Goals eval scored?

The LLM judge grades each case against expected.ideal_behavior and expected.rubric. A case passes when the mean score across rubric criteria is >= 4.0 and no single criterion scores below 3.

How many test cases does this eval pack include?

The Agents Roles And Goals pack for CrewAI contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Agents Roles And Goals pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
