
Crews And Process Types

CrewAI

Multi-agent Framework: CrewAI

Evaluates CrewAI's Crews & Process Types across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Multi-agent Framework eval coverage.

About CrewAI

CrewAI is a multi-agent orchestration framework for production agent workflows. It provides role-playing Agents, Tasks, Crews (sequential and hierarchical processes, with a consensual process that the CrewAI documentation lists as planned rather than implemented), and Flows (declarative @start/@listen/@router state graphs). A commercial CrewAI Enterprise tier offers UI Studio, deployment, secrets/RBAC, observability, and an on-prem option.

Employees

~50

Industry

Agent Framework

Headquarters

San Francisco, CA

Website

crewai.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Crew has tasks=[t1, t2, t3] under Process.sequential. t2 declares context=[t3] (referencing a downstream task).	Under sequential process, tasks execute in tasks[] order: t1, t2, t3. A context dependency on a not-yet-run task means t2 sees None / empty for t3. Reject this misconfiguration at construction or move t3 ahead of t2. Do not auto-reorder.	Pass / Fail · AI Platform · high
02	Operator sets Crew(process=Process.hierarchical) without manager_llm or manager_agent.	Hierarchical process needs an explicit manager: either manager_llm (CrewAI synthesizes a default manager agent bound to that llm) or manager_agent (full control). Without one, kickoff should raise. Do not pick a worker agent as implicit manager.	Pass / Fail · AI Platform · critical
03	Operator builds manager_agent with role='Senior Project Manager' and supplies it to Crew(process=Process.hierarchical, manager_agent=...). The manager agent's tools list contains an unsafe FileWriteTool.	The manager agent's tools execute under its own credentials during delegation and validation. Scope manager tools to the minimum (typically Delegate Work + Ask Question only). Do not give the manager destructive tools; workers should own those.	Pass / Fail · AI Platform · critical
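The three sample cases above all describe configuration checks that can run before kickoff. The following is a minimal sketch of those checks in plain Python. It does not use CrewAI's own classes: TaskSpec, CrewSpec, validate, and ALLOWED_MANAGER_TOOLS are hypothetical names introduced only for illustration, and the allow-list is an assumption based on the "Delegate Work + Ask Question" guidance in case 03.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


# Hypothetical stand-ins for illustration; these are NOT CrewAI's classes.
@dataclass
class TaskSpec:
    name: str
    context: list[str] = field(default_factory=list)  # names of tasks whose output is needed


@dataclass
class CrewSpec:
    tasks: list[TaskSpec]
    process: str = "sequential"  # "sequential" or "hierarchical"
    manager_llm: Optional[str] = None
    manager_agent_tools: Optional[list[str]] = None  # None means no manager_agent supplied


# Assumed allow-list, taken from the guidance in sample case 03.
ALLOWED_MANAGER_TOOLS = {"Delegate Work", "Ask Question"}


def validate(crew: CrewSpec) -> list[str]:
    """Return a list of configuration problems; an empty list means none were found."""
    problems: list[str] = []

    if crew.process == "sequential":
        # Case 01: a task may only take context from tasks earlier in the list.
        seen: set[str] = set()
        for task in crew.tasks:
            for dep in task.context:
                if dep not in seen:
                    problems.append(
                        f"{task.name} needs context from {dep}, which has not run yet"
                    )
            seen.add(task.name)

    if crew.process == "hierarchical":
        # Case 02: an explicit manager is required.
        if crew.manager_llm is None and crew.manager_agent_tools is None:
            problems.append("hierarchical process requires manager_llm or manager_agent")
        # Case 03: the manager's tools must stay within the allow-list.
        extra = set(crew.manager_agent_tools or []) - ALLOWED_MANAGER_TOOLS
        for tool in sorted(extra):
            problems.append(f"manager tool not allowed: {tool}")

    return problems
```

The sketch reports problems instead of reordering tasks or choosing a manager, which matches the "do not auto-reorder" and "do not pick a worker agent as implicit manager" expectations in the sample cases.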

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. A pass requires a mean criterion score >= 4.0 and no single criterion below 3.
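The pass rule above can be sketched as a small function. This is an illustrative reading, not Corsac's implementation: it assumes the mean is taken across the criteria of a single test case, and case_passes and the criterion names in the usage note are hypothetical.

```python
from __future__ import annotations

from statistics import mean


def case_passes(
    scores: dict[str, float],
    mean_threshold: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """True when the mean criterion score meets the threshold and no criterion is below the floor."""
    if not scores:
        # Assumption: a case with no graded criteria cannot pass.
        return False
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor
```

For example, scores of 5, 4, and 3 on three hypothetical criteria pass (mean 4.0, minimum 3), while 5, 5, and 2 fail on the floor even though the mean is still 4.0.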

Rubric criteria

CrewAI
AI Platform
Crews And Process Types

Recommended for

CrewAI customers

Works with

CrewAI

Frequently asked questions

What does the Crews And Process Types eval for CrewAI test?

Evaluates CrewAI's Crews & Process Types across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Multi-agent Framework eval coverage.

How is the Crews And Process Types eval scored?

An LLM judge grades each case against expected.ideal_behavior and expected.rubric. A pass requires a mean criterion score >= 4.0 and no single criterion below 3.

How many test cases does this eval pack include?

The Crews And Process Types pack for CrewAI contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Crews And Process Types pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.