CrewAI Enterprise and Deployment

Multi-agent Framework — CrewAI

Evaluates CrewAI's CrewAI Enterprise & Deployment across 10 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Multi-agent Framework eval coverage.

About CrewAI

CrewAI is a multi-agent orchestration framework for production agent workflows. It provides role-playing Agents, Tasks, Crews (sequential, hierarchical, or consensual processes), and Flows (declarative @start/@listen/@router state graphs). The commercial CrewAI Enterprise tier adds UI Studio, deployment, secrets and RBAC, observability, and an on-prem option.

Employees: ~50

Industry: Agent Framework

Headquarters: San Francisco, CA

Website: crewai.com

Sample tests · showing 3 of 10

#	Input	Expected behavior	Check	Category	Severity
01	Operator builds a Crew in code, then redeploys to CrewAI Enterprise UI Studio. The Studio reads from a connected git repo.	Deployment binds to a specific git ref (commit SHA preferred over branch). Verify the deployed crew runs the SHA you expect by inspecting the deployment metadata. Pin the ref for prod, because branch tracking drifts under concurrent merges.	Pass / Fail	AI Platform	high
02	Operator stores OPENAI_API_KEY and SERPER_API_KEY in CrewAI Enterprise's secrets manager.	Secrets are encrypted at rest, injected into the crew's environment at runtime, and never logged in run output. Verify by reading a run log and checking the secret value does not appear. Rotate via UI Studio and invalidate the old key at the provider in lockstep.	Pass / Fail	AI Platform	critical
03	Two teams share a CrewAI Enterprise org. Team A's crew references Team B's secret.	RBAC must scope secrets, deployments, and runs to a workspace (or finer). Team A's crew should not see Team B's secret. Verify the boundary by trying to deploy a crew that references a foreign secret; the attempt should fail at deploy time, not leak at runtime.	Pass / Fail	AI Platform	critical
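
The log check described in test 02 can be approximated with a plain substring scan. The sketch below is illustrative only: `find_leaked_secrets` is a hypothetical helper, the key values are fake placeholders, and it assumes you already have the run log as text. It catches only verbatim occurrences, so encoded or truncated secrets would slip past it.

```python
def find_leaked_secrets(log_text: str, secrets: dict[str, str]) -> list[str]:
    """Return the names of secrets whose values appear verbatim in log_text.

    Hypothetical helper: `secrets` maps a secret name to its value.
    Empty values are skipped because "" is a substring of every string.
    """
    return sorted(
        name for name, value in secrets.items()
        if value and value in log_text
    )


# Fake placeholder values, not real keys.
secrets = {"OPENAI_API_KEY": "sk-test-123", "SERPER_API_KEY": "serper-abc"}
run_log = "tool call started\nAuthorization: Bearer sk-test-123\ntool call done"

leaked = find_leaked_secrets(run_log, secrets)
print(leaked)
```

An empty result is consistent with the expected behavior but does not prove it; a non-empty result is a clear failure for the case.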

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
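
One reading of this rule is that a test case passes when the mean of its criterion scores is at least 4.0 and no single criterion scores below 3. The sketch below encodes that reading; `case_passes` and the criterion names are hypothetical, and the page does not state the scoring scale or how the judge aggregates scores.

```python
from statistics import mean

MEAN_THRESHOLD = 4.0  # "mean >= 4.0" from the rubric text
SCORE_FLOOR = 3       # "no criterion below 3" from the rubric text


def case_passes(criterion_scores: dict[str, float]) -> bool:
    """Apply the pass rule to one test case's criterion scores.

    Assumption: a case with no scored criteria is treated as a failure.
    """
    if not criterion_scores:
        return False
    scores = list(criterion_scores.values())
    return mean(scores) >= MEAN_THRESHOLD and min(scores) >= SCORE_FLOOR


# Hypothetical criterion names, for illustration only.
print(case_passes({"pins_git_ref": 5, "verifies_sha": 4, "explains_drift": 3}))
print(case_passes({"pins_git_ref": 5, "verifies_sha": 5, "explains_drift": 2}))
```

The second call shows why the floor matters: the mean is still 4.0, but the single score of 2 fails the case.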

Rubric criteria

CrewAI
AI Platform
CrewAI Enterprise and Deployment

Recommended for

CrewAI customers

Works with

CrewAI

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the CrewAI Enterprise and Deployment eval for CrewAI test?

How is the CrewAI Enterprise and Deployment eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The CrewAI Enterprise and Deployment pack for CrewAI contains 10 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the CrewAI Enterprise and Deployment pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
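
A CI gate on per-case results might look like the sketch below. It does not call the Corsac REST API, whose endpoints and response format are not described on this page. Instead it assumes the results have already been fetched as JSON in a hypothetical shape, a list of objects with `case_id` and `passed` fields, and turns them into a process exit code.

```python
import json


def gate_exit_code(results_json: str) -> int:
    """Return 0 if every case passed, 1 otherwise.

    Assumed (hypothetical) payload shape:
    [{"case_id": "01", "passed": true}, ...]
    An empty result list is treated as a failure so a broken fetch
    cannot silently pass the gate.
    """
    results = json.loads(results_json)
    if not results:
        print("FAILED: no results returned")
        return 1
    failed = [r["case_id"] for r in results if not r["passed"]]
    for case_id in failed:
        print(f"FAILED: case {case_id}")
    return 1 if failed else 0


sample = '[{"case_id": "01", "passed": true}, {"case_id": "02", "passed": false}]'
exit_code = gate_exit_code(sample)
```

In a pipeline step, passing the returned value to `sys.exit` would block the release whenever any case fails.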
