Eval directory
Evals for CrewAI
8 evaluation packs covering adversarial robustness, safety gates, workflow quality, and operator-level checks for CrewAI AI products.
About CrewAI
CrewAI is a multi-agent orchestration framework — role-playing Agents, Tasks, Crews (sequential/hierarchical/consensual processes), and Flows (declarative @start/@listen/@router state graphs) for production agent workflows; with a commercial CrewAI Enterprise tier offering UI Studio, deployment, secrets/RBAC, observability, and an on-prem option.
Available eval packs for CrewAI
8 packs ready to run.
Agents Roles And Goals
CrewAI evals — Agents (Roles & Goals) (relift v3 InfraRed)
Crewai Enterprise And Deployment
CrewAI evals — CrewAI Enterprise & Deployment (relift v3 InfraRed)
Crews And Process Types
CrewAI evals — Crews & Process Types (relift v3 InfraRed)
Flows
CrewAI evals — Flows (relift v3 InfraRed)
Memory And Knowledge
Knowledge RetentionCrewAI evals — Memory & Knowledge (relift v3 InfraRed)
Tasks
CrewAI evals — Tasks (relift v3 InfraRed)
Tools Builtin And Custom
CrewAI evals — Tools (built-in + custom) (relift v3 InfraRed)
Training And Evaluation
CrewAI evals — Training & Evaluation (relift v3 InfraRed)
Why eval CrewAI AI
CrewAI's AI features ship behind brand promises about accuracy, safety, and reliability. Buyers and integrators need to know those promises hold up under adversarial prompts, edge-case workflows, and the long tail of real customer inputs — not just the demo path.
The Corsac eval library for CrewAI measures four dimensions teams care about most when deploying ai platform agents:
- Adversarial robustness — does the agent resist prompt injection, jailbreaks, and social-engineering attempts?
- Workflow quality— does it complete the task buyers were shown in the demo, on inputs that don't look like the demo?
- Safety gates — does it escalate or refuse when it should, and only then?
- Operator quality — does it preserve analyst trust by surfacing the right context at the right time?
Every eval pack above is hand-authored against CrewAI's public product surface and runnable in Corsac with your own data.