Tasks
CrewAI evals — Tasks (relift v3 InfraRed)
About CrewAI
CrewAI is a multi-agent orchestration framework for production agent workflows. Its building blocks are role-playing Agents, Tasks, Crews (sequential or hierarchical processes; the CrewAI docs describe a consensual process as planned rather than implemented), and Flows (declarative @start/@listen/@router state graphs). A commercial CrewAI Enterprise tier adds UI Studio, deployment, secrets/RBAC, observability, and an on-prem option.
Sample tests (showing 3 of 9)
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Operator constructs Task(description='research the competitor', agent=researcher) with no expected_output. | expected_output is required by CrewAI and shapes the agent's task-completion criterion. Set a concrete, observable expected_output (e.g., 'A markdown table of 5 competitors with name/funding/HQ columns'). A missing or vague expected_output leaves the agent guessing when to stop; verify the field is … | Pass / Fail · AI Platform · high |
| 02 | Operator sets both output_json={schema} and output_pydantic=ReportModel on the same Task, expecting belt-and-suspenders enforcement. | Use one or the other. output_pydantic is the richer contract (validators, computed fields). Setting both is ambiguous; pick output_pydantic if you have a model class. Verify CrewAI's documented precedence rule and don't rely on order of mutation. | Pass / Fail · AI Platform · medium |
| 03 | Operator sets Task(output_file=user_supplied_filename) where user_supplied_filename comes from request input; the value is '../../../etc/passwd'. | Treat the output_file path as untrusted when it contains user-controlled segments. Resolve it against an allowlist directory, reject '..' segments, and write within the sandbox root. CrewAI writes to the supplied path as-is, so sanitize before the Task is constructed. | Pass / Fail · AI Platform · critical |
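The sanitization that test 03 expects can be sketched with the standard library alone. This is a minimal illustration, not CrewAI code: `safe_output_path`, `UnsafeOutputPath`, and the `crew_outputs` sandbox directory are hypothetical names introduced here, and the checks (no absolute paths, no '..' segments, resolved path must stay under the root) are one reasonable reading of the expected behavior.

```python
from pathlib import Path


class UnsafeOutputPath(ValueError):
    """Raised when a requested output path would escape the sandbox root."""


def safe_output_path(sandbox_root: str, user_filename: str) -> Path:
    """Return an absolute path under sandbox_root, or raise UnsafeOutputPath.

    Hypothetical helper: call it before the Task is constructed and pass
    the result as output_file.
    """
    root = Path(sandbox_root).resolve()
    requested = Path(user_filename)

    # Reject absolute paths and any parent-directory segment outright.
    if requested.is_absolute() or ".." in requested.parts:
        raise UnsafeOutputPath(f"rejected path: {user_filename!r}")

    # Resolve symlinks and normalise, then confirm we are still under root.
    candidate = (root / requested).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise UnsafeOutputPath(f"path escapes sandbox: {user_filename!r}")

    # An empty filename resolves to the root directory itself; not a file.
    if candidate == root:
        raise UnsafeOutputPath("empty filename")

    return candidate


if __name__ == "__main__":
    print(safe_output_path("crew_outputs", "reports/competitors.md"))
    try:
        safe_output_path("crew_outputs", "../../../etc/passwd")
    except UnsafeOutputPath as exc:
        print("blocked:", exc)
```

The returned path would then be handed to the Task, for example as `output_file=str(path)`, so the untrusted string never reaches CrewAI directly.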
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
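The pass rule can be read as a small predicate. The sketch below assumes one interpretation of the sentence above: the mean is taken across the rubric criteria of a single test, and each criterion carries one numeric score. The function name `eval_passes` and the example scores are hypothetical, and the scoring scale is not stated in the source.

```python
from statistics import mean
from typing import Mapping


def eval_passes(
    scores: Mapping[str, float],
    mean_threshold: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Apply the grading rule under the assumed interpretation.

    Passes when the mean across criteria is >= mean_threshold and
    no single criterion scores below floor.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


if __name__ == "__main__":
    # Illustrative scores only; not taken from any real run.
    print(eval_passes({"CrewAI": 5, "AI Platform": 4, "Tasks": 3}))
    print(eval_passes({"CrewAI": 5, "AI Platform": 5, "Tasks": 2}))
```

Under this reading, a high mean cannot compensate for a single criterion that falls below the floor.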
Rubric criteria
- CrewAI
- AI Platform
- Tasks