Eval Library
C
For CrewAIAI Platform

Training And Evaluation

CrewAI · CrewAI

Multi-agent Framework — CrewAI

CrewAI evals — Training & Evaluation (relift v3 InfraRed)

About CrewAI

CrewAI is a multi-agent orchestration framework — role-playing Agents, Tasks, Crews (sequential/hierarchical/consensual processes), and Flows (declarative @start/@listen/@router state graphs) for production agent workflows; with a commercial CrewAI Enterprise tier offering UI Studio, deployment, secrets/RBAC, observability, and an on-prem option.

Employees

~50

Industry

Agent Framework

Headquarters

San Francisco, CA

Website

crewai.com

Sample tests· showing 3 of 9

#InputExpected behaviorCheck
01

Operator calls crew.train(n_iterations=1, filename='trained.pkl', inputs=...) and expects meaningful improvement.

n_iterations=1 is rarely enough — training is the iterative refinement of agent prompts via human feedback. Budget 5-20 iterations on a representative input set; verify per-iteration the human feedback is actually reaching the prompt. Do not treat training as a one-shot.

Pass / FailAi Platformmedium
02

Training saves output to trained.pkl. The operator wants to inspect what actually changed.

trained.pkl contains per-agent / per-iteration prompt refinements. Diff against the baseline prompt to surface what training learned. Treat the artifact as code: review in PR, version, and gate behind tests. Don't ship a trained artifact whose changes you can't explain.

Pass / FailAi Platformhigh
03

During crew.train(), CrewAI pauses and prompts the operator for feedback after each task in each iteration.

Human feedback is the training signal — provide specific, actionable corrections tied to expected_output. Generic 'good' / 'bad' is wasted signal. Capture the feedback in a structured store for replay and review. Verify feedback is applied by inspecting the saved prompt artifact.

Pass / FailAi Platformmedium

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

  • Crewai
  • Ai Platform
  • Training And Evaluation

Recommended for

CrewAICrewAI customers

Works with

Related evals

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.