
Memory And Knowledge

CrewAI

Multi-agent Framework — CrewAI

Evaluates CrewAI's Memory & Knowledge across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Multi-agent Framework eval coverage.

About CrewAI

CrewAI is a multi-agent orchestration framework for production agent workflows. It provides role-playing Agents, Tasks, Crews (sequential, hierarchical, or consensual processes), and Flows (declarative @start/@listen/@router state graphs). A commercial CrewAI Enterprise tier adds UI Studio, deployment, secrets/RBAC, observability, and an on-prem option.

Employees: ~50

Industry: Agent Framework

Headquarters: San Francisco, CA

Website: crewai.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check	Category	Severity
01	Operator wants short-term memory for a research crew and sets Crew(memory=True) without configuring an embedder.	Crew(memory=True) defaults to an embedder (typically OpenAI's text-embedding-3) and chromadb storage. Operator must either accept the default (and budget the embedding API key/cost) or pass embedder=... explicitly. Confirm the embedder in use at construction — silent default routing is a cost & dat…	Pass / Fail	AI Platform	high
02	Entity memory tracks named entities across the crew. Two tenants share the same process and same Crew instance.	Entity memory state is process-local by default — sharing a Crew instance across tenants is a cross-tenant data leak. Construct a fresh Crew per tenant (or per request) and partition long-term storage by tenant id in the storage path.	Pass / Fail	AI Platform	critical
03	Operator expects facts learned in run 1 to be available in run 2 of the same crew via short-term memory.	Short-term memory (RAG over recent interactions) is scoped per-run; long-term memory (SQLite) persists across runs. Choose the right layer for the durability you need. Cross-run recall requires long_term_memory and a stable storage path (default location may be process-local).	Pass / Fail	AI Platform	high
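
Test case 02 asks for long-term storage partitioned by tenant id in the storage path. A minimal sketch of that partitioning step follows, using only the Python standard library. The function name tenant_storage_dir, the "tenants" subdirectory, and the allowed-character rule are illustrative assumptions, not part of CrewAI. How the resulting path is handed to a Crew's long-term memory depends on the CrewAI version in use and is deliberately left out.

```python
import re
from pathlib import Path

# Hypothetical rule: tenant ids are short and limited to characters that
# cannot escape the base directory (no slashes, dots, or whitespace).
_SAFE_TENANT_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def tenant_storage_dir(base_dir: str, tenant_id: str) -> Path:
    """Return a per-tenant storage directory under base_dir.

    This only computes the path. Creating the directory and wiring it into
    a fresh Crew per tenant is left to the caller.
    """
    if not _SAFE_TENANT_ID.fullmatch(tenant_id):
        raise ValueError(f"unsafe tenant id: {tenant_id!r}")
    return Path(base_dir) / "tenants" / tenant_id
```

Rejecting ids such as "../other" up front keeps one tenant from resolving into another tenant's directory. Each tenant would then get its own Crew instance built against its own path, rather than a shared instance.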

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
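
The pass rule can be read in more than one way. The sketch below assumes one reading: a test case passes when the mean of its per-criterion scores is at least 4.0 and no individual criterion scores below 3. The function name case_passes and the dictionary input shape are illustrative assumptions, not a Corsac API.

```python
from statistics import mean


def case_passes(criterion_scores: dict[str, float],
                mean_threshold: float = 4.0,
                floor: float = 3.0) -> bool:
    """Apply the assumed pass rule to one test case.

    criterion_scores maps a rubric criterion name to the judge's score.
    """
    if not criterion_scores:
        raise ValueError("at least one criterion score is required")
    scores = list(criterion_scores.values())
    # Both conditions must hold: a high average cannot hide one weak criterion.
    return mean(scores) >= mean_threshold and min(scores) >= floor
```

Under this reading, scores of 5, 5, and 2 fail despite averaging 4.0, because one criterion falls below the floor.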

Rubric criteria

CrewAI
AI Platform
Memory And Knowledge

Recommended for

CrewAI customers

Works with

CrewAI

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.


Frequently asked questions

What does the Memory And Knowledge eval for CrewAI test?

Evaluates CrewAI's Memory & Knowledge across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Multi-agent Framework eval coverage.

How is the Memory And Knowledge eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Memory And Knowledge pack for CrewAI contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Memory And Knowledge pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
