
Knowledge And Memory

Devin · Cognition

Autonomous Coding Agent — Cognition AI (Devin)

Evaluates Cognition's Knowledge & Memory across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Autonomous Coding Agent eval coverage.

About Cognition

Cognition builds Devin, an autonomous AI software engineer that plans, writes, debugs, and ships code in a sandboxed cloud environment with terminal, browser, and editor access, session continuity, and human-in-the-loop review.

Employees

~200

Industry

Autonomous Coding Agent

Headquarters

San Francisco, CA

Website

cognition.ai

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check	Category	Severity
01	Admin uploads an internal architecture note via the knowledge-notes API. The note should apply to one team, not the whole org.	Use the documented enterprise/org scope on the note; verify the response shows the chosen scope before treating it as 'team-only'. The /v3 knowledge-notes API supports org and enterprise scopes per docs; confirm the team-level granularity matches your enterprise contract [REQUIRES-VERIFICATION on …	Pass / Fail	Code Assistant	critical
02	Knowledge note says 'always use the v2 auth helper' but the repo's actual code has migrated to v3 helpers.	Live repo state is the source of truth for executable behavior. Treat knowledge notes as guidance subject to staleness: flag the conflict to the operator with the specific note id and last-edited timestamp so the note can be retired. Do not regress code from v3 → v2 to match a stale note.	Pass / Fail	Code Assistant	high
03	Operator asks Devin a question grounded in the codebase; DeepWiki surfaces a snippet from auth/jwt_handler.py.	Render citations with file path and line range; the cited text should be verbatim from the live commit Devin is grounded on. On code drift mid-session (operator pushes during the session), refresh the citation against the new HEAD before quoting.	Pass / Fail	Code Assistant	high
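
The citation behavior in test 03 can be illustrated with a small sketch. This is not Devin's or DeepWiki's implementation: the `Repo` dictionary, the `Citation` type, and the `is_verbatim` and `refresh` helpers are all hypothetical names, and the repository is modeled as an in-memory mapping from commit id to file contents so the example stays self-contained. It shows one way to check that a quote is verbatim at a given commit and to re-anchor it on a new HEAD after drift.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical in-memory stand-in for a repo: commit id -> {file path -> file text}.
Repo = Dict[str, Dict[str, str]]


@dataclass(frozen=True)
class Citation:
    commit: str  # commit the quote was taken from
    path: str    # file path, e.g. "auth/jwt_handler.py"
    start: int   # first cited line, 1-indexed, inclusive
    end: int     # last cited line, 1-indexed, inclusive
    text: str    # quoted text, expected to be verbatim


def is_verbatim(repo: Repo, c: Citation) -> bool:
    """Return True if c.text equals lines start..end of the file at c.commit."""
    source = repo.get(c.commit, {}).get(c.path)
    if source is None:
        return False
    lines = source.splitlines()
    if c.start < 1 or c.end < c.start or c.end > len(lines):
        return False
    return "\n".join(lines[c.start - 1 : c.end]) == c.text


def refresh(repo: Repo, c: Citation, head: str) -> Optional[Citation]:
    """Re-anchor a citation on `head`.

    Returns the citation with an updated commit and line range, or None when
    the quoted text no longer exists there and must be re-quoted from scratch.
    """
    if c.commit == head and is_verbatim(repo, c):
        return c
    source = repo.get(head, {}).get(c.path)
    if source is None:
        return None
    lines = source.splitlines()
    wanted = c.text.splitlines()
    span = len(wanted)
    if span == 0:
        return None
    # Scan for the first window of lines that matches the quote exactly.
    for i in range(len(lines) - span + 1):
        if lines[i : i + span] == wanted:
            return Citation(head, c.path, i + 1, i + span, c.text)
    return None
```

Returning `None` rather than the stale citation is a deliberate choice in this sketch: a quote that cannot be found on the new HEAD should be regenerated, not shown with outdated line numbers.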

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
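
As a minimal sketch of the pass rule above, the function below reads it as: a case passes when the mean of its per-criterion scores is at least 4.0 and no single criterion scores below 3. That reading, the `case_passes` name, the example criterion names, and the treatment of an empty score set as a failure are assumptions, not part of Corsac's documented behavior.

```python
from statistics import mean
from typing import Mapping

MEAN_THRESHOLD = 4.0  # from the rule: mean >= 4.0
FLOOR = 3             # from the rule: no criterion below 3


def case_passes(scores: Mapping[str, float]) -> bool:
    """Apply the mean-plus-floor rule to one case's per-criterion scores."""
    if not scores:
        # Assumption: a case with no graded criteria does not pass.
        return False
    values = list(scores.values())
    return mean(values) >= MEAN_THRESHOLD and min(values) >= FLOOR


# Hypothetical criterion names, for illustration only.
example = {"scope_handling": 5, "staleness_flagging": 4, "citation_accuracy": 3}
result = case_passes(example)
```

The floor matters independently of the mean: scores of 5, 5, and 2 average 4.0 but still fail, because one criterion falls below 3.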

Rubric criteria

Cognition
Code Assistant
Knowledge And Memory

Recommended for

Devin
Cognition customers

Works with

Cognition

Related evals

Code Assistant

Browserbase

Evaluates Browserbase's Captcha Handling across scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Browser infrastructure eval coverage.

Code Assistant

Browserbase

Evaluates Browserbase's Concurrency & Rate Limits across scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Browser infrastructure eval coverage.

Code Assistant

Browserbase

Evaluates Browserbase's Live Debugging & Session Inspector across scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Browser infrastructure eval coverage.


Frequently asked questions

What does the Knowledge And Memory eval for Cognition Devin test?

Evaluates Cognition's Knowledge & Memory across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Autonomous Coding Agent eval coverage.

How is the Knowledge And Memory eval scored?

An LLM judge grades each case against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Knowledge And Memory pack for Cognition Devin contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Knowledge And Memory pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.