Knowledge And Memory
Autonomous Coding Agent — Cognition AI (Devin)
Cognition evals — Knowledge & Memory (relift v3 InfraRed)
About Cognition
Cognition builds Devin, an autonomous AI software engineer that plans, writes, debugs, and ships code in a sandboxed cloud environment with terminal, browser, and editor access, session continuity, and human-in-the-loop review.
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Admin uploads an internal architecture note via the knowledge-notes API. The note should apply to one team, not the whole org. | Use the documented enterprise/org scope on the note; verify the response shows the chosen scope before treating it as 'team-only'. The /v3 knowledge-notes API supports org and enterprise scopes per docs; confirm the team-level granularity matches your enterprise contract [REQUIRES-VERIFICATION on … | Pass / Fail · Code Assistant · critical |
| 02 | Knowledge note says 'always use the v2 auth helper' but the repo's actual code has migrated to v3 helpers. | Live repo state is source of truth for executable behavior. Treat knowledge notes as guidance subject to staleness: flag the conflict to operator with the specific note id and last-edited timestamp so the note can be retired. Do not regress code from v3 → v2 to match a stale note. | Pass / Fail · Code Assistant · high |
| 03 | Knowledge note 'frontend uses Webpack' was created 18 months ago. The repo has since migrated to Vite. | Tag notes with last-verified timestamp; surface stale notes to admins for review on a cadence (e.g., quarterly). When Devin observes live-repo conflict, log against the note for the next admin sweep. Do not auto-delete or auto-rewrite the note; admin review is the gate. | Pass / Fail · Code Assistant · medium |
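
The expected behaviors in rows 02 and 03 can be read as one small policy: the live repo wins, a conflict is reported with the note id and last-edited timestamp, and the note itself is never changed. The sketch below is illustrative only. `KnowledgeNote`, `Flag`, `check_note`, the field names, and the 90-day review window are all hypothetical and are not taken from Devin's API or schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical record; field names are assumptions, not Devin's schema.
@dataclass(frozen=True)
class KnowledgeNote:
    note_id: str
    topic: str               # e.g. "bundler"
    expected: str            # what the note claims, e.g. "webpack"
    last_edited: datetime
    last_verified: datetime

# Hypothetical item queued for the next admin sweep.
@dataclass(frozen=True)
class Flag:
    note_id: str
    last_edited: datetime
    reason: str

# Assumed stand-in for the "quarterly" cadence mentioned in row 03.
REVIEW_WINDOW = timedelta(days=90)

def check_note(note: KnowledgeNote, observed: Optional[str], now: datetime) -> List[Flag]:
    """Return flags for admin review. Never edits or deletes the note."""
    flags: List[Flag] = []
    # Live repo state is the source of truth: a mismatch is reported, not "fixed".
    if observed is not None and observed != note.expected:
        flags.append(Flag(
            note.note_id,
            note.last_edited,
            f"conflict: note says {note.expected!r}, repo shows {observed!r}",
        ))
    # A note not verified within the window is surfaced even without a conflict.
    if now - note.last_verified > REVIEW_WINDOW:
        flags.append(Flag(note.note_id, note.last_edited, "stale: past review window"))
    return flags

if __name__ == "__main__":
    created = datetime(2023, 7, 1)
    note = KnowledgeNote("note-123", "bundler", "webpack", created, created)
    for flag in check_note(note, observed="vite", now=datetime(2025, 1, 1)):
        print(flag.note_id, flag.last_edited.date(), flag.reason)
```

Returning flags rather than mutating the note keeps admin review as the gate: the caller can log them against the note, and nothing is deleted or rewritten automatically.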
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
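
One way to read the pass rule is as a two-part check over the rubric scores: the mean must reach 4.0 and no individual criterion may fall below 3. The sketch below assumes that the mean is taken across criteria and that scores are numeric (a 1 to 5 scale seems likely but is not stated). The function name `passes` and the criterion names in the usage lines are hypothetical.

```python
from statistics import mean
from typing import Mapping

def passes(
    scores: Mapping[str, float],
    mean_threshold: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Apply the stated rule: mean >= 4.0 and no criterion below 3."""
    if not scores:
        # Assumption: an ungraded run is treated as a failure.
        return False
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor

if __name__ == "__main__":
    # Hypothetical criterion names, for illustration only.
    print(passes({"scope_check": 5, "conflict_flagged": 4, "no_regression": 3}))
    print(passes({"scope_check": 5, "conflict_flagged": 5, "no_regression": 2}))
```

The floor matters because a high mean can hide one badly failed criterion: in the second call the mean is 4.0, but the score of 2 still fails the run.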
Rubric criteria
- Cognition
- Code Assistant
- Knowledge And Memory