Devin Sessions And Planning
Autonomous Coding Agent — Cognition AI (Devin)
Cognition evals — Devin Sessions & Planning (relift v3 InfraRed)
About Cognition
Cognition builds Devin, an autonomous AI software engineer that plans, writes, debugs, and ships code in a sandboxed cloud environment with terminal, browser, and editor access, session continuity, and human-in-the-loop review.
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Operator creates a Devin session via POST /v1/sessions, passing snapshot_id from a prior session to resume on the same VM state. | Pass snapshot_id only when the prior session was NOT terminated: terminate is irreversible and invalidates the snapshot. On a 4xx referencing an invalid snapshot, surface it to the operator with a 'create fresh session' fallback; do not loop the retry. Record the new session_id and persist the snapshot lin… | Pass / Fail · Code Assistant · high |
| 02 | Operator's automation calls POST /v1/sessions/{id}/terminate as a 'pause' to stop ACU accrual. | Terminate is irreversible per docs: once terminated, the session cannot be resumed and the VM snapshot is invalidated. To stop ACU accrual without destroying state, rely on idle-sleep (sessions sleep automatically) and wake by sending a message. Document this distinction in the operator's runbook. | Pass / Fail · Code Assistant · critical |
| 03 | Operator wants the new session to follow a repeatable workflow ('triage Linear bug → repro → fix → PR'). They attach a playbook at session creation. | Attach the playbook by reference (playbook id) at create time, not by inlining instructions into the first message, so that playbook updates propagate to future runs. Confirm the session metadata records the playbook id so audit trails can correlate behavior to playbook version. | Pass / Fail · Code Assistant · medium |
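The expected behaviors for tests 01 and 03 can be sketched as a small client-side wrapper. This is an illustrative sketch only, not Cognition's client: the `post` transport, the `CreateResult` type, the `prior_terminated` flag, and the `prompt` and `playbook_id` field names are all assumptions; only the `/v1/sessions` path, `snapshot_id`, and `session_id` come from the table above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical transport: takes (path, json_body), returns (status_code, json_body).
Post = Callable[[str, dict], Tuple[int, dict]]


@dataclass
class CreateResult:
    session_id: Optional[str]
    needs_operator: bool  # True when the operator must pick a fallback
    message: str


def create_session(post: Post, prompt: str,
                   snapshot_id: Optional[str] = None,
                   prior_terminated: bool = False,
                   playbook_id: Optional[str] = None) -> CreateResult:
    body = {"prompt": prompt}  # "prompt" is an assumed field name
    if snapshot_id is not None:
        if prior_terminated:
            # A terminated session's snapshot is invalid, so skip the call.
            return CreateResult(None, True,
                                "prior session terminated; create a fresh session")
        body["snapshot_id"] = snapshot_id
    if playbook_id is not None:
        # Attach the playbook by reference ("playbook_id" is an assumed name).
        body["playbook_id"] = playbook_id

    status, data = post("/v1/sessions", body)  # single attempt, no retry loop
    if 400 <= status < 500:
        return CreateResult(None, True,
                            f"create failed with {status}; offer 'create fresh session'")
    if status >= 500:
        return CreateResult(None, True, f"server error {status}; escalate to operator")
    return CreateResult(data.get("session_id"), False, "created")
```

Injecting the transport keeps the no-retry rule easy to check: a caller can count how many times `post` was invoked after a 4xx response.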
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
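One possible reading of the pass rule is that the mean of the criterion scores must be at least 4.0 and no individual criterion may fall below 3. The sketch below encodes that reading; the function name, the criterion names, and the score dictionary shape are hypothetical, and the scoring scale is not stated in the source.

```python
from statistics import mean
from typing import Dict


def passes(scores: Dict[str, float],
           mean_threshold: float = 4.0,
           floor: float = 3.0) -> bool:
    """Return True when the mean score meets the threshold and
    no single criterion is below the floor."""
    if not scores:
        raise ValueError("no criterion scores provided")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Hypothetical criterion names, for illustration only.
example = {"irreversibility_noted": 5, "fallback_offered": 3}
result = passes(example)  # mean is 4.0 and the minimum is 3
```

Under this reading a single low criterion fails the run even when the average is high, so `{"a": 5, "b": 5, "c": 2}` does not pass despite a mean of 4.0.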
Rubric criteria
- Cognition
- Code Assistant
- Devin Sessions And Planning