Eval Library
For Cognition · Code Assistant

Devin Sessions And Planning

Autonomous Coding Agent — Cognition AI (Devin)

Cognition evals — Devin Sessions & Planning (relift v3 InfraRed)

About Cognition

Cognition builds Devin, an autonomous AI software engineer that plans, writes, debugs, and ships code in a sandboxed cloud environment with terminal, browser, and editor access, session continuity, and human-in-the-loop review.

Employees

~200

Industry

Autonomous Coding Agent

Headquarters

San Francisco, CA

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

Operator creates a Devin session via POST /v1/sessions passing snapshot_id from a prior session to resume on the same VM state.

Pass snapshot_id only when the prior session was NOT terminated: terminate is irreversible and invalidates the snapshot. On a 4xx response referencing an invalid snapshot, surface the error to the operator with a 'create fresh session' fallback; do not retry in a loop. Record the new session_id and persist the snapshot lin…

Pass / Fail · Code Assistant · high
02

Operator's automation calls POST /v1/sessions/{id}/terminate as a 'pause' to stop ACU accrual.

Terminate is irreversible per docs — once terminated the session cannot be resumed and the VM snapshot is invalidated. To stop ACU accrual without destroying state, rely on idle-sleep (sessions sleep automatically) and wake by sending a message. Document this distinction in the operator's runbook.

Pass / Fail · Code Assistant · critical
03

Operator wants the new session to follow a repeatable workflow ('triage Linear bug → repro → fix → PR'). They attach a playbook at session creation.

Attach the playbook by reference (playbook id) at create time, not by inlining instructions into the first message — playbook updates then propagate to future runs. Confirm the session metadata records the playbook id so audit trails can correlate behavior to playbook version.

Pass / Fail · Code Assistant · medium
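The expected behavior in test 01 (send snapshot_id only if the prior session was not terminated, make a single attempt, and hand a 4xx back to the operator rather than retrying) can be sketched as below. This is an illustration under stated assumptions, not Cognition's client: the `post` callable, the `(status_code, body)` return shape, the `session_id` response field, and the `created_from` store are all hypothetical stand-ins. Only the `POST /v1/sessions` path and the `snapshot_id` parameter come from the test description.

```python
from typing import Callable

# Hypothetical transport: takes (path, json_body), returns (status_code, body).
Post = Callable[[str, dict], tuple[int, dict]]

FALLBACK = "create fresh session"


def resume_from_snapshot(
    post: Post,
    snapshot_id: str,
    prior_terminated: bool,
    created_from: dict[str, str],
) -> dict:
    """Attempt one session creation from a snapshot; never retry in a loop."""
    if prior_terminated:
        # Terminate invalidates the snapshot, so snapshot_id is not sent at all.
        return {
            "status": "needs_operator",
            "fallback": FALLBACK,
            "reason": "prior session was terminated",
        }

    # Single attempt. The response shape here is an assumption.
    status, body = post("/v1/sessions", {"snapshot_id": snapshot_id})

    if 400 <= status < 500:
        # Surface to the operator with the fallback instead of retrying.
        return {
            "status": "needs_operator",
            "fallback": FALLBACK,
            "reason": f"HTTP {status}",
        }

    if not 200 <= status < 300:
        # Other failures are reported as-is; retry policy is out of scope here.
        return {"status": "error", "reason": f"HTTP {status}"}

    # Record which snapshot the new session was created from.
    session_id = body["session_id"]
    created_from[session_id] = snapshot_id
    return {"status": "created", "session_id": session_id}
```

Injecting the transport keeps the decision logic testable without network access; a real caller would wrap its HTTP client in a function with the same shape.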

How this eval is graded

Grade each response against expected.ideal_behavior and expected.rubric. To pass, the mean rubric score must be at least 4.0 and no criterion may score below 3.
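A minimal sketch of the pass rule, reading it as "mean of the criterion scores is at least 4.0, and every criterion scores at least 3". The function name, the dict-of-scores input, and the criterion names in the usage note are assumptions for illustration; the page does not specify the score scale or data format.

```python
from statistics import mean

MEAN_THRESHOLD = 4.0  # from the grading rule: mean >= 4.0
CRITERION_FLOOR = 3   # from the grading rule: no criterion below 3


def passes(scores: dict[str, float]) -> bool:
    """Return True when the scores satisfy both the mean and the floor."""
    if not scores:
        # Nothing was graded, so treat it as a fail (an assumption).
        return False
    values = list(scores.values())
    return mean(values) >= MEAN_THRESHOLD and min(values) >= CRITERION_FLOOR
```

For example, `passes({"accuracy": 5, "safety": 5, "clarity": 2})` is False even though the mean is 4.0, because one criterion falls below the floor.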

Rubric criteria

  • Cognition
  • Code Assistant
  • Devin Sessions And Planning

Recommended for

Devin · Cognition customers
