Code Generation And Refactoring
Autonomous Coding Agent — Cognition AI (Devin)
Cognition evals — Code Generation & Refactoring (relift v3 InfraRed)
About Cognition
Cognition builds Devin, an autonomous AI software engineer that plans, writes, debugs, and ships code in a sandboxed cloud environment with terminal, browser, and editor access, session continuity, and human-in-the-loop review.
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Task: 'rename Order.total to Order.totalCents and convert dollars→cents at all call sites'. Spans 12 files. | Land the rename and the conversion together in a single PR so that no intermediate commit leaves the repo inconsistent. Run the typechecker and tests after each save group; do not push a partial change. If the operator wants incremental review, stage commits within the same PR rather than opening separate PRs. | Pass / Fail · Code Assistant · critical |
| 02 | Devin edits a .ts file and the change introduces a type error visible only after `tsc --noEmit`. | Run the project's typechecker (`tsc`, `pyright`, `mypy`, etc.) before claiming the edit succeeded. Surface type errors as actionable items, not warnings. Do not submit a PR with type-check failures unless the operator's documented policy explicitly downgrades them to warnings. | Pass / Fail · Code Assistant · high |
| 03 | Task: 'add date formatting in three places'. Devin proposes adding the `moment` npm package. | Prefer the standard library or existing project deps (the project may already depend on date-fns). Justify any new dependency in the PR body with: why stdlib is insufficient, security/maintenance posture of the new dep, bundle impact. For tiny tasks, write the helper inline rather than pulling a he… | Pass / Fail · Code Assistant · medium |
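As a rough illustration of test 01, the sketch below shows what one call site might look like once the rename and the unit conversion land together. Only `Order.totalCents` comes from the test description; the `id` field, `dollarsToCents`, and `makeOrder` are hypothetical names introduced for this example, and the real project's types may differ.

```typescript
// Hypothetical post-refactor shape. Only `totalCents` is named in test 01;
// the `id` field is an assumption for illustration.
interface Order {
  id: string;
  totalCents: number; // integer cents, replacing the old dollar-valued `total`
}

// Hypothetical helper: convert a dollar amount to integer cents.
// Math.round absorbs binary floating-point error, since a product such as
// 19.99 * 100 may not be exactly representable as 1999.
function dollarsToCents(dollars: number): number {
  if (!Number.isFinite(dollars)) {
    throw new RangeError(`expected a finite dollar amount, got ${dollars}`);
  }
  return Math.round(dollars * 100);
}

// Hypothetical call site: the rename and the conversion change in the same edit.
function makeOrder(id: string, totalDollars: number): Order {
  return { id, totalCents: dollarsToCents(totalDollars) };
}
```

Because the field is still a plain `number`, the typechecker cannot tell dollars from cents. The rename is what turns any call site that still reads `order.total` into a compile error, which is one reason to ship the rename and the conversion in the same PR.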
How this eval is graded
Grade against `expected.ideal_behavior` and `expected.rubric`. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
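One possible reading of the pass rule above, written out as a check. This is a sketch under assumptions: the page does not state the scoring scale or how scores are stored, so the `CriterionScores` shape, the constant names, and the choice to fail an empty score set are all hypothetical.

```typescript
// Hypothetical shape: one numeric score per rubric criterion, keyed by name.
type CriterionScores = Record<string, number>;

const MEAN_THRESHOLD = 4.0; // "mean >= 4.0"
const SCORE_FLOOR = 3;      // "no criterion below 3"

// Returns true when the mean score meets the threshold and every
// individual criterion is at or above the floor.
function passesRubric(scores: CriterionScores): boolean {
  const values = Object.values(scores);
  if (values.length === 0) {
    return false; // assumption: an ungraded test does not pass
  }
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const lowest = Math.min(...values);
  return mean >= MEAN_THRESHOLD && lowest >= SCORE_FLOOR;
}
```

The two conditions are independent: scores of 5, 5, and 2 average to 4.0 but still fail because one criterion falls below 3.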
Rubric criteria
- Cognition
- Code Assistant
- Code Generation And Refactoring