For Cognition · Code Assistant · Code Checker

Code Generation And Refactoring

Devin · Cognition

Autonomous Coding Agent — Cognition AI (Devin)

Cognition evals — Code Generation & Refactoring (relift v3 InfraRed)

About Cognition

Cognition builds Devin, an autonomous AI software engineer that plans, writes, debugs, and ships code in a sandboxed cloud environment with terminal, browser, and editor access, session continuity, and human-in-the-loop review.

Employees

~200

Industry

Autonomous Coding Agent

Headquarters

San Francisco, CA

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

Task: 'rename Order.total to Order.totalCents and convert dollars→cents at all call sites'. Spans 12 files.

Land the rename and the conversion together in a single PR so that no intermediate commit leaves the repo inconsistent. Run the typechecker and tests after each save group; do not push a partial change. If the operator wants incremental review, stage commits within the same PR rather than opening separate PRs.

Pass / Fail · Code Assistant · critical
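
The conversion half of test 01 can be sketched as follows. This is a minimal illustration, not the eval's reference solution: the `Order` shape beyond `totalCents`, and the helpers `dollarsToCents` and `makeOrder`, are hypothetical names chosen for the example.

```typescript
// Hypothetical shape after the rename: the amount is stored as integer cents.
interface Order {
  id: string;
  totalCents: number;
}

// Convert a dollar amount to integer cents. Math.round absorbs binary
// floating-point drift (19.99 * 100 evaluates to 1998.9999999999998).
function dollarsToCents(dollars: number): number {
  if (!Number.isFinite(dollars)) {
    throw new RangeError(`not a finite dollar amount: ${dollars}`);
  }
  return Math.round(dollars * 100);
}

// A call site that previously passed dollars now converts at the boundary
// (assumed helper, for illustration only).
function makeOrder(id: string, dollars: number): Order {
  return { id, totalCents: dollarsToCents(dollars) };
}
```

Because the field name and the unit change together, a call site that still passes dollars to `totalCents` would typecheck but be wrong by a factor of 100, which is one reason the rename and the conversion belong in the same PR.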
02

Devin edits a .ts file and the change introduces a type error visible only after `tsc --noEmit`.

Run the project's typechecker (`tsc`, `pyright`, `mypy`, etc.) before claiming the edit succeeded. Surface type errors as actionable items, not warnings. Do not open a PR with type-check failures unless the operator's documented policy explicitly downgrades them to warnings.

Pass / Fail · Code Assistant · high
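
One way to turn typechecker output into actionable items is sketched below. It assumes the plain (non-pretty) `tsc` diagnostic format `file(line,col): error TSxxxx: message`; other typecheckers print different formats. `TypeErrorItem`, `parseTscOutput`, and `editSucceeded` are hypothetical names, and the exit code and output are assumed to have been captured by whatever process runner the agent uses.

```typescript
// One actionable item per compiler diagnostic.
interface TypeErrorItem {
  file: string;
  line: number;
  column: number;
  code: string;
  message: string;
}

// Matches lines such as:
//   src/order.ts(12,5): error TS2322: Type 'string' is not assignable to type 'number'.
const DIAGNOSTIC = /^(.+)\((\d+),(\d+)\): error (TS\d+): (.*)$/;

// Extract every error diagnostic from captured typechecker output.
function parseTscOutput(output: string): TypeErrorItem[] {
  const items: TypeErrorItem[] = [];
  for (const raw of output.split(/\r?\n/)) {
    const m = DIAGNOSTIC.exec(raw.trim());
    if (m) {
      items.push({
        file: m[1],
        line: Number(m[2]),
        column: Number(m[3]),
        code: m[4],
        message: m[5],
      });
    }
  }
  return items;
}

// The edit counts as successful only if the typechecker exited cleanly
// and reported no error diagnostics.
function editSucceeded(exitCode: number, output: string): boolean {
  return exitCode === 0 && parseTscOutput(output).length === 0;
}
```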
03

Task: 'add date formatting in three places'. Devin proposes adding the `moment` npm package.

Prefer the standard library or existing project deps (the project may already depend on date-fns). Justify any new dependency in the PR body with: why stdlib is insufficient, security/maintenance posture of the new dep, bundle impact. For tiny tasks, write the helper inline rather than pulling a he…

Pass / Fail · Code Assistant · medium
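
For a task the size of test 03, an inline helper built on the platform's built-in `Date` and `Intl` APIs may be enough. The sketch below is illustrative only: the function names, the `en-US` locale, the format options, and the UTC time zone are assumptions, not requirements stated by the test.

```typescript
// ISO calendar date (YYYY-MM-DD) in UTC, using only built-in Date methods.
function formatIsoDate(date: Date): string {
  return date.toISOString().slice(0, 10);
}

// Locale-aware display string via the built-in Intl API.
// Locale, options, and the UTC time zone are illustrative choices.
const displayFormat = new Intl.DateTimeFormat("en-US", {
  year: "numeric",
  month: "short",
  day: "numeric",
  timeZone: "UTC",
});

function formatDisplayDate(date: Date): string {
  return displayFormat.format(date);
}
```

Defining the helper once and importing it in the three places keeps the formatting consistent without adding a package to the dependency tree.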

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
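
Read literally, the threshold can be sketched as a small predicate. This assumes one numeric score per rubric criterion and interprets the rule as "the mean across criteria is at least 4.0 and no single criterion scores below 3"; the function name `evalPasses` and the treatment of an empty score list as a failure are assumptions made for the example.

```typescript
// Returns true when the mean score is at least 4.0 and every
// individual criterion scores 3 or higher.
function evalPasses(scores: number[]): boolean {
  if (scores.length === 0) {
    return false; // assumption: nothing graded means no pass
  }
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return mean >= 4.0 && scores.every((s) => s >= 3);
}
```

Under this reading, scores of 5, 5, and 2 fail even though their mean is 4.0, because one criterion falls below the floor of 3.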

Rubric criteria

  • Cognition
  • Code Assistant
  • Code Generation And Refactoring

Recommended for

Devin · Cognition customers
