Code Generation And Refactoring

Devin · Cognition

Autonomous Coding Agent — Cognition AI (Devin)

Evaluates Cognition's Code Generation & Refactoring across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Autonomous Coding Agent eval coverage.

About Cognition

Cognition builds Devin, an autonomous AI software engineer that plans, writes, debugs, and ships code in a sandboxed cloud environment with terminal, browser, and editor access, session continuity, and human-in-the-loop review.

Employees

~200

Industry

Autonomous Coding Agent

Headquarters

San Francisco, CA

Website

cognition.ai

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check	Category	Severity
01	Task: 'rename Order.total to Order.totalCents and convert dollars→cents at all call sites'. Spans 12 files.	Land the rename and the conversion together (single PR) so no intermediate commit leaves the repo inconsistent. Run the typechecker and tests after each save group; do not push a partial change. If the operator wants incremental review, stage commits within the same PR rather than opening separate PRs.	Pass / Fail	Code Assistant	critical
02	Devin edits a .ts file and the change introduces a type error visible only after `tsc --noEmit`.	Run the project's typechecker (`tsc`, `pyright`, `mypy`, etc.) before claiming the edit succeeded. Surface type errors as actionable items, not warnings. Do not submit a PR with type-check failures unless the operator's documented policy explicitly downgrades them to warnings.	Pass / Fail	Code Assistant	high
03	Devin's str_replace edit targets 'foo' → 'bar' but 'foo' appears 4 times in the file; only one is the intended target.	Include enough surrounding context in the search string to make the match unique. On an ambiguous match, surface the 4 candidate sites to the operator instead of replacing the first occurrence. Treat str_replace as requiring a single match (akin to Anthropic's memory tool str_replace constraint).	Pass / Fail	Code Assistant	critical
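The single-match behavior expected in sample test 03 can be illustrated with a minimal Python sketch. This is not Devin's or Anthropic's implementation; the names `str_replace_unique`, `find_occurrences`, and `AmbiguousMatchError` are hypothetical and exist only for this example. It assumes the edit operates on the file contents as a plain string.

```python
class AmbiguousMatchError(ValueError):
    """Raised when the search string matches more than one site (hypothetical name)."""

    def __init__(self, needle: str, line_numbers: list[int]):
        super().__init__(
            f"{needle!r} matches {len(line_numbers)} sites on lines {line_numbers}; "
            "add surrounding context to make the match unique"
        )
        self.line_numbers = line_numbers


def find_occurrences(text: str, needle: str) -> list[int]:
    """Return the 1-based line number of each non-overlapping occurrence of needle."""
    if not needle:
        raise ValueError("search string must not be empty")
    line_numbers = []
    start = 0
    while True:
        idx = text.find(needle, start)
        if idx == -1:
            break
        # Line number = number of newlines before the match, plus one.
        line_numbers.append(text.count("\n", 0, idx) + 1)
        start = idx + len(needle)
    return line_numbers


def str_replace_unique(text: str, old: str, new: str) -> str:
    """Replace old with new only if old occurs exactly once in text."""
    sites = find_occurrences(text, old)
    if not sites:
        raise LookupError(f"{old!r} not found")
    if len(sites) > 1:
        # Do not guess: report every candidate site to the caller instead.
        raise AmbiguousMatchError(old, sites)
    return text.replace(old, new, 1)
```

Called with a bare `'foo'` on a file where it appears four times, the function raises and lists all four candidate lines; called with a longer search string such as `'foo += 2'`, it replaces only that site.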

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
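One reading of the threshold above can be sketched in Python. The page does not state the scoring scale or how the judge's output is structured, so this assumes each rubric criterion receives a numeric score and that the mean is taken across criteria; the function name `case_passes` and the criterion names in the usage example are hypothetical.

```python
from statistics import mean


def case_passes(
    scores: dict[str, float],
    mean_threshold: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Return True when the mean score is >= mean_threshold and no score is below floor.

    Assumption: scores maps a criterion name to the judge's numeric score for it.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Usage with made-up criterion names:
# the mean is 4.0, but one criterion falls below the floor of 3, so the case fails.
result = case_passes({"atomic_change": 5, "typecheck_run": 5, "ambiguity_handling": 2})
```

Both conditions are checked together so that one very low criterion cannot be averaged away by high scores elsewhere.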

Rubric criteria

Cognition
Code Assistant
Code Generation And Refactoring

Recommended for

Devin · Cognition customers

Works with

Cognition

Frequently asked questions

What does the Code Generation And Refactoring eval for Cognition Devin test?

How is the Code Generation And Refactoring eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Code Generation And Refactoring pack for Cognition Devin contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Code Generation And Refactoring pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.