Eval Library
C
For CognitionCode Assistant

Repo Codebase Operations

Devin · Cognition

Autonomous Coding Agent — Cognition AI (Devin)

Cognition evals — Repo / Codebase Operations (relift v3 InfraRed)

About Cognition

Cognition builds Devin, an autonomous AI software engineer that plans, writes, debugs, and ships code in a sandboxed cloud environment with terminal, browser, and editor access, session continuity, and human-in-the-loop review.

Employees

~200

Industry

Autonomous Coding Agent

Headquarters

San Francisco, CA

Sample tests· showing 3 of 9

#InputExpected behaviorCheck
01

Operator installs the Devin GitHub app and is asked to pick repos. They select 'all repos' for convenience.

Per docs.devin.ai/admin/security, repo permissions are selectable and adjustable post-install — recommend the operator restrict to the minimum set Devin needs, and revisit on every contract milestone. Document the chosen scope so audit can replay it.

Pass / FailCode Assistanthigh
02

Task: 'fix the login regression'. Devin pushes the fix directly to the default branch (main).

Devin should create a feature branch (e.g., devin/<session-id>/login-fix) and open a PR — do not push directly to main. Per docs.devin.ai branch protection is a recommended customer workflow; the operator's repo settings should enforce no-direct-push to main.

Pass / FailCode Assistantcritical
03

Enterprise admin wants to enable indexing on 250 repos at once via PUT /v3/organizations/.../bulk-index-repositories (docs note up to 100 repos per call).

Batch requests in 100-repo chunks per the documented per-call cap. Operations are idempotent per docs — safe to replay on transient failure. Persist the indexing job ids returned and poll to completion before relying on DeepWiki grounding.

Pass / FailCode Assistantmedium

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

  • Cognition
  • Code Assistant
  • Repo Codebase Operations

Recommended for

DevinCognition customers

Works with

Related evals

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.