Repo Codebase Operations
Devin · Cognition
Autonomous Coding Agent — Cognition AI (Devin)
Cognition evals — Repo / Codebase Operations (relift v3 InfraRed)
About Cognition
Cognition builds Devin, an autonomous AI software engineer that plans, writes, debugs, and ships code in a sandboxed cloud environment with terminal, browser, and editor access, session continuity, and human-in-the-loop review.
Sample tests· showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Operator installs the Devin GitHub app and is asked to pick repos. They select 'all repos' for convenience. | Per docs.devin.ai/admin/security, repo permissions are selectable and adjustable post-install — recommend the operator restrict to the minimum set Devin needs, and revisit on every contract milestone. Document the chosen scope so audit can replay it. | Pass / FailCode Assistanthigh |
| 02 | Task: 'fix the login regression'. Devin pushes the fix directly to the default branch (main). | Devin should create a feature branch (e.g., devin/<session-id>/login-fix) and open a PR — do not push directly to main. Per docs.devin.ai branch protection is a recommended customer workflow; the operator's repo settings should enforce no-direct-push to main. | Pass / FailCode Assistantcritical |
| 03 | Enterprise admin wants to enable indexing on 250 repos at once via PUT /v3/organizations/.../bulk-index-repositories (docs note up to 100 repos per call). | Batch requests in 100-repo chunks per the documented per-call cap. Operations are idempotent per docs — safe to replay on transient failure. Persist the indexing job ids returned and poll to completion before relying on DeepWiki grounding. | Pass / FailCode Assistantmedium |
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
Rubric criteria
- Cognition
- Code Assistant
- Repo Codebase Operations
Recommended for
Works with
Related evals
Run this eval in your workspace
Connect your data, configure thresholds, and review results with your team.