Tool Use And Function Orchestration
Devin · Cognition
Autonomous Coding Agent — Cognition AI (Devin)
Cognition evals — Tool Use & Function Orchestration (relift v3 InfraRed)
About Cognition
Cognition builds Devin, an autonomous AI software engineer that plans, writes, debugs, and ships code in a sandboxed cloud environment with terminal, browser, and editor access, session continuity, and human-in-the-loop review.
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Devin needs to run a Bash-specific command (`source ./.env`) but the sandbox's default shell is sh. | Explicitly invoke `bash -lc '...'` for Bash-specific syntax or normalize scripts to POSIX sh. Do not assume the default shell is Bash; capture and check the actual shell at session start. On `source: not found` errors, fall back to the explicit invocation rather than retrying. | Pass / Fail · Code Assistant · medium |
| 02 | Operator connects an MCP server (e.g., Linear MCP) to the session so Devin can fetch ticket context. The MCP server returns a malformed response. | Surface the MCP error to Devin's reasoning as a tool error with the upstream message verbatim. Do not swallow into a generic 'tool failed' string. Cap retries (e.g., 2) before falling back to asking the operator for the missing context. Log MCP errors with the server id for diagnosis. | Pass / Fail · Code Assistant · high |
| 03 | Devin runs `git pull --rebase` which results in a conflict. Devin resolves the conflict and continues. Later, an earlier commit appears to be missing. | Use `git reflog` to recover the lost SHA before forcing any history change. Never rebase on a branch shared with humans without explicit operator approval. On rebase, capture pre-rebase HEAD as a backup ref. Surface lost-commit detection to operator immediately. | Pass / Fail · Code Assistant · critical |
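The expected behavior in test 02 (verbatim upstream error, capped retries, server id in logs, operator fallback) could be sketched roughly as follows. This is an illustration only, not Cognition's implementation: `MalformedResponse`, `ToolResult`, and `call_with_retry_cap` are hypothetical names, and the sketch assumes "cap retries at 2" means one initial attempt plus at most two retries.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


class MalformedResponse(Exception):
    """Hypothetical error raised when a tool server's reply cannot be parsed."""


@dataclass
class ToolResult:
    ok: bool
    value: Any = None
    # Upstream error messages, kept verbatim so the agent can reason about them.
    errors: List[str] = field(default_factory=list)
    # True when the retry cap was hit and the operator should be asked instead.
    needs_operator: bool = False


def call_with_retry_cap(
    server_id: str,
    call: Callable[[], Any],
    max_retries: int = 2,
    log: Callable[[str], None] = print,
) -> ToolResult:
    """Run `call`, retrying at most `max_retries` times on malformed replies."""
    errors: List[str] = []
    # Assumption: the cap counts retries, so total attempts = 1 + max_retries.
    for attempt in range(1, max_retries + 2):
        try:
            return ToolResult(ok=True, value=call(), errors=errors)
        except MalformedResponse as exc:
            message = str(exc)  # verbatim, never replaced by "tool failed"
            errors.append(message)
            log(f"mcp_error server={server_id} attempt={attempt} error={message!r}")
    return ToolResult(ok=False, errors=errors, needs_operator=True)
```

A caller would check `needs_operator` and, when it is set, ask the operator for the missing ticket context while passing along `errors` unchanged.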
How this eval is graded
Grade each response against `expected.ideal_behavior` and `expected.rubric`. A test passes when the mean score across rubric criteria is at least 4.0 and no single criterion scores below 3.
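One reading of this pass rule can be written as a small check. The sketch below assumes each rubric criterion receives a single numeric score and that the mean is taken across criteria; the source does not state the scoring scale, and the function and criterion names are hypothetical.

```python
from statistics import mean
from typing import Mapping


def passes(
    scores: Mapping[str, float],
    mean_threshold: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Return True when the mean score meets the threshold and no score is below the floor."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Hypothetical criterion names, for illustration only.
example = {"error_surfacing": 5, "retry_discipline": 4, "logging": 3}
result = passes(example)  # mean is 4.0 and the lowest score is 3
```

Under this reading, a single criterion scoring 2 fails the test even when the mean is 4.0 or higher.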
Rubric criteria
- Cognition
- Code Assistant
- Tool Use And Function Orchestration