Eval Library
For Cognition · Code Assistant · Tool Selection

Tool Use And Function Orchestration

Devin · Cognition

Autonomous Coding Agent — Cognition AI (Devin)

Cognition evals — Tool Use & Function Orchestration (relift v3 InfraRed)

About Cognition

Cognition builds Devin, an autonomous AI software engineer that plans, writes, debugs, and ships code in a sandboxed cloud environment with terminal, browser, and editor access, session continuity, and human-in-the-loop review.

Employees

~200

Industry

Autonomous Coding Agent

Headquarters

San Francisco, CA

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

Devin needs to run a Bash-specific command (`source ./.env`) but the sandbox's default shell is sh.

Explicitly invoke `bash -lc '...'` for Bash-specific syntax or normalize scripts to POSIX sh. Do not assume the default shell is Bash; capture and check the actual shell at session start. On `source: not found` errors, fall back to the explicit invocation rather than retrying.

Pass / Fail · Code Assistant · medium
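The fallback described in test 01 could be sketched roughly as follows. This is a minimal illustration, not Devin's actual implementation: `run_script`, `bash_argv`, and the injectable `runner` parameter are hypothetical names, and the error substring assumes the default `sh` reports the failure in the `source: not found` form quoted above. Detecting the actual shell at session start is not covered here.

```python
import subprocess
from typing import Callable, List

# Substring taken from the error quoted in the test description (assumption:
# the sandbox's sh reports the missing builtin in this form).
SOURCE_NOT_FOUND = "source: not found"

Runner = Callable[[List[str]], subprocess.CompletedProcess]


def default_runner(argv: List[str]) -> subprocess.CompletedProcess:
    """Run argv and capture stdout/stderr as text."""
    return subprocess.run(argv, capture_output=True, text=True)


def bash_argv(script: str) -> List[str]:
    """Build the explicit Bash invocation named in the expected behavior."""
    return ["bash", "-lc", script]


def run_script(script: str, runner: Runner = default_runner) -> subprocess.CompletedProcess:
    """Run under sh; on a `source: not found` failure, switch to Bash once."""
    result = runner(["sh", "-c", script])
    if result.returncode != 0 and SOURCE_NOT_FOUND in (result.stderr or ""):
        # Do not retry the same sh command; use the explicit Bash invocation.
        return runner(bash_argv(script))
    return result
```

When a script is already known to use Bash-only syntax, calling `runner(bash_argv(script))` directly avoids the failed first attempt.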
02

Operator connects an MCP server (e.g., Linear MCP) to the session so Devin can fetch ticket context. The MCP server returns a malformed response.

Surface the MCP error to Devin's reasoning as a tool error that carries the upstream message verbatim. Do not swallow it into a generic 'tool failed' string. Cap retries (e.g., 2) before falling back to asking the operator for the missing context. Log each MCP error with the server id for diagnosis.

Pass / Fail · Code Assistant · high
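One way to picture the behavior expected in test 02 is the sketch below. It does not use any real MCP SDK: `MalformedResponse`, `ToolResult`, `call_mcp_tool`, and the `fetch` callable are all hypothetical stand-ins, and the retry cap of 2 is the example value from the test description.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Optional

logger = logging.getLogger("mcp_tool_errors")


class MalformedResponse(Exception):
    """Hypothetical error raised by `fetch` when a response cannot be parsed."""


@dataclass
class ToolResult:
    ok: bool
    value: Any = None
    error: Optional[str] = None      # upstream message, kept verbatim
    server_id: Optional[str] = None
    ask_operator: bool = False       # set once the retry cap is exhausted
    attempts: int = 0


def call_mcp_tool(server_id: str, fetch: Callable[[], Any], max_retries: int = 2) -> ToolResult:
    """Call `fetch`, retrying at most `max_retries` times after the first attempt."""
    last_error: Optional[str] = None
    attempts = 0
    for _ in range(1 + max_retries):
        attempts += 1
        try:
            return ToolResult(ok=True, value=fetch(), server_id=server_id, attempts=attempts)
        except MalformedResponse as exc:
            last_error = str(exc)
            # Log with the server id so the failing server can be identified.
            logger.error("MCP error server_id=%s attempt=%d: %s", server_id, attempts, last_error)
    # Retries exhausted: return the verbatim upstream message and ask the operator.
    return ToolResult(ok=False, error=last_error, server_id=server_id,
                      ask_operator=True, attempts=attempts)
```

The failure result keeps the upstream text in `error` unchanged, so the agent's reasoning sees the actual message instead of a generic string.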
03

Devin runs `git pull --rebase` which results in a conflict. Devin resolves the conflict and continues. Later, an earlier commit appears to be missing.

Before rebasing, capture the pre-rebase HEAD as a backup ref. Never rebase a branch shared with humans without explicit operator approval. If a commit appears to be missing, use `git reflog` to recover the lost SHA before forcing any history change, and surface the lost-commit detection to the operator immediately.

Pass / Fail · Code Assistant · critical
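The backup-ref and lost-commit steps in test 03 might look something like the following sketch. The helper names (`run_git`, `save_pre_rebase_head`, `possibly_lost_commits`) and the `refs/backup/` namespace are illustrative assumptions. Using `git cherry` for detection is a choice made here, not something the eval specifies: it compares patches, so a commit whose diff changed during conflict resolution is also flagged, which makes the check conservative.

```python
import subprocess
from typing import Callable, List

Git = Callable[[List[str]], str]


def run_git(args: List[str]) -> str:
    """Run a git subcommand in the current directory and return its stdout."""
    return subprocess.run(["git", *args], check=True,
                          capture_output=True, text=True).stdout


def save_pre_rebase_head(label: str, git: Git = run_git) -> str:
    """Record the current HEAD under refs/backup/<label> before rebasing."""
    sha = git(["rev-parse", "HEAD"]).strip()
    git(["update-ref", f"refs/backup/{label}", sha])
    return sha


def possibly_lost_commits(backup_ref: str, git: Git = run_git) -> List[str]:
    """List commits on the backup ref with no equivalent patch on HEAD.

    `git cherry HEAD <backup_ref>` prefixes such commits with "+ ".
    """
    out = git(["cherry", "HEAD", backup_ref])
    return [line[2:].strip() for line in out.splitlines() if line.startswith("+ ")]
```

A non-empty list from `possibly_lost_commits` would be the point at which to alert the operator. Because the backup ref still points at the old history, the pre-rebase SHA stays recoverable without searching `git reflog`.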

How this eval is graded

Grade each response against expected.ideal_behavior and expected.rubric. A test passes when the mean score across rubric criteria is >= 4.0 and no single criterion scores below 3.
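The threshold rule above can be written as a small check. This sketch assumes one reading of the rule: the mean is taken across the criteria of a single test, and each criterion carries one numeric score. The function name `passes` and the criterion names in the usage example are hypothetical.

```python
from statistics import mean
from typing import Mapping


def passes(scores: Mapping[str, float],
           mean_threshold: float = 4.0,
           floor: float = 3) -> bool:
    """Return True when the mean score meets the threshold and none is below the floor."""
    if not scores:
        return False  # nothing graded, so nothing can pass
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Hypothetical criterion names, for illustration only.
print(passes({"shell_handling": 5, "fallback": 4, "no_blind_retry": 3}))
print(passes({"shell_handling": 5, "fallback": 5, "no_blind_retry": 2}))
```

The second call shows why the floor matters: the mean is still 4.0, but one criterion falls below 3, so the test fails.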

Rubric criteria

  • Cognition
  • Code Assistant
  • Tool Use And Function Orchestration

Recommended for

Devin · Cognition customers
