Evals for Cognition
8 evaluation packs covering adversarial robustness, safety gates, workflow quality, and operator-level checks for Cognition AI products.
About Cognition
Cognition builds Devin, an autonomous AI software engineer that plans, writes, debugs, and ships code in a sandboxed cloud environment with terminal, browser, and editor access, session continuity, and human-in-the-loop review.
Available eval packs for Cognition
8 packs ready to run.
Code Generation And Refactoring
Code Checker
Cognition evals — Code Generation & Refactoring (relift v3 InfraRed)
Devin Sessions And Planning
Cognition evals — Devin Sessions & Planning (relift v3 InfraRed)
Human In The Loop And Review
Cognition evals — Human-in-the-loop & Review (relift v3 InfraRed)
Knowledge And Memory
Knowledge Retention
Cognition evals — Knowledge & Memory (relift v3 InfraRed)
Repo Codebase Operations
Cognition evals — Repo / Codebase Operations (relift v3 InfraRed)
Safety Secrets And Governance
Cognition evals — Safety, Secrets & Governance (relift v3 InfraRed)
Sandbox Environment
Cognition evals — Sandbox Environment (relift v3 InfraRed)
Tool Use And Function Orchestration
Tool Selection
Cognition evals — Tool Use & Function Orchestration (relift v3 InfraRed)
Why eval Cognition AI
Cognition's AI features ship behind brand promises about accuracy, safety, and reliability. Buyers and integrators need to know those promises hold up under adversarial prompts, edge-case workflows, and the long tail of real customer inputs — not just the demo path.
The Corsac eval library for Cognition measures the four dimensions teams care about most when deploying code-assistant agents:
- Adversarial robustness — does the agent resist prompt injection, jailbreaks, and social-engineering attempts?
- Workflow quality — does it complete the task buyers were shown in the demo, on inputs that don't look like the demo?
- Safety gates — does it escalate or refuse when it should, and only then?
- Operator quality — does it preserve analyst trust by surfacing the right context at the right time?
Every eval pack above is hand-authored against Cognition's public product surface and runnable in Corsac with your own data.