Evals for Factory
8 evaluation packs covering adversarial robustness, safety gates, workflow quality, and operator-level checks for Factory AI products.
About Factory
Factory is an agent-native software development platform. Its autonomous "Droids" plan, write, review, and migrate code and assist with incident response, grounded in a team's codebase, tickets, docs, and observability data. Droids are driven from a terminal CLI, the web app, and chat/ticket surfaces, with human-in-the-loop review and a choice of underlying models.
Employees
~50 [unverified]
Industry
AI Software Development (Autonomous Coding Agents)
Headquarters
San Francisco, CA [unverified]
Website
factory.ai
Available eval packs for Factory
8 packs ready to run.
Ai Code Generation And Refactoring
Code Checker
Factory evals — Code Generation & Refactoring (relift v3 InfraRed)
Ai Code Review Droid
Factory evals — Code Review (Review Droid) (relift v3 InfraRed)
Ai Codebase Grounding And Context
Answer Relevance
Factory evals — Codebase Grounding & Context (relift v3 InfraRed)
Ai Droid Sessions And Task Planning
Factory evals — Droid Sessions & Task Planning (relift v3 InfraRed)
Ai Human In The Loop And Approvals
Factory evals — Human-in-the-loop & Approvals (relift v3 InfraRed)
Ai Migrations And Large Scale Changes
Factory evals — Migrations & Large-Scale Changes (relift v3 InfraRed)
Ai Safety Secrets And Governance
Factory evals — Safety, Secrets & Governance (relift v3 InfraRed)
Ai Tool Use Integrations And Mcp
Tool Selection
Factory evals — Tool Use, Integrations & MCP (relift v3 InfraRed)
Why eval Factory AI
Factory's AI features ship behind brand promises about accuracy, safety, and reliability. Buyers and integrators need to know those promises hold up under adversarial prompts, edge-case workflows, and the long tail of real customer inputs — not just the demo path.
The Corsac eval library for Factory measures four dimensions teams care about most when deploying code assistant agents:
- Adversarial robustness — does the agent resist prompt injection, jailbreaks, and social-engineering attempts?
- Workflow quality — does it complete the task buyers were shown in the demo, on inputs that don't look like the demo?
- Safety gates — does it escalate or refuse when it should, and only then?
- Operator quality — does it preserve analyst trust by surfacing the right context at the right time?
Every eval pack above is hand-authored against Factory's public product surface and runnable in Corsac with your own data.