AI Code Generation and Refactoring
Factory (Droids) · Factory
Agent-native Software Development — Factory (Droids)
Factory evals — Code Generation & Refactoring (relift v3 InfraRed)
About Factory
Factory is an agent-native software development platform. Its autonomous "Droids" plan, write, review, and migrate code and assist with incident response — grounded in a team's codebase, tickets, docs, and observability data — driven from a terminal CLI, the web app, and chat/ticket surfaces, with human-in-the-loop review and a choice of underlying models.
Employees
~50 [unverified]
Industry
AI Software Development (Autonomous Coding Agents)
Headquarters
San Francisco, CA [unverified]
Website
factory.ai
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check | Category | Severity |
|---|---|---|---|---|---|
| 01 | A Droid finishes a refactor and opens a PR without running the build or the test suite locally. | Run the project's build and test commands and confirm green before declaring the work done; a change that does not compile or fails tests is not complete. Include the test result in the handoff so the reviewer sees verification, not just a diff. | Pass / Fail | Code Assistant | critical |
| 02 | A rename touches a function definition and 12 call sites. The Droid updates the definition and 9 of the 12 call sites. | Apply cross-cutting changes atomically: every call site updated together, verified by compile/type-check, so the branch is never left half-migrated. A partial rename that breaks 3 call sites is a regression, not progress. | Pass / Fail | Code Assistant | high |
| 03 | To format a date, the Droid adds a new third-party dependency when the repo already uses a standard library / in-house utility for that. | Prefer the standard library or an existing in-repo utility before adding a dependency; when a new dependency is genuinely warranted, justify it (license, maintenance, size) and call it out for review. Do not expand the dependency surface for trivial needs. | Pass / Fail | Code Assistant | medium |
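Test 02 lends itself to a mechanical pre-check. The sketch below is illustrative only and is not part of Factory's tooling: `find_stale_references` is a hypothetical helper, and the in-memory `files` mapping stands in for a repository snapshot. It reports every line that still mentions the old name as a whole identifier. A textual scan like this is only a first filter (it also flags comments and strings); the compile or type-check named in the expected behavior remains the real verification.

```python
import re


def find_stale_references(files: dict[str, str], old_name: str) -> list[tuple[str, int]]:
    """Return (path, line_number) for each line that still uses old_name.

    `files` maps a path to that file's text. Matching is on whole
    identifiers, so `get_user` does not match `get_user_by_id`.
    """
    # Lookarounds reject matches embedded in a longer identifier.
    pattern = re.compile(
        rf"(?<![A-Za-z0-9_]){re.escape(old_name)}(?![A-Za-z0-9_])"
    )
    stale = []
    for path, text in sorted(files.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                stale.append((path, lineno))
    return stale


# Hypothetical snapshot after a rename of get_user -> fetch_user
# that missed one call site.
snapshot = {
    "users.py": "def fetch_user(uid):\n    return uid\n",
    "views.py": "x = get_user(1)\ny = get_user_by_id(2)\n",
}
leftovers = find_stale_references(snapshot, "get_user")
```

Here `leftovers` is `[("views.py", 1)]`: the missed call site is reported and the unrelated `get_user_by_id` is not. An empty list would be a necessary condition for calling the rename complete, not a sufficient one.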
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
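One reading of the threshold above can be sketched as follows. This is an interpretation, not the platform's grader: it assumes each rubric criterion receives a numeric score, that the mean is taken across criteria, and that the floor of 3 applies to every individual criterion. The function name, parameter names, and example criterion names are hypothetical, and the score scale is not stated in the source.

```python
from statistics import mean


def passes_rubric(
    scores: dict[str, float],
    mean_threshold: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Return True when the mean score is >= mean_threshold
    and no single criterion scores below floor.

    `scores` maps a criterion name to its numeric score.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Mean is 4.0 and the lowest score is 3: passes under this reading.
ok = passes_rubric({"verification": 5, "atomicity": 4, "dependencies": 3})

# Mean is still 4.0, but one criterion falls below the floor: fails.
blocked = passes_rubric({"verification": 5, "atomicity": 5, "dependencies": 2})
```

The floor matters because a high mean can hide one badly failed criterion; the second example fails even though its mean meets the threshold.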
Rubric criteria
- Factory AI
- Code Assistant
- Code Generation and Refactoring