AI Migrations and Large-Scale Changes
Agent-native Software Development — Factory (Droids)
Factory evals — Migrations & Large-Scale Changes (relift v3 InfraRed)
About Factory
Factory is an agent-native software development platform. Its autonomous "Droids" plan, write, review, and migrate code and assist with incident response — grounded in a team's codebase, tickets, docs, and observability data — driven from a terminal CLI, the web app, and chat/ticket surfaces, with human-in-the-loop review and a choice of underlying models.
Employees
~50 [unverified]
Industry
AI Software Development (Autonomous Coding Agents)
Headquarters
San Francisco, CA [unverified]
Website
factory.ai
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | A library API change must be applied to 600 call sites across the repo. The Droid starts editing them one by one by hand. | For a mechanical change at scale, prefer a deterministic codemod / AST transform that applies uniformly and is reviewable as a rule, then verify by compile + tests, rather than 600 inconsistent hand edits. Reserve hand edits for the genuine exceptions the codemod cannot handle. | Pass / Fail · Code Assistant · high |
| 02 | Migrating a date library, the Droid swaps in the new API but the new default timezone differs, silently shifting timestamps. | Migrations must preserve observable behavior unless a change is explicitly intended: audit defaults that differ between old and new APIs (timezone, locale, rounding) and pin them so behavior is unchanged. Characterization tests should catch silent shifts; add them where coverage is thin. | Pass / Fail · Code Assistant · critical |
| 03 | A multi-commit migration leaves the repo non-compiling on intermediate commits, breaking bisect and CI on the branch. | Order migration commits so each one leaves the tree compiling and tests passing where feasible; a chain of broken intermediate commits defeats bisect and blocks CI gating. If an atomic step is unavoidable, make it a single coherent commit, not a broken chain. | Pass / Fail · Code Assistant · medium |
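
The expected behavior in tests 01 and 02 can be illustrated with a small codemod. The sketch below is only an illustration, not Factory's implementation: the function names `parse_date` and `parse_datetime`, the `tz` keyword, and the assumption that the old API defaulted to UTC are all hypothetical. It uses Python's standard `ast` module to rename every matching call and pin the old default explicitly, so the rewrite is uniform and reviewable as a single rule.

```python
import ast

# Hypothetical names, chosen only for illustration.
OLD_NAME = "parse_date"      # old API, assumed to default to UTC
NEW_NAME = "parse_datetime"  # new API, assumed to default to local time
PINNED_KW = "tz"             # keyword that pins the old default
PINNED_VALUE = "UTC"


class RenameAndPin(ast.NodeTransformer):
    """Rename OLD_NAME(...) calls to NEW_NAME(..., tz='UTC')."""

    def __init__(self):
        self.rewritten = 0

    def visit_Call(self, node):
        # Visit children first so nested calls are rewritten too.
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == OLD_NAME:
            node.func = ast.Name(id=NEW_NAME, ctx=ast.Load())
            # Pin the default only if the call site does not already set it.
            if not any(kw.arg == PINNED_KW for kw in node.keywords):
                node.keywords.append(
                    ast.keyword(arg=PINNED_KW, value=ast.Constant(value=PINNED_VALUE))
                )
            self.rewritten += 1
        return node


def apply_codemod(source: str) -> tuple[str, int]:
    """Return the rewritten source and the number of call sites changed."""
    tree = ast.parse(source)
    transformer = RenameAndPin()
    new_tree = ast.fix_missing_locations(transformer.visit(tree))
    return ast.unparse(new_tree), transformer.rewritten


if __name__ == "__main__":
    before = "x = parse_date(s)\ny = other(parse_date(t))\n"
    after, count = apply_codemod(before)
    print(after)
    print(f"rewrote {count} call site(s)")
```

Note that `ast.unparse` (Python 3.9+) discards comments and original formatting, so a real migration would more likely use a formatting-preserving tool such as LibCST. The count of rewritten call sites gives a quick check against the expected number before compiling and running the tests.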
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
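
One reading of the pass rule above can be sketched as follows. This is an interpretation rather than the grader's actual code: it assumes each rubric criterion receives a numeric score (the scale is not stated in the source), that the mean is taken across criteria, and that the `scores` dictionary shape and the `passes` function name are hypothetical.

```python
from statistics import mean


def passes(scores: dict[str, float],
           mean_threshold: float = 4.0,
           floor: float = 3.0) -> bool:
    """Pass when the mean score is >= mean_threshold and no score is below floor."""
    if not scores:
        # With no scored criteria there is nothing to pass on.
        return False
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


if __name__ == "__main__":
    # Hypothetical criterion names and scores.
    print(passes({"uses_codemod": 5, "verifies": 4, "handles_exceptions": 3}))
    print(passes({"uses_codemod": 5, "verifies": 5, "handles_exceptions": 2}))
```

The second call fails even though its mean is 4.0, because one criterion falls below the floor of 3.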
Rubric criteria
- Factory AI
- Code Assistant
- Migrations And Large Scale Changes