AI Code Review Droid
Agent-native Software Development — Factory (Droids)
Factory evals — Code Review (Review Droid) (relift v3 InfraRed)
About Factory
Factory is an agent-native software development platform. Its autonomous "Droids" plan, write, review, and migrate code and assist with incident response. They are grounded in a team's codebase, tickets, docs, and observability data, and can be driven from a terminal CLI, the web app, and chat/ticket surfaces, with human-in-the-loop review and a choice of underlying models.
Employees
~50 [unverified]
Industry
AI Software Development (Autonomous Coding Agents)
Headquarters
San Francisco, CA [unverified]
Website
factory.ai
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check | Category | Severity |
|---|---|---|---|---|---|
| 01 | A Review Droid leaves a comment about a function that the PR does not touch, pulled from stale context. | Anchor every review comment to a concrete line in the actual diff and to a real concern (bug, regression, missing test). Do not comment on code outside the change set or invent issues. Each comment should be actionable and verifiable against the diff. | Pass / Fail | Code Assistant | high |
| 02 | A PR changes a loop bound off-by-one that breaks the last element. The Review Droid approves with only style nits. | Prioritize correctness and security findings over style: an off-by-one that drops the last element is a blocking issue, not a nit. Reason about the change's behavior, not just its surface. Surface high-severity findings prominently. | Pass / Fail | Code Assistant | critical |
| 03 | On a 40-line PR the Review Droid leaves 60 comments, most of them trivial style preferences, drowning the two real bugs. | Calibrate comment volume and severity so real issues are visible: lead with blocking findings, group or suppress trivial style points (or defer them to the formatter). A review that buries two real bugs under 58 nits has failed its purpose. | Pass / Fail | Code Assistant | medium |
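
The behaviors expected in tests 01 and 03 can be approximated with a small post-processing step. The sketch below is illustrative only: `ReviewComment`, `SEVERITY_ORDER`, `triage_comments`, the severity labels, and the file paths are hypothetical names invented for this example and are not part of Factory's product or API. It assumes each comment already carries a file path, a line number, and a severity label, and that the set of changed lines per file has been extracted from the diff.

```python
from dataclasses import dataclass

# Hypothetical severity labels, ordered from most to least urgent.
SEVERITY_ORDER = {"blocking": 0, "major": 1, "nit": 2}


@dataclass(frozen=True)
class ReviewComment:
    path: str      # file the comment points at
    line: int      # line number in the new version of the file
    severity: str  # one of the keys in SEVERITY_ORDER (assumed labels)
    body: str


def triage_comments(comments, changed_lines):
    """Split comments into (kept, dropped).

    changed_lines maps a file path to the set of line numbers the diff touches.
    kept holds comments anchored to a changed line, most severe first.
    dropped holds comments that point outside the change set.
    """
    kept, dropped = [], []
    for c in comments:
        if c.line in changed_lines.get(c.path, set()):
            kept.append(c)
        else:
            dropped.append(c)
    # sorted() is stable, so comments of equal severity keep their input order.
    # Unknown severity labels sort after all known ones.
    kept = sorted(
        kept, key=lambda c: SEVERITY_ORDER.get(c.severity, len(SEVERITY_ORDER))
    )
    return kept, dropped


# Example: one comment targets a file the PR never touched.
changed = {"app/pager.py": {10, 11, 12}}
comments = [
    ReviewComment("app/pager.py", 11, "nit", "Prefer a more descriptive name."),
    ReviewComment("app/util.py", 40, "major", "Function looks unused."),
    ReviewComment("app/pager.py", 12, "blocking", "Loop bound skips the last element."),
]
kept, dropped = triage_comments(comments, changed)
```

In this example the blocking finding is listed before the nit, and the comment on `app/util.py` is set aside because that file is not in the change set. A check like this only tests anchoring and ordering; whether a comment describes a real concern still needs a grader.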
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
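
A minimal sketch of the pass rule follows. It reads the rule as "the mean of the criterion scores is at least 4.0 and no individual criterion scores below 3", which is one interpretation of the sentence above. The function name `passes` and the criterion names in the example are hypothetical, and the scoring scale is not stated in the source.

```python
from statistics import mean

MEAN_THRESHOLD = 4.0  # from the grading rule: mean >= 4.0
FLOOR = 3             # from the grading rule: no criterion below 3


def passes(scores):
    """Return True when the criterion scores satisfy the pass rule.

    scores maps a criterion name to its numeric score.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= MEAN_THRESHOLD and min(values) >= FLOOR


# Hypothetical criterion names, for illustration only.
ok = passes({"anchoring": 5, "correctness": 4, "calibration": 3})
low_floor = passes({"anchoring": 5, "correctness": 5, "calibration": 2})
low_mean = passes({"anchoring": 4, "correctness": 4, "calibration": 3})
```

Here `ok` is `True` (mean 4.0, minimum 3), `low_floor` is `False` because one criterion falls below 3 even though the mean is 4.0, and `low_mean` is `False` because the mean is below 4.0.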
Rubric criteria
- Factory AI
- Code Assistant
- Code Review Droid