For Factory · Code Assistant

AI Code Review Droid

Agent-native Software Development — Factory (Droids)

Factory evals — Code Review (Review Droid) (relift v3 InfraRed)

About Factory

Factory is an agent-native software development platform. Its autonomous "Droids" plan, write, review, and migrate code, and assist with incident response. They are grounded in a team's codebase, tickets, docs, and observability data, and are driven from a terminal CLI, the web app, and chat/ticket surfaces, with human-in-the-loop review and a choice of underlying models.

Employees

~50 [unverified]

Industry

AI Software Development (Autonomous Coding Agents)

Headquarters

San Francisco, CA [unverified]

Website

factory.ai

Sample tests · showing 3 of 9

01

Input: A Review Droid leaves a comment about a function that the PR does not touch, pulled from stale context.

Expected behavior: Anchor every review comment to a concrete line in the actual diff and to a real concern (bug, regression, missing test). Do not comment on code outside the change set or invent issues. Each comment should be actionable and verifiable against the diff.

Check: Pass / Fail · Code Assistant · high

02

Input: A PR changes a loop bound off-by-one that breaks the last element. The Review Droid approves with only style nits.

Expected behavior: Prioritize correctness and security findings over style: an off-by-one that drops the last element is a blocking issue, not a nit. Reason about the change's behavior, not just its surface. Surface high-severity findings prominently.

Check: Pass / Fail · Code Assistant · critical

03

Input: On a 40-line PR the Review Droid leaves 60 comments, most of them trivial style preferences, drowning the two real bugs.

Expected behavior: Calibrate comment volume and severity so real issues are visible: lead with blocking findings, group or suppress trivial style points (or defer them to the formatter). A review that buries two real bugs under 58 nits has failed its purpose.

Check: Pass / Fail · Code Assistant · medium
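
The expected behaviors in tests 01 and 03 lend themselves to a mechanical check. The sketch below is only an illustration under stated assumptions, not Factory's implementation: the `ReviewComment` type, the severity labels (`blocking`, `major`, `nit`), the `triage` function, and the `max_nits` cap are all hypothetical names introduced here. It drops comments that are not anchored to a changed line, then leads with blocking findings and caps the number of nits.

```python
from dataclasses import dataclass

# Hypothetical severity labels, ordered from most to least important.
SEVERITY_ORDER = {"blocking": 0, "major": 1, "nit": 2}


@dataclass(frozen=True)
class ReviewComment:
    path: str      # file the comment is attached to
    line: int      # line number the comment is attached to
    severity: str  # one of the keys in SEVERITY_ORDER
    body: str


def triage(comments, changed_lines, max_nits=3):
    """Filter and order review comments.

    changed_lines maps a file path to the set of line numbers the PR touches.
    Returns (kept, out_of_diff, suppressed_nits).
    """
    anchored, out_of_diff = [], []
    for c in comments:
        # Test 01: a comment must point at a line inside the change set.
        if c.line in changed_lines.get(c.path, set()):
            anchored.append(c)
        else:
            out_of_diff.append(c)

    # Test 03: blocking findings first; the sort is stable, so the
    # original order is preserved within each severity level.
    anchored.sort(key=lambda c: SEVERITY_ORDER[c.severity])
    findings = [c for c in anchored if c.severity != "nit"]
    nits = [c for c in anchored if c.severity == "nit"]
    return findings + nits[:max_nits], out_of_diff, nits[max_nits:]
```

A check of this kind only covers placement and volume. Test 02 (approving an off-by-one with style nits only) still depends on the reviewer reasoning about what the changed loop bound does, which a filter like this cannot verify.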

How this eval is graded

Grade each response against expected.ideal_behavior and expected.rubric. A pass requires a mean score of at least 4.0 across criteria, with no single criterion scoring below 3.
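
The pass rule can be written as a small predicate. This is a minimal sketch that assumes the mean is taken across the rubric criteria of a single test; the function name `passes`, the criterion names in the example, and the score scale are assumptions, since the page states only the two thresholds.

```python
from statistics import mean


def passes(scores, mean_threshold=4.0, floor=3):
    """Return True when the mean score meets the threshold and
    no individual criterion falls below the floor.

    scores maps a criterion name (hypothetical here) to its numeric score.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# A high mean alone is not enough: one criterion below 3 fails the test.
print(passes({"anchored": 5, "prioritized": 4, "actionable": 3}))  # True
print(passes({"anchored": 5, "prioritized": 5, "actionable": 2}))  # False
```

The floor matters because a mean can hide a failure: scores of 5, 5, and 2 average exactly 4.0, yet the response fails on the criterion that scored 2.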

Rubric criteria

  • Factory AI
  • Code Assistant
  • Code Review Droid

Recommended for

Factory (Droids) · Factory customers
