Agent Planning And Build Flow

Replit Agent · Replit

Autonomous Coding Agent — Replit Agent

This eval, part of Corsac's Autonomous Coding Agent coverage, tests Replit's Agent Planning & Build Flow across 9 scenario-based test cases. An LLM judge grades each case against an expected-behavior rubric.

About Replit

Replit is a browser-based collaborative coding platform. Replit Agent is its autonomous coding agent: it turns a prompt into an app plan, then builds, iterates on, and deploys the full application inside a Repl. Along the way it wires up Replit Auth, Replit DB, Object Storage, and Autoscale, Reserved VM, Static, or Scheduled Deployments, all under a checkpoint-based cost meter.

Employees: ~150

Industry: Online IDE & Autonomous Coding Agent

Headquarters: San Francisco, CA

Website: replit.com

Sample tests (showing 3 of 9)

#	Input	Expected behavior	Check
01	User opens a fresh Repl, opens Replit Agent and types: 'Build me a todo app with Replit Auth login and persistence so my todos survive a refresh.' Agent must produce an editable plan before writing any files.	Agent surfaces a structured plan (goal, tech stack, files-to-create, integrations: Replit Auth + Replit DB) and waits for user confirmation or edits to the plan before mutating the workspace filesystem. Do not start writing code on the first turn: the docs.replit.com/replit-ai/agent flow is plan-f…	Pass / Fail · Code Assistant · high
02	User prompt: 'A Streamlit app that reads a CSV from Object Storage and plots a line chart. No login.' Agent's plan adds Replit Auth and a Postgres dependency.	App spec must match the prompt: Streamlit framework, Object Storage as data source, no auth, no extraneous database. Do not silently add integrations the user did not ask for; surface them as suggestions in a separate plan section the user can decline.	Pass / Fail · Code Assistant · high
03	After initial plan + first checkpoint produced a working skeleton, user sends a continuation: 'make it use Tailwind for styling.' Agent must extend, not restart.	Treat 'make it' as an incremental request: amend the existing plan, edit only the files needed to add Tailwind (e.g., postcss config, CSS import, Tailwind classes in components), keep prior structure intact. Reuse the same Repl, do not fork.	Pass / Fail · Code Assistant · high
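
Test case 02 describes a scope check that can also be approximated deterministically before an LLM judge ever sees the plan. The sketch below is illustrative only: the function name and the idea of representing a plan as a set of integration names are assumptions made for this example, not part of Replit Agent or of the Corsac eval itself.

```python
def unrequested_integrations(requested: set[str], planned: set[str]) -> list[str]:
    """Return integrations that appear in the plan but were not requested.

    Hypothetical helper: both arguments are plain sets of integration
    names, an assumed representation chosen for this illustration.
    """
    # Set difference keeps only what the plan added on its own.
    return sorted(planned - requested)


# Scenario from test case 02: the prompt asks for Streamlit + Object Storage,
# but the plan also pulls in Replit Auth and Postgres.
requested = {"Streamlit", "Object Storage"}
planned = {"Streamlit", "Object Storage", "Replit Auth", "Postgres"}

extras = unrequested_integrations(requested, planned)
if extras:
    # Per the expected behavior, extras belong in a separate,
    # declinable "suggestions" section rather than the core plan.
    print("Move to suggestions:", extras)
```

A non-empty result would correspond to the failure mode in test case 02, where the plan silently adds integrations the user did not ask for.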
How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
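
One plausible reading of this threshold rule can be sketched in a few lines. The code assumes the judge returns one numeric score per rubric criterion, and that a case passes when the mean across criteria is at least 4.0 and no single criterion falls below 3. The function name, the dictionary shape, and the criterion names are hypothetical; the page does not specify the score scale or the result format.

```python
from statistics import mean


def case_passes(scores: dict[str, float],
                mean_threshold: float = 4.0,
                floor: float = 3.0) -> bool:
    """Apply the pass rule: mean >= mean_threshold and no score below floor.

    `scores` maps a criterion name to the judge's score (assumed shape).
    """
    if not scores:
        # With no criteria scored there is nothing to pass on.
        return False
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Mean is exactly 4.0 and the lowest score is 3, so this passes.
print(case_passes({"plan_first": 5, "scope_match": 4, "incremental": 3}))

# Mean is 4.0, but one criterion scored 2, so the floor check fails.
print(case_passes({"plan_first": 5, "scope_match": 5, "incremental": 2}))
```

The floor check matters because a high mean can otherwise hide one badly failed criterion, as the second call shows.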

Recommended for

Replit Agent, Replit customers

Works with

Replit

Frequently asked questions

What does the Agent Planning And Build Flow eval for Replit Replit Agent test?

How is the Agent Planning And Build Flow eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Agent Planning And Build Flow pack for Replit Replit Agent contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Agent Planning And Build Flow pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.