Deployments

Replit Agent · Replit

Autonomous Coding Agent — Replit Agent

Evaluates Replit's Deployments across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Autonomous Coding Agent eval coverage.

About Replit

Replit is a browser-based collaborative coding platform. Replit Agent is its autonomous coding agent: it turns a prompt into an app plan, then builds, iterates on, and deploys the full application inside a Repl. Along the way it wires up Replit Auth, Replit DB, Object Storage, and Autoscale, Reserved VM, Static, or Scheduled Deployments, all under a checkpoint-based cost meter.

Employees

~150

Industry

Online IDE & Autonomous Coding Agent

Headquarters

San Francisco, CA

Website

replit.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	User: 'Deploy this WebSocket chat server with persistent connections.' Agent must pick the right deploy type.	WebSocket / long-lived connections need Reserved VM, not Autoscale. Autoscale scales to zero and assumes request/response patterns — websockets get torn down. Recommend Reserved VM in the plan and explain the cost tradeoff.	Pass / Fail · Code Assistant · critical
02	User wants `app.mydomain.com` mapped to an Autoscale Deployment. Replit issues a TXT verification record and a CNAME target.	Walk the user through adding the Replit-issued TXT (for ownership) and CNAME (for traffic) at their DNS host; verify resolution with `dig` before claiming the domain is live. Managed TLS provisions after CNAME propagates. Do not declare 'domain configured' before DNS verification.	Pass / Fail · Code Assistant · high
03	Repl has OPENAI_API_KEY in workspace Secrets. User clicks Deploy Autoscale. Agent must verify the secret reaches the deployment.	Per docs, deployment configurations have their own environment-variable section; Repl secrets do not auto-propagate to Deployments without an explicit promotion [REQUIRES-VERIFICATION on the exact UI]. Agent should surface 'declare OPENAI_API_KEY in the Deployment env' in the deploy plan, not assum…	Pass / Fail · Code Assistant · critical
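
The three sample cases describe checks an agent could run before it reports a deployment as done. The Python sketch below is illustrative only: every function name is hypothetical, the deploy-type rule is reduced to the single signal named in case 01, and the expected TXT and CNAME values are placeholders for whatever Replit actually issues. It assumes `dig` is installed on the machine running the check and that the TXT record lives at the hostname you pass in.

```python
from __future__ import annotations

import subprocess


def pick_deployment_type(needs_persistent_connections: bool) -> str:
    """Case 01: long-lived connections (e.g. WebSockets) point to Reserved VM.

    Assumption: anything else defaults to Autoscale. Static and Scheduled
    Deployments are deliberately left out of this simplified rule.
    """
    return "Reserved VM" if needs_persistent_connections else "Autoscale"


def dig_short(name: str, record_type: str) -> list[str]:
    """Resolve a record with `dig +short` and return cleaned values.

    Quotes around TXT values and the trailing dot on CNAME targets are removed.
    """
    result = subprocess.run(
        ["dig", "+short", record_type, name],
        capture_output=True,
        text=True,
        check=True,
    )
    return [
        line.strip().strip('"').rstrip(".")
        for line in result.stdout.splitlines()
        if line.strip()
    ]


def domain_is_verified(
    txt_records: list[str],
    cname_records: list[str],
    expected_txt: str,
    expected_cname: str,
) -> bool:
    """Case 02: report the domain as live only if both records resolve."""
    cnames = {record.rstrip(".") for record in cname_records}
    return expected_txt in txt_records and expected_cname.rstrip(".") in cnames


def missing_deployment_secrets(required: list[str], deployment_env: dict) -> list[str]:
    """Case 03: names the app needs that the deployment env does not declare."""
    return sorted(set(required) - set(deployment_env))
```

A caller might feed `dig_short("app.mydomain.com", "TXT")` and `dig_short("app.mydomain.com", "CNAME")` into `domain_is_verified`, and put any names returned by `missing_deployment_secrets` into the deploy plan rather than assuming workspace Secrets carry over.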

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. A pass requires a mean criterion score of at least 4.0, with no criterion scoring below 3.
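
One way to read the pass rule is as a check over a single case's per-criterion scores. The sketch below assumes the judge returns one numeric score per rubric criterion and that the mean is taken across criteria. The page does not spell out either detail, and the function name, argument names, and criterion keys are hypothetical.

```python
from statistics import mean


def case_passes(scores: dict, mean_threshold: float = 4.0, floor: float = 3.0) -> bool:
    """Return True when the mean score meets the threshold and no criterion is below the floor."""
    values = list(scores.values())
    if not values:
        # No scored criteria: treat as a failure rather than a vacuous pass.
        return False
    return mean(values) >= mean_threshold and min(values) >= floor
```

Under this reading, scores of 5, 4, and 3 pass (mean 4.0, lowest 3), while scores of 5, 5, and 2 fail on the floor even though the mean is also 4.0.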

Rubric criteria

Replit
Code Assistant
Deployments

Recommended for

Replit Agent, Replit customers

Works with

Replit

Frequently asked questions

What does the Deployments eval for Replit's Replit Agent test?

Evaluates Replit's Deployments across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Autonomous Coding Agent eval coverage.

How is the Deployments eval scored?

The judge grades each case against expected.ideal_behavior and expected.rubric. A pass requires a mean criterion score of at least 4.0, with no criterion scoring below 3.

How many test cases does this eval pack include?

The Deployments pack for Replit's Replit Agent contains 9 test cases. Three sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Deployments pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
