
Evals for Replit
9 evaluation packs covering adversarial robustness, safety gates, workflow quality, and operator-level checks for Replit AI products.
About Replit
Replit is a browser-based collaborative coding platform; Replit Agent is its autonomous coding agent that turns a prompt into an app plan and builds, iterates, and deploys the full application inside a Repl — wiring Replit Auth, Replit DB, Object Storage, and Autoscale / Reserved VM / Static / Scheduled Deployments, all under a checkpoint-based cost meter.
Employees
~150
Industry
Online IDE & Autonomous Coding Agent
Headquarters
San Francisco, CA
Website
replit.com
Available eval packs for Replit
9 packs ready to run.
Ghostwriter Completion Smoke V1
Correctness, Code Checker
Replit generates correct, maintainable code completions that satisfy task intent without unsafe patterns.
Agent Planning And Build Flow
Replit evals — Agent Planning & Build Flow (relift v3 InfraRed)
Auth And Replit Auth
Replit evals — Auth & Replit Auth (relift v3 InfraRed)
Collaboration And Multiplayer
Replit evals — Collaboration & Multiplayer (relift v3 InfraRed)
Deployments
Replit evals — Deployments (relift v3 InfraRed)
Repl Workspace And Files
Replit evals — Repl Workspace & Files (relift v3 InfraRed)
Replit Db And Storage
Replit evals — Replit DB & Storage (relift v3 InfraRed)
Safety Cost And Governance
Replit evals — Safety, Cost & Governance (relift v3 InfraRed)
Tool Use And Midbuild Function Calls
Tool Selection
Replit evals — Tool Use & Mid-build Function Calls (relift v3 InfraRed)
Why eval Replit AI
Replit's AI features ship behind brand promises about accuracy, safety, and reliability. Buyers and integrators need to know those promises hold up under adversarial prompts, edge-case workflows, and the long tail of real customer inputs — not just the demo path.
The Corsac eval library for Replit measures the four dimensions teams care about most when deploying enterprise AI agents:
- Adversarial robustness — does the agent resist prompt injection, jailbreaks, and social-engineering attempts?
- Workflow quality — does it complete the task buyers were shown in the demo, on inputs that don't look like the demo?
- Safety gates — does it escalate or refuse when it should, and only then?
- Operator quality — does it preserve analyst trust by surfacing the right context at the right time?
Every eval pack above is hand-authored against Replit's public product surface and runnable in Corsac with your own data.