
Tool Use and Mid-build Function Calls
Autonomous Coding Agent — Replit Agent
Replit evals — Tool Use & Mid-build Function Calls (relift v3 InfraRed)
About Replit
Replit is a browser-based collaborative coding platform; Replit Agent is its autonomous coding agent that turns a prompt into an app plan and builds, iterates, and deploys the full application inside a Repl — wiring Replit Auth, Replit DB, Object Storage, and Autoscale / Reserved VM / Static / Scheduled Deployments, all under a checkpoint-based cost meter.
Employees
~150
Industry
Online IDE & Autonomous Coding Agent
Headquarters
San Francisco, CA
Website
replit.com
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Agent decides to install `pandas`. It must use the Packages tool (which edits replit.nix or pyproject.toml as appropriate), not raw shell. | Use the workspace's Packages tool (or the equivalent function call that updates the manifest). The install then persists across Repl restarts. Raw `pip install pandas` in shell does not update the manifest and the next cold start loses the package. | Pass / Fail · Code Assistant · high |
| 02 | Agent runs the build via the run_command tool. It must read stderr to diagnose compile errors, not assume success on absence of an exception. | Capture both stdout and stderr; check exit code. On non-zero exit, parse the error and emit a targeted fix. Do not declare 'build succeeded' just because the tool call returned without throwing. | Pass / Fail · Code Assistant · critical |
| 03 | After starting the server, Agent uses browser_preview to verify the app renders. It must point at the Replit-forwarded URL for this Repl, not localhost. | Use the workspace-provided preview URL (e.g., the .replit.dev / .repl.co subdomain for this Repl). localhost from inside the browser-preview tool does not reach the Repl's process. | Pass / Fail · Code Assistant · high |
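The behavior expected in test 02 can be sketched roughly as below. This is a minimal illustration, not Replit Agent's actual `run_command` interface, which the listing does not document: Python's standard `subprocess` module stands in for the tool, and `CommandResult`, `run_build`, and `first_error_line` are hypothetical names introduced only for this example.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CommandResult:
    """Everything a caller needs to judge a build: exit code plus both streams."""
    exit_code: int
    stdout: str
    stderr: str

    @property
    def succeeded(self) -> bool:
        # Success is defined by the exit code, not by the absence of an exception.
        return self.exit_code == 0


def run_build(argv: List[str], timeout_s: float = 300.0) -> CommandResult:
    """Run a command and capture stdout, stderr, and the exit code.

    check=False means a non-zero exit does NOT raise, so the caller must
    inspect the result explicitly (the failure mode test 02 targets).
    """
    proc = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_s, check=False
    )
    return CommandResult(proc.returncode, proc.stdout, proc.stderr)


def first_error_line(result: CommandResult) -> Optional[str]:
    """Return the first stderr line mentioning 'error', or None if there is none.

    A deliberately naive heuristic; real compilers differ in output format.
    """
    for line in result.stderr.splitlines():
        if "error" in line.lower():
            return line.strip()
    return None
```

A caller would branch on `result.succeeded` and, on failure, feed `first_error_line(result)` (or the full stderr) into whatever produces the targeted fix, rather than reporting success because the call returned.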
How this eval is graded
Grade each response against expected.ideal_behavior and expected.rubric. A pass requires a mean criterion score of at least 4.0, with no individual criterion scoring below 3.
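The pass rule can be expressed as a short check. The sketch below assumes the mean is taken across the per-criterion scores of a single response; the listing does not state the scoring scale or the criterion names, so the function name `passes` and the example keys are hypothetical.

```python
from statistics import mean
from typing import Dict


def passes(
    scores: Dict[str, float],
    mean_threshold: float = 4.0,  # from the listing: mean >= 4.0
    floor: float = 3.0,           # from the listing: no criterion below 3
) -> bool:
    """Return True when the mean meets the threshold and no score is under the floor."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Hypothetical criterion names, for illustration only.
print(passes({"tool_choice": 5, "error_handling": 4, "verification": 3}))
print(passes({"tool_choice": 5, "error_handling": 5, "verification": 2}))
```

Under this reading, the second call fails even though its mean is 4.0, because one criterion falls below the floor of 3.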
Rubric criteria
- Replit
- Code Assistant
- Tool Use And Midbuild Function Calls