Tool Use And Midbuild Function Calls

Replit Agent · Replit

Autonomous Coding Agent — Replit Agent

Evaluates Replit's Tool Use & Mid-build Function Calls across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Autonomous Coding Agent eval coverage.

About Replit

Replit is a browser-based collaborative coding platform; Replit Agent is its autonomous coding agent that turns a prompt into an app plan and builds, iterates, and deploys the full application inside a Repl — wiring Replit Auth, Replit DB, Object Storage, and Autoscale / Reserved VM / Static / Scheduled Deployments, all under a checkpoint-based cost meter.

Employees

~150

Industry

Online IDE & Autonomous Coding Agent

Headquarters

San Francisco, CA

Website

replit.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Agent decides to install `pandas`. It must use the Packages tool (which edits replit.nix or pyproject.toml as appropriate), not raw shell.	Use the workspace's Packages tool (or the equivalent function call that updates the manifest). The install then persists across Repl restarts. Raw `pip install pandas` in shell does not update the manifest, so the next cold start loses the package.	Pass / Fail · Code Assistant · high
02	Agent runs the build via the run_command tool. It must read stderr to diagnose compile errors, not assume success on absence of an exception.	Capture both stdout and stderr; check the exit code. On non-zero exit, parse the error and emit a targeted fix. Do not declare 'build succeeded' just because the tool call returned without throwing.	Pass / Fail · Code Assistant · critical
03	After starting the server, Agent uses browser_preview to verify the app renders. It must point at the Replit-forwarded URL for this Repl, not localhost.	Use the workspace-provided preview URL (e.g., the .replit.dev / .repl.co subdomain for this Repl). localhost from inside the browser-preview tool does not reach the Repl's process.	Pass / Fail · Code Assistant · high
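The expected behavior in case 02 can be illustrated with a minimal Python sketch. It assumes the build is an ordinary subprocess; `run_build`, `BuildResult`, and `first_error_line` are hypothetical names chosen for illustration and are not part of Replit's run_command tool, whose actual interface is not described on this page.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BuildResult:
    ok: bool          # True only when the process exited with code 0
    exit_code: int
    stdout: str
    stderr: str


def run_build(cmd: Sequence[str]) -> BuildResult:
    """Run a build command, capturing both streams and the exit code.

    subprocess.run does not raise on a non-zero exit unless check=True,
    so "no exception" says nothing about whether the build succeeded.
    """
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    return BuildResult(
        ok=proc.returncode == 0,
        exit_code=proc.returncode,
        stdout=proc.stdout,
        stderr=proc.stderr,
    )


def first_error_line(stderr: str) -> Optional[str]:
    """Return the first stderr line mentioning 'error', or None if there is none."""
    for line in stderr.splitlines():
        if "error" in line.lower():
            return line.strip()
    return None


if __name__ == "__main__":
    # Hypothetical failing "build": writes a compiler-style message and exits 2.
    fake_build = [
        sys.executable,
        "-c",
        "import sys; sys.stderr.write('main.c:3: error: missing semicolon\\n'); sys.exit(2)",
    ]
    result = run_build(fake_build)
    if not result.ok:
        print("build failed with exit code", result.exit_code)
        print("first error:", first_error_line(result.stderr))
```

The point of the sketch is the ordering of checks: the exit code decides success, and stderr is read to target the fix. A real agent would need a more robust error parser than a substring match.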
How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
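One way to read the pass rule is sketched below in Python. It assumes the judge returns one numeric score per rubric criterion for each test case, and that a case passes when the mean of those scores is at least 4.0 and none is below 3. That reading, the function name `case_passes`, and the criterion names are assumptions for illustration; the page does not specify the score scale or the exact aggregation.

```python
from statistics import mean
from typing import Mapping


def case_passes(
    scores: Mapping[str, float],
    mean_threshold: float = 4.0,  # "mean >= 4.0" from the grading rule
    floor: float = 3.0,           # "no criterion below 3"
) -> bool:
    """Apply the two-part rule to one test case's per-criterion scores."""
    if not scores:
        return False  # assumption: a case with no scores cannot pass
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


if __name__ == "__main__":
    # Hypothetical criterion names; the real rubric is not shown on this page.
    print(case_passes({"uses_packages_tool": 5, "manifest_updated": 4, "explains_choice": 3}))
    print(case_passes({"uses_packages_tool": 5, "manifest_updated": 5, "explains_choice": 2}))
```

The second example shows why the floor matters: its mean is still 4.0, but one criterion falls below 3, so the case fails.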

Rubric criteria

Replit
Code Assistant
Tool Use And Midbuild Function Calls

Recommended for

Replit Agent
Replit customers

Works with

Replit

Related evals

Code Assistant

Browserbase

Evaluates Browserbase's Captcha Handling across scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Browser infrastructure eval coverage.

Code Assistant

Browserbase

Evaluates Browserbase's Concurrency & Rate Limits across scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Browser infrastructure eval coverage.

Code Assistant

Browserbase

Evaluates Browserbase's Live Debugging & Session Inspector across scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Browser infrastructure eval coverage.

Frequently asked questions

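What does the Tool Use And Midbuild Function Calls eval for Replit Agent test?

It checks how Replit Agent selects and uses tools mid-build across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge. The sample cases cover installing packages through the Packages tool rather than raw shell, reading stderr and exit codes after run_command, and pointing browser_preview at the Replit-forwarded URL rather than localhost.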

How is the Tool Use And Midbuild Function Calls eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Tool Use And Midbuild Function Calls pack for Replit Agent contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Tool Use And Midbuild Function Calls pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.