Iterative Editing And Diff

Bolt.new · Bolt

AI App Builder — Bolt (StackBlitz)

Evaluates Bolt's Iterative Editing & Diff across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's AI App Builder eval coverage.

About Bolt

Bolt is StackBlitz's AI app builder at bolt.new — turn a prompt into a working web app, iterate via chat-driven multi-file diffs, and run the project in an in-browser Node runtime (WebContainer) with no server VM. Bolt wires Supabase for database and auth, deploys to Netlify from chat, and syncs to GitHub.

Employees

~50

Industry

AI App Builder

Headquarters

San Francisco, CA

Website

bolt.new

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	User asks 'rename Button to PrimaryButton everywhere.' Bolt proposes a diff touching 14 files.	Show every changed file in the diff panel before applying, with per-file expand/collapse. The user must be able to reject individual files in the change; accept-all is convenient, but the reject path is non-negotiable. After apply, the WebContainer preview must re-render with the rename live.	Pass / Fail · AI Platform · critical
02	Bolt proposes a 14-file diff. User wants to accept 12 and reject the changes to two files (e.g., a custom-tweaked tailwind.config.js and a hand-edited README.md).	Apply the 12 accepted files and skip the 2 rejected ones, leaving their on-disk contents untouched. Subsequent chat context must reflect that those files were not changed, so the model does not assume the rename happened in them.	Pass / Fail · AI Platform · critical
03	After 8 turns, the user realizes turn 4's edit broke the design system and wants to roll back to the state at turn 3.	Rolling back to a prior chat turn must restore the project file tree to exactly what existed after that turn (same content, same lockfile, same chat state) and rerun install only if the lockfile changed. Forward chat turns are either discarded or moved to a side branch; ambiguity here is destruct…	Pass / Fail · AI Platform · critical
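
The behavior expected in cases 02 and 03 can be pictured with a small in-memory model. This is only an illustrative sketch, not Bolt's implementation: the `Project` class, its method names, and the choice of `package-lock.json` as the lockfile are all assumptions made for the example, and forward turns are simply discarded on rollback (the rubric also allows moving them to a side branch).

```python
class Project:
    """Hypothetical in-memory project: a path -> content map plus per-turn snapshots."""

    LOCKFILE = "package-lock.json"  # assumed lockfile name, for illustration only

    def __init__(self, files):
        self.files = dict(files)
        self.snapshots = []  # snapshots[i] is the file tree after turn i + 1

    def apply_turn(self, diff, rejected=()):
        """Apply only the accepted files of a proposed diff.

        Returns (applied, skipped) so the chat context can record which
        files were left unchanged.
        """
        rejected = set(rejected)
        applied, skipped = [], []
        for path, new_content in diff.items():
            if path in rejected:
                skipped.append(path)  # on-disk content stays as it was
            else:
                self.files[path] = new_content
                applied.append(path)
        self.snapshots.append(dict(self.files))
        return sorted(applied), sorted(skipped)

    def rollback(self, turn):
        """Restore the tree as it was after `turn` and discard later turns.

        Returns True when the lockfile differs from the pre-rollback state,
        i.e. when an install would need to be rerun.
        """
        if not 1 <= turn <= len(self.snapshots):
            raise ValueError(f"no snapshot for turn {turn}")
        lock_before = self.files.get(self.LOCKFILE)
        self.files = dict(self.snapshots[turn - 1])
        del self.snapshots[turn:]
        return self.files.get(self.LOCKFILE) != lock_before


project = Project({"Button.tsx": "export const Button", "README.md": "uses Button"})
applied, skipped = project.apply_turn(
    {"Button.tsx": "export const PrimaryButton", "README.md": "uses PrimaryButton"},
    rejected=["README.md"],
)
```

Returning the skipped paths explicitly is the point of the design: whatever feeds the next chat turn can state that `README.md` still refers to `Button`, which is what case 02 asks for.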
How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
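
One way to read the pass rule is sketched below. It assumes the judge returns one numeric score per rubric criterion (a 1 to 5 scale is assumed here; the page does not state the scale) and that the mean is taken across the criteria of a single test case. The function name and the criterion names are hypothetical.

```python
from statistics import mean


def case_passes(scores, mean_threshold=4.0, floor=3):
    """Return True when the mean score meets the threshold and no criterion is below the floor.

    `scores` maps a criterion name to the judge's score for one test case.
    """
    if not scores:
        raise ValueError("no criterion scores to grade")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Hypothetical criterion names, for illustration only.
result = case_passes({"diff_visibility": 5, "per_file_reject": 4, "preview_refresh": 3})
```

The floor matters independently of the mean: scores of 5, 5, and 2 average 4.0 but still fail, because one criterion falls below 3.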

Rubric criteria

Bolt
AI Platform
Iterative Editing And Diff

Recommended for

Bolt.new · Bolt customers

Works with

Bolt

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the Iterative Editing And Diff eval for Bolt (Bolt.new) test?

Evaluates Bolt's Iterative Editing & Diff across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's AI App Builder eval coverage.

How is the Iterative Editing And Diff eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Iterative Editing And Diff pack for Bolt (Bolt.new) contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Iterative Editing And Diff pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
