Prompt To App Generation

Bolt.new · Bolt

AI App Builder — Bolt (StackBlitz)

Evaluates Bolt's Prompt-to-App Generation across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's AI App Builder eval coverage.

About Bolt

Bolt is StackBlitz's AI app builder at bolt.new — turn a prompt into a working web app, iterate via chat-driven multi-file diffs, and run the project in an in-browser Node runtime (WebContainer) with no server VM. Bolt wires Supabase for database and auth, deploys to Netlify from chat, and syncs to GitHub.

Employees: ~50
Industry: AI App Builder
Headquarters: San Francisco, CA
Website: bolt.new

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	User opens bolt.new and types: 'Build a marketing site with a blog and SSR; I want it deployed to Netlify.' Bolt must pick a starter framework that fits SSR + Netlify-friendly output.	Pick a framework whose default build output is Netlify-deployable without extra config (e.g., Next.js, Astro, SvelteKit, Remix). State the choice in chat with one sentence of rationale before scaffolding so the user can correct on turn 1 without burning more tokens. Do not pick a CSR-only Vite + Re…	Pass / Fail · AI Platform · high
02	First-turn scaffold for a Next.js app. The chat declares it will create app/, components/, lib/, package.json, next.config.js, but the diff only contains app/ and package.json.	Every file the chat declares it will create must appear in the applied diff. The WebContainer preview must boot; a missing next.config.js, or a missing tailwind.config.js when the prompt asked for Tailwind, is a scaffold failure. If a file is intentionally deferred, say so explicitly in chat.	Pass / Fail · AI Platform · critical
03	Initial scaffold writes package.json with 'react': '*' and 'next': 'latest' for the user's Next.js app.	Pin to exact or caret-pinned versions known to work together in WebContainer (e.g., 'next': '^14.2.0', 'react': '^18.3.0'). '*' and 'latest' are reproducibility hazards: every fresh install may resolve to a different working tree. Surface the pinned versions in chat so the user knows what was chos…	Pass / Fail · AI Platform · high
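
Cases 02 and 03 describe conditions that can also be checked mechanically, alongside the LLM judge. The sketch below is one possible way to do that in Python. It is not Corsac's or Bolt's implementation: the function names, the input shapes (a list of declared paths, a list of diff paths, a parsed package.json dict), and the rule that only `x.y.z` and `^x.y.z` count as pinned are all assumptions made for illustration.

```python
import re

# Hypothetical helpers for illustration; not part of any Corsac or Bolt API.

# Assumption: "pinned" means an exact version ("14.2.0") or a caret range
# ("^14.2.0"), as in case 03. Pre-release tags and other ranges are flagged.
PINNED = re.compile(r"^\^?\d+\.\d+\.\d+$")


def missing_declared_files(declared, diff_paths):
    """Return declared paths that nothing in the applied diff satisfies.

    A declared directory such as "app/" is satisfied by any diff path under it;
    a declared file must appear in the diff with the same path.
    """
    missing = []
    for item in declared:
        if item.endswith("/"):
            found = any(path.startswith(item) for path in diff_paths)
        else:
            found = item in diff_paths
        if not found:
            missing.append(item)
    return missing


def unpinned_dependencies(package_json):
    """Return {name: spec} for dependencies that are not exact or caret-pinned."""
    bad = {}
    for section in ("dependencies", "devDependencies"):
        for name, spec in package_json.get(section, {}).items():
            if not PINNED.match(spec):
                bad[name] = spec
    return bad


# Case 02: the chat declared five paths, the diff delivered two.
declared = ["app/", "components/", "lib/", "package.json", "next.config.js"]
diff_paths = ["app/page.tsx", "package.json"]
print(missing_declared_files(declared, diff_paths))

# Case 03: wildcard and dist-tag specs are flagged.
print(unpinned_dependencies({"dependencies": {"react": "*", "next": "latest"}}))
```

A deterministic pre-check like this would only catch the structural part of each case. Whether the chat explained a deferred file or surfaced the pinned versions still needs the judge.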

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
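
One reading of this rule is that a test case passes when the mean of its criterion scores is at least 4.0 and no single criterion scores below 3. A minimal Python sketch under that reading follows. The function name, the criterion names, and the 1-to-5 score scale are assumptions; the page does not list the rubric criteria or the scale.

```python
from statistics import mean


def case_passes(scores, mean_threshold=4.0, floor=3):
    """Apply the pass rule to one test case.

    `scores` maps a criterion name to the judge's numeric score.
    Assumption: scores are on a 1-to-5 scale; the source does not say.
    """
    if not scores:
        return False  # nothing graded, so nothing can pass
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Hypothetical criterion names, for illustration only.
print(case_passes({"framework_fit": 5, "rationale": 4, "scaffold": 3}))  # mean 4.0, min 3
print(case_passes({"framework_fit": 5, "rationale": 5, "scaffold": 2}))  # mean 4.0, min 2
```

The second call shows why the floor matters: a high average does not compensate for one criterion scored below 3.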

Rubric criteria

Bolt
AI Platform
Prompt To App Generation

Recommended for

Bolt.new · Bolt customers

Works with

Bolt

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the Prompt To App Generation eval for Bolt (Bolt.new) test?

Evaluates Bolt's Prompt-to-App Generation across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's AI App Builder eval coverage.

How is the Prompt To App Generation eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Prompt To App Generation pack for Bolt (Bolt.new) contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Prompt To App Generation pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
