For Browserbase · AI Platform

Stagehand Act Extract Observe

Browserbase (cloud headless Chromium + Stagehand SDK) · Browserbase

Browser Infrastructure for AI Agents — Browserbase

Evaluates Browserbase's act / extract / observe across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Browser Infrastructure for AI Agents eval coverage.

About Browserbase

Browserbase provides cloud headless-browser infrastructure for AI agents — managed Chromium sessions with stealth mode, captcha handling, proxies, session persistence, live debugging, and the Stagehand SDK for act/extract/observe automation.

Employees

~40

Industry

Browser Infrastructure

Headquarters

San Francisco, CA

Website

browserbase.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Stagehand init with env=BROWSERBASE. Agent calls await stagehand.page.act('click the Sign in button').	act() resolves a single high-level action against the live DOM via the model. Verify the resulting action via a post-click anchor (URL change, observe() result, or DOM presence of a known signed-in element). Do not chain multiple imperative steps in one act() string.	Pass / Fail · AI Platform · high
02	Agent extracts product details: await stagehand.page.extract({instruction:'price and title', schema:z.object({price:z.string(), title:z.string()})}).	Pass a strict Zod schema so the model returns typed JSON. Validate the result at the boundary and route schema-violation back as a retry signal. Persist raw model output alongside the typed object for audit when downstream uses it for decisions.	Pass / Fail · AI Platform · critical
03	Single-page app renders the login button asynchronously. Agent calls act('click login') before observe() has confirmed the button is present.	Use observe() to enumerate candidate actions/selectors and confirm the target exists before act(). Re-observe if act() fails so the model's plan is grounded in the current DOM, not the prior snapshot.	Pass / Fail · AI Platform · high
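
The expected behaviors in the sample cases above can be sketched roughly as follows. This is a minimal illustration, not Stagehand's actual API: `PageLike`, `Candidate`, `groundedAct`, `extractProduct`, and `isProduct` are hypothetical names introduced here, and the hand-written type guard stands in for the Zod schema so the sketch has no dependencies.

```typescript
// Hypothetical stand-ins for the page surface; the real Stagehand types differ.
interface Candidate { selector: string; description: string }
interface PageLike {
  observe(instruction: string): Promise<Candidate[]>;
  act(instruction: string): Promise<void>;
  extract(instruction: string): Promise<unknown>;
}

// Cases 01 and 03: confirm the target with observe(), issue one act(),
// then check a post-action anchor. A real loop would also wait between attempts.
async function groundedAct(
  page: PageLike,
  instruction: string,
  verify: () => Promise<boolean>,
  maxAttempts = 3,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Observed fresh on every attempt, so the plan reflects the current DOM.
    const candidates = await page.observe(instruction);
    if (candidates.length === 0) continue; // target not rendered yet
    try {
      await page.act(instruction);
    } catch {
      continue; // act() failed: loop back and re-observe
    }
    if (await verify()) return true; // post-action anchor confirmed
  }
  return false;
}

interface Product { price: string; title: string }

// Assumption: a hand-rolled guard playing the role of a strict schema check.
function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.price === "string" && typeof v.title === "string";
}

// Case 02: validate at the boundary, retry on schema violation,
// and keep every raw output alongside the typed result for audit.
async function extractProduct(
  page: PageLike,
  maxAttempts = 3,
): Promise<{ product: Product; raw: unknown[] }> {
  const raw: unknown[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await page.extract("price and title");
    raw.push(result);
    if (isProduct(result)) return { product: result, raw };
  }
  throw new Error(`extract() violated the schema ${maxAttempts} times`);
}
```

In case 02 as written, the schema is passed to extract() itself; the guard here only illustrates the validate-then-retry boundary. The `verify` callback is where a URL check or a signed-in element check from case 01 would go.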

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
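
One reading of this threshold, as a small sketch: a case passes when the mean of its criterion scores is at least 4.0 and no single criterion scores below 3. The function name `passes` and the assumption that scores are plain numbers (for example on a 1 to 5 scale) are illustrative; the page does not specify the scale or how the judge aggregates.

```typescript
// Assumption: `scores` holds one numeric judge score per rubric criterion.
function passes(scores: number[]): boolean {
  if (scores.length === 0) return false; // nothing graded, nothing passed
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  const lowest = Math.min(...scores);
  return mean >= 4.0 && lowest >= 3;
}
```

Under this reading, a strong average does not rescue a case with one badly failed criterion, which is why both conditions are checked.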

Rubric criteria

Browserbase
AI Platform
Stagehand Act Extract Observe

Recommended for

Browserbase (cloud headless Chromium + Stagehand SDK) · Browserbase customers

Works with

Browserbase

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.


Frequently asked questions

What does the Stagehand Act Extract Observe eval for Browserbase (cloud headless Chromium + Stagehand SDK) test?

How is the Stagehand Act Extract Observe eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Stagehand Act Extract Observe pack for Browserbase (cloud headless Chromium + Stagehand SDK) contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Stagehand Act Extract Observe pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
