Search

Web Data for AI — Firecrawl

Evaluates Firecrawl's Search across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Web Data for AI eval coverage.

About Firecrawl

Firecrawl is a web-data API for AI — it turns websites into clean, LLM-ready markdown or structured data via scrape, crawl, map, search, and LLM-powered extract endpoints, with JS rendering, browser actions, and proxies. Developers use Firecrawl to feed agents, RAG pipelines, and structured-extraction workflows with reliable web content.

Employees: ~30
Industry: Web Data / Scraping
Headquarters: San Francisco, CA
Website: firecrawl.dev

Sample tests (showing 3 of 9)

#	Input	Expected behavior	Check
01	Agent wants the top web results for a query AND their cleaned content, and currently runs /v1/search then a separate /v1/scrape per result URL.	Pass scrapeOptions on the /v1/search call so each result is scraped to markdown in a single request, returning query results with content inline. Only fall back to per-URL scrape when a result needs different scrape settings.	Pass / Fail · AI Platform · medium
02	Agent needs only the top 5 results for a RAG step but leaves limit unset and ingests the full default result set, scraping each one.	Set the search limit (result count) to what the task needs (e.g. 5). With scrapeOptions enabled, every extra result scraped costs credits, so bound the count deliberately rather than over-fetching.	Pass / Fail · AI Platform · medium
03	Agent answers a 'what happened this week' question but search returns results from years ago because no time filter was applied.	Apply the tbs (time-based search) freshness filter to constrain results to the needed recency window (e.g. past week). Verify result dates rather than assuming default ordering is recency-ranked.	Pass / Fail · AI Platform · medium
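
The three expected behaviors above can be combined in a single search request. The following is a minimal sketch, not Firecrawl's official client: it assumes the endpoint lives at https://api.firecrawl.dev/v1/search, that the request body uses a `query` field, that `scrapeOptions` accepts `{"formats": ["markdown"]}`, and that `tbs` takes a Google-style value such as `qdr:w` for the past week. Only `scrapeOptions`, `limit`, and `tbs` are named in the test cases; the helper names and the other fields are illustrative and should be checked against the current Firecrawl API reference.

```python
import json
from urllib import request

# Assumed endpoint URL; confirm against the Firecrawl API reference.
FIRECRAWL_SEARCH_URL = "https://api.firecrawl.dev/v1/search"


def build_search_payload(query, limit=5, tbs="qdr:w"):
    """Build a search body that scrapes each result inline (hypothetical helper)."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    payload = {
        "query": query,                              # assumed field name
        "limit": limit,                              # case 02: bound the result count
        "scrapeOptions": {"formats": ["markdown"]},  # case 01: content inline
    }
    if tbs:
        payload["tbs"] = tbs                         # case 03: freshness filter (assumed value format)
    return payload


def build_search_request(api_key, payload):
    """Wrap the payload in a POST request with bearer auth (assumed auth scheme)."""
    return request.Request(
        FIRECRAWL_SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_search(api_key, payload, timeout=30):
    """Send the request and return the decoded JSON response."""
    with request.urlopen(build_search_request(api_key, payload), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `run_search(api_key, build_search_payload("firecrawl changelog"))` would issue one request for five scraped results from the past week, under the assumptions above. Result dates should still be checked by the caller, since the freshness filter does not guarantee recency-ranked ordering.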

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
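
As a rough illustration of the pass rule, the sketch below reads it as: a case passes when the mean of its criterion scores is at least 4.0 and no single criterion scores below 3. That reading, the function name, and the criterion names are assumptions; the page does not state the scoring scale or how the judge aggregates scores.

```python
def case_passes(scores, mean_threshold=4.0, floor=3):
    """Return True when the mean score meets the threshold and no criterion is below the floor.

    `scores` maps a criterion name to the judge's numeric score (hypothetical structure).
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    mean = sum(values) / len(values)
    return mean >= mean_threshold and min(values) >= floor


# Illustrative criterion names, not taken from the actual rubric.
print(case_passes({"uses_scrape_options": 5, "bounds_limit": 4, "applies_tbs": 3}))  # True
print(case_passes({"uses_scrape_options": 5, "bounds_limit": 5, "applies_tbs": 2}))  # False
```

The second call fails even though its mean is 4.0, because one criterion falls below the floor of 3.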

Recommended for

Firecrawl customers

Works with

Firecrawl

Frequently asked questions

What does the Search eval for Firecrawl test?

Evaluates Firecrawl's Search across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Web Data for AI eval coverage.

How is the Search eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Search pack for Firecrawl contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Search pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
