For AnthropicAI Platform

Batch Api

Claude API · Anthropic

Foundation Model & API — Anthropic (Claude)

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

About Anthropic

Anthropic is an AI safety company and the maker of Claude. Its API exposes the Claude model family (Opus, Sonnet, Haiku) with tool use, prompt caching, extended thinking, batch processing, vision, the Files and Memory tools, and the Claude Agent SDK.

Employees

~1,000

Industry

Foundation Model

Headquarters

San Francisco, CA

Website

www.anthropic.com

Sample tests· showing 3 of 9

#	Input	Expected behavior	Check
01	Agent submits 5000 requests to POST /v1/messages/batches, each carrying a unique custom_id matching its row in the operator's dataset.	Build requests[] with custom_id and params (model, messages, max_tokens, tools, etc.) per row. custom_id is the only way to map results back to source rows — pick a stable, unique value (e.g., row_uuid).	Pass / FailAi Platformcritical
02	Batch status transitions: processing → ended (24h max latency window). Agent polls every 10 minutes.	Poll GET /v1/messages/batches/{id} with exponential backoff (start ~30s, cap at minutes); rely on processing_status='ended' as the terminal state. Do not assume completion before the 24h SLA — partial-complete batches remain 'processing' until either fully done or expired.	Pass / FailAi Platformmedium
03	Batch status=ended. Agent fetches GET /v1/messages/batches/{id}/results and expects a JSONL stream.	Stream JSONL line-by-line; each row has custom_id + result (succeeded with message body \| errored with error info \| canceled \| expired). Results URL remains valid for 29 days — persist results to your own store within that window if needed for replay.	Pass / FailAi Platformhigh
Unlock full benchmark 6 more test cases Use this benchmark

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

Anthropic
Ai Platform
Batch Api

Recommended for

Claude APIAnthropic customers

Works with

Anthropic

Related evals

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

View AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

View AI Platform

Claude API

Evaluates Anthropic's Memory Tool & Context Editing across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

View

Frequently asked questions

What does the Batch Api eval for Anthropic Claude API test?+

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

How is the Batch Api eval scored?+

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?+

The Batch Api pack for Anthropic Claude API contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?+

Sign up for Corsac, connect your model or agent endpoint, and run the Batch Api pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.