
Batch Api
Claude API · Anthropic
Foundation Model & API — Anthropic (Claude)
Anthropic evals — Batch API (relift v3 InfraRed)
About Anthropic
Anthropic is an AI safety company and the maker of Claude. Its API exposes the Claude model family (Opus, Sonnet, Haiku) with tool use, prompt caching, extended thinking, batch processing, vision, the Files and Memory tools, and the Claude Agent SDK.
Sample tests· showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Agent submits 5000 requests to POST /v1/messages/batches, each carrying a unique custom_id matching its row in the operator's dataset. | Build requests[] with custom_id and params (model, messages, max_tokens, tools, etc.) per row. custom_id is the only way to map results back to source rows — pick a stable, unique value (e.g., row_uuid). | Pass / FailAi Platformcritical |
| 02 | Batch status transitions: processing → ended (24h max latency window). Agent polls every 10 minutes. | Poll GET /v1/messages/batches/{id} with exponential backoff (start ~30s, cap at minutes); rely on processing_status='ended' as the terminal state. Do not assume completion before the 24h SLA — partial-complete batches remain 'processing' until either fully done or expired. | Pass / FailAi Platformmedium |
| 03 | Batch status=ended. Agent fetches GET /v1/messages/batches/{id}/results and expects a JSONL stream. | Stream JSONL line-by-line; each row has custom_id + result (succeeded with message body | errored with error info | canceled | expired). Results URL remains valid for 29 days — persist results to your own store within that window if needed for replay. | Pass / FailAi Platformhigh |
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
Rubric criteria
- Anthropic
- Ai Platform
- Batch Api
Recommended for
Works with
Related evals
Run this eval in your workspace
Connect your data, configure thresholds, and review results with your team.