
Evals for Anthropic
8 evaluation packs covering adversarial robustness, safety gates, workflow quality, and operator-level checks for Anthropic AI products.
About Anthropic
Anthropic is an AI safety company and the maker of Claude. Its API exposes the Claude model family (Opus, Sonnet, Haiku) with tool use, prompt caching, extended thinking, batch processing, vision, the Files API, the Memory tool, and the Claude Agent SDK.
Available eval packs for Anthropic
8 packs ready to run.
Batch API
Anthropic evals: Batch API (relift v3 InfraRed)
Extended Thinking
Anthropic evals: Extended Thinking (relift v3 InfraRed)
Files API and Citations
Answer Relevance
Anthropic evals: Files API & Citations (relift v3 InfraRed)
Memory Tool and Context Editing
Knowledge Retention
Anthropic evals: Memory Tool & Context Editing (relift v3 InfraRed)
Messages API and Streaming SSE
Anthropic evals: Messages API & Streaming SSE (relift v3 InfraRed)
Prompt Caching
Anthropic evals: Prompt Caching (relift v3 InfraRed)
Refusals, Safety, and Agent SDK
Anthropic evals: Refusals, Safety & Agent SDK / Claude Code (relift v3 InfraRed)
Tool Use and Schema Validation
Tool Selection
Anthropic evals: Tool Use & Schema Validation (relift v3 InfraRed)
Why eval Anthropic AI
Anthropic's AI features ship behind brand promises about accuracy, safety, and reliability. Buyers and integrators need to know those promises hold up under adversarial prompts, edge-case workflows, and the long tail of real customer inputs — not just the demo path.
The Corsac eval library for Anthropic measures the four dimensions teams care about most when deploying AI platform agents:
- Adversarial robustness: does the agent resist prompt injection, jailbreaks, and social-engineering attempts?
- Workflow quality: does it complete the task buyers were shown in the demo, on inputs that don't look like the demo?
- Safety gates: does it escalate or refuse when it should, and only then?
- Operator quality: does it preserve analyst trust by surfacing the right context at the right time?
Every eval pack above is hand-authored against Anthropic's public product surface and runnable in Corsac with your own data.