
Messages API and Streaming SSE
Anthropic evals — Messages API & Streaming SSE (relift v3 InfraRed)
About Anthropic
Anthropic is an AI safety company and the maker of Claude. Its API exposes the Claude model family (Opus, Sonnet, Haiku) with tool use, prompt caching, extended thinking, batch processing, vision, the Files and Memory tools, and the Claude Agent SDK.
Sample tests · showing 3 of 10
| # | Input | Expected behavior | Check | Category | Severity |
|---|---|---|---|---|---|
| 01 | Claude API streams a response to POST /v1/messages with stream=true. The client buffers events out of order because TCP delivered content_block_delta frames before content_block_start. | Parser must enforce SSE event ordering per docs: message_start → (content_block_start → content_block_delta* → content_block_stop)+ → message_delta → message_stop. Reject content_block_delta that arrives before the matching content_block_start with index N; do not emit synthetic blocks. | Pass / Fail | AI Platform | high |
| 02 | Mid-stream Claude emits a ping event (event: ping). The downstream client's JSON-only parser treats it as malformed and disconnects. | Treat ping as a keep-alive: ignore for content accumulation, reset read deadline, continue parsing. Do not surface to caller as content; do not abort the stream. | Pass / Fail | AI Platform | medium |
| 03 | Agent calls /v1/messages with max_tokens=512 for a long generation task. Response has stop_reason=max_tokens and an obviously truncated final content block. | Detect stop_reason=max_tokens and either (a) raise to caller as a partial completion, or (b) issue a continuation request appending the truncated assistant message and a 'continue' user turn. Never present the partial as a complete answer. | Pass / Fail | AI Platform | critical |
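The three expected behaviors in the table can be sketched as one small state machine. This is a minimal illustration, not a reference client: it assumes each SSE event has already been decoded into a dict carrying the `type`, `index`, and `delta` fields of the streamed JSON payloads, and the names `StreamAccumulator` and `StreamOrderError` are hypothetical. Resetting the read deadline on ping, handling `error` events, and issuing a continuation request are left to the caller.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class StreamOrderError(Exception):
    """An event arrived where the expected sequence does not allow it."""


@dataclass
class StreamAccumulator:
    """Hypothetical helper: validates event order and collects text deltas."""
    started: bool = False
    finished: bool = False
    open_index: Optional[int] = None  # index of the block currently open
    saw_message_delta: bool = False
    text: Dict[int, str] = field(default_factory=dict)
    stop_reason: Optional[str] = None
    pings: int = 0

    def feed(self, event: dict) -> None:
        kind = event.get("type")
        if kind == "ping":
            # Test 02: keep-alive only; count it and add nothing to content.
            self.pings += 1
            return
        if self.finished:
            raise StreamOrderError(f"{kind} after message_stop")
        if kind == "message_start":
            if self.started:
                raise StreamOrderError("duplicate message_start")
            self.started = True
            return
        if not self.started:
            raise StreamOrderError(f"{kind} before message_start")
        if kind == "content_block_start":
            if (self.open_index is not None or self.saw_message_delta
                    or event["index"] in self.text):
                raise StreamOrderError("content_block_start out of order")
            self.open_index = event["index"]
            self.text[self.open_index] = ""
        elif kind in ("content_block_delta", "content_block_stop"):
            if self.open_index is None or event["index"] != self.open_index:
                # Test 01: reject instead of inventing a synthetic block.
                raise StreamOrderError(f"{kind} for unopened block {event['index']}")
            if kind == "content_block_stop":
                self.open_index = None
            elif event["delta"].get("type") == "text_delta":
                self.text[self.open_index] += event["delta"]["text"]
        elif kind in ("message_delta", "message_stop"):
            if self.open_index is not None:
                raise StreamOrderError(f"{kind} while block {self.open_index} is open")
            if kind == "message_stop":
                self.finished = True
            else:
                self.saw_message_delta = True
                self.stop_reason = event.get("delta", {}).get("stop_reason")
        # Other event types (including "error") are ignored in this sketch.

    @property
    def truncated(self) -> bool:
        # Test 03: the caller must not treat this output as a complete answer.
        return self.stop_reason == "max_tokens"


events = [
    {"type": "message_start"},
    {"type": "content_block_start", "index": 0},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hel"}},
    {"type": "ping"},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "lo"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "message_delta", "delta": {"stop_reason": "max_tokens"}},
    {"type": "message_stop"},
]
acc = StreamAccumulator()
for ev in events:
    acc.feed(ev)
if acc.truncated:
    print("partial completion:", acc.text[0])
```

Raising on the first out-of-order event keeps the failure visible to the caller. A client that prefers to recover could catch `StreamOrderError` and retry the request, but that policy is outside what the table specifies.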
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
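One possible reading of the pass rule, sketched in Python. The function name `passes`, the shape of the `scores` mapping, and the interpretation (mean taken across criteria, with a per-criterion floor) are assumptions; the page does not state the scoring scale or how scores are aggregated.

```python
from statistics import mean
from typing import Mapping


def passes(scores: Mapping[str, float],
           mean_threshold: float = 4.0,
           floor: float = 3.0) -> bool:
    """Return True when the mean score meets the threshold and no
    single criterion falls below the floor (assumed interpretation)."""
    if not scores:
        return False  # nothing graded, so nothing can pass
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Hypothetical criterion names, for illustration only.
ok = passes({"ordering": 5, "ping": 4, "truncation": 3})
low_criterion = passes({"ordering": 5, "ping": 5, "truncation": 2})
```

In the second call the mean is still 4.0, but one criterion sits below 3, so under this reading the test fails.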
Rubric criteria
- Anthropic
- AI Platform
- Messages API and Streaming SSE