Chains

Baseten

AI Model Serving — Baseten

Evaluates Baseten's Chains across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's AI Model Serving eval coverage.

About Baseten

Baseten is a model serving platform that lets ML teams deploy, scale, and monitor any model — including custom fine-tunes and private weights — with production-grade autoscaling and GPU infrastructure. It supports both synchronous and asynchronous inference patterns.

Employees

~100

Industry

Model Serving

Headquarters

San Francisco, CA

Website

baseten.co

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Chain composes three chainlets: transcribe (audio→text), summarize (text→summary), translate (summary→localized). Operator returns a dict from transcribe instead of the declared TranscribeOutput pydantic model.	Each chainlet declares typed inputs and outputs (pydantic models). The Chains runtime validates at the hop boundary; returning an untyped dict triggers a schema-mismatch failure at the next hop. Declare the model and import it from a shared package consumed by both chainlets.	Pass / Fail · AI Platform · high
02	Client calls POST /chains/{chain_id}/run with the entrypoint chainlet's input shape. Response is the terminal chainlet's output plus a per-chainlet trace.	Address the chain via its chain_id, not by stringing predict calls across the constituent chainlet URLs; doing so bypasses the Chains runtime's tracing and error contract. Surface the per-chainlet trace to the operator's observability store for replay.	Pass / Fail · AI Platform · high
03	Chainlet 2 of 3 raises an exception. The runtime returns a chain-level failure response with the offending chainlet's name and stack location.	Surface chain-level failures with chainlet identity (which hop failed) and the upstream chainlets' outputs where present, so the operator can resume from the failure point instead of replaying the whole graph. Honor [REQUIRES-VERIFICATION] for the exact partial-output contract.	Pass / Fail · AI Platform · high
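
The behavior described in cases 01 and 03 can be illustrated with a small, dependency-free sketch. This is not the Baseten Chains SDK: Hop, ChainFailure, and run_chain are hypothetical names invented for illustration, and stdlib dataclasses stand in for the pydantic models the table mentions. It only shows the idea of checking types at each hop boundary and reporting which chainlet failed along with the upstream outputs collected so far.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class TranscribeOutput:
    text: str


@dataclass
class SummarizeOutput:
    summary: str


@dataclass
class Hop:
    """One step of the toy chain (hypothetical, not a Baseten type)."""
    name: str
    input_type: type
    fn: Callable[[Any], Any]


class ChainFailure(Exception):
    """Chain-level failure carrying the failing hop and upstream outputs."""

    def __init__(self, chainlet: str, reason: str, upstream_outputs: Dict[str, Any]):
        super().__init__(f"{chainlet}: {reason}")
        self.chainlet = chainlet
        self.reason = reason
        self.upstream_outputs = upstream_outputs


def run_chain(hops: List[Hop], payload: Any) -> Tuple[Any, Dict[str, Any]]:
    """Run hops in order, validating the payload type at every hop boundary."""
    outputs: Dict[str, Any] = {}
    for hop in hops:
        # Boundary check: an untyped dict from the previous hop fails here.
        if not isinstance(payload, hop.input_type):
            raise ChainFailure(
                hop.name,
                f"schema mismatch: expected {hop.input_type.__name__}, "
                f"got {type(payload).__name__}",
                dict(outputs),
            )
        try:
            payload = hop.fn(payload)
        except Exception as exc:
            # Report which hop raised, plus everything produced before it.
            raise ChainFailure(
                hop.name, f"raised {type(exc).__name__}: {exc}", dict(outputs)
            ) from exc
        outputs[hop.name] = payload
    return payload, outputs


def transcribe(audio: bytes) -> TranscribeOutput:
    return TranscribeOutput(text="hello world")


def transcribe_untyped(audio: bytes) -> dict:
    # The mistake from case 01: a plain dict instead of TranscribeOutput.
    return {"text": "hello world"}


def summarize(t: TranscribeOutput) -> SummarizeOutput:
    return SummarizeOutput(summary=t.text.upper())
```

With transcribe in the first hop the chain completes; swapping in transcribe_untyped makes run_chain raise ChainFailure at the summarize hop, with the transcribe output still available in upstream_outputs for inspection or resumption.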

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
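
One plausible reading of this threshold can be sketched as a small function. The interpretation is an assumption: the page does not state the score scale or whether the mean is taken per case or per pack, so the sketch treats one case's rubric scores as a mapping from criterion name to numeric score. The function name case_passes and its parameters are hypothetical.

```python
from statistics import mean
from typing import Mapping


def case_passes(
    scores: Mapping[str, float],
    mean_threshold: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Pass when the mean score is >= mean_threshold and no score is below floor.

    Assumed reading of the grading rule; the scale itself is not specified.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor
```

Under this reading, scores of 5, 4, and 3 pass (mean 4.0, lowest 3), while 5, 5, and 2 fail despite the same mean because one criterion falls below the floor.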

Rubric criteria

Baseten
AI Platform
Chains

Recommended for

Baseten customers

Works with

Baseten

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the Chains eval for Baseten test?

Evaluates Baseten's Chains across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's AI Model Serving eval coverage.

How is the Chains eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Chains pack for Baseten contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Chains pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
