
Chains
AI Model Serving — Baseten
Baseten evals — Chains (relift v3 InfraRed)
About Baseten
Baseten is a model serving platform that lets ML teams deploy, scale, and monitor any model — including custom fine-tunes and private weights — with production-grade autoscaling and GPU infrastructure. It supports both synchronous and asynchronous inference patterns.
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Chain composes three chainlets: transcribe (audio→text), summarize (text→summary), translate (summary→localized). Operator returns dict from transcribe instead of the declared TranscribeOutput pydantic model. | Each chainlet declares typed inputs and outputs (pydantic models). The Chains runtime validates at the hop boundary; returning an untyped dict triggers a schema-mismatch failure at the next hop. Declare the model and import it from a shared package consumed by both chainlets. | Pass / Fail · Ai Platform · high |
| 02 | Finance asks: how much of the chain's spend is the LLM hop vs the embedding hop? | Chains usage reporting breaks GPU-seconds out per chainlet. Aggregate over the billing window per chainlet name to attribute cost. Do not divide chain-level spend by chainlet count; costs are not uniform across hops. | Pass / Fail · Ai Platform · medium |
| 03 | Operator changes the TranscribeOutput schema (adds a `language` field). Summarize chainlet still references the old shape. | Treat the input/output schemas as a shared contract. Bump both chainlets in lockstep, push them as a single chain deployment, and avoid in-place schema edits to a production chain. Use additive changes (optional field with default) when full lockstep is not feasible. | Pass / Fail · Ai Platform · critical |
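Tests 01 and 03 both turn on a shared, typed contract between hops. The sketch below illustrates the idea with plain pydantic only; it does not use the Chains API. The field names (`text`, `language`), the stand-in functions, and the `validate_hop` helper are hypothetical and chosen for illustration.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


# Hypothetical shared contract. In practice this would live in a package
# imported by both the transcribe and summarize chainlets.
class TranscribeOutput(BaseModel):
    text: str
    # Additive change: optional with a default, so payloads produced
    # before the field existed still validate.
    language: Optional[str] = None


def transcribe(audio: bytes) -> TranscribeOutput:
    """Stand-in for the transcribe hop; returns the declared model, not a dict."""
    return TranscribeOutput(text=f"{len(audio)} bytes transcribed", language="en")


def summarize(payload: TranscribeOutput) -> str:
    """Stand-in for the summarize hop; relies only on the typed contract."""
    return payload.text[:40]


def validate_hop(raw: dict) -> TranscribeOutput:
    """Mimics a boundary check: coerce a raw payload into the contract or fail."""
    return TranscribeOutput(**raw)


# A payload in the old shape (no `language`) still validates.
old_shape = validate_hop({"text": "hello"})

# A payload that does not match the contract fails at the boundary.
try:
    validate_hop({"transcript": "hello"})
    mismatch_detected = False
except ValidationError:
    mismatch_detected = True
```

Because `language` is optional with a default, a consumer written against the old shape keeps working while both chainlets are brought into lockstep. A renamed or removed required field, by contrast, fails validation immediately.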
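For test 02, per-chainlet attribution could be sketched as follows. The record layout, the chainlet names, and the per-GPU-second rates are invented for illustration; the real usage report format and pricing are not given here.

```python
from collections import defaultdict

# Hypothetical usage records for one billing window:
# (chainlet name, GPU-seconds, assumed price per GPU-second).
usage_records = [
    ("llm", 1200.0, 0.002),
    ("embedding", 300.0, 0.0005),
    ("llm", 800.0, 0.002),
]


def cost_by_chainlet(records):
    """Sum GPU-seconds times rate per chainlet name."""
    totals = defaultdict(float)
    for name, gpu_seconds, rate in records:
        totals[name] += gpu_seconds * rate
    return dict(totals)


def cost_shares(costs):
    """Convert per-chainlet cost into a fraction of total chain spend."""
    total = sum(costs.values())
    if total == 0:
        return {name: 0.0 for name in costs}
    return {name: cost / total for name, cost in costs.items()}


costs = cost_by_chainlet(usage_records)
shares = cost_shares(costs)
```

With these made-up numbers the LLM hop accounts for most of the spend. An even split across chainlets would report 50% each, which is why dividing chain-level spend by chainlet count is misleading.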
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
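One possible reading of the pass rule, as a minimal sketch: it assumes each rubric criterion receives a single numeric score and that the mean is taken across criteria. The scoring scale and the `passes` helper are assumptions, not part of the eval definition.

```python
from statistics import mean


def passes(scores, mean_threshold=4.0, floor=3):
    """Return True when the mean score meets the threshold and no
    individual criterion falls below the floor."""
    if not scores:
        return False
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Hypothetical scores keyed by rubric criterion.
strong = {"Baseten": 5, "Ai Platform": 4, "Chains": 4}
one_low = {"Baseten": 5, "Ai Platform": 5, "Chains": 2}   # mean is 4, but one score is below 3
low_mean = {"Baseten": 4, "Ai Platform": 4, "Chains": 3}  # no score below 3, but mean is under 4.0

results = [passes(strong), passes(one_low), passes(low_mean)]
```

The two conditions are checked together, so a high average cannot compensate for a single criterion scoring below the floor.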
Rubric criteria
- Baseten
- Ai Platform
- Chains