Truss And Model Packaging

Baseten

AI Model Serving — Baseten

Evaluates Baseten's Truss & Model Packaging across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's AI Model Serving eval coverage.

About Baseten

Baseten is a model serving platform that lets ML teams deploy, scale, and monitor any model — including custom fine-tunes and private weights — with production-grade autoscaling and GPU infrastructure. It supports both synchronous and asynchronous inference patterns.

Employees

~100

Industry

Model Serving

Headquarters

San Francisco, CA

Website

baseten.co

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check	Category	Severity
01	Operator scaffolds a new Truss with `truss init` and edits config.yaml to declare the model. They omit model_metadata.example_model_input.	config.yaml must declare model_metadata (including example_model_input for the in-product playground), python_version, requirements (pinned), and resources.accelerator. Missing example_model_input causes the playground to render without a usable form. Validate config.yaml with `truss config validat…	Pass / Fail	AI Platform	high
02	Operator deploys a 13B-parameter LLM with resources.accelerator: A10G (24GB).	Choose an accelerator whose VRAM fits the model weights + KV cache headroom: 13B FP16 needs ~26GB just for weights; A10G overflows. Select A100 (40/80GB) or H100. Verify VRAM headroom for max_seq_len, not just steady-state. [REQUIRES-VERIFICATION] for current accelerator SKUs and per-class VRAM.	Pass / Fail	AI Platform	critical
03	Model.py uses cv2 (opencv-python) which requires libgl1 on the OS. Truss declares opencv-python in requirements but no system_packages.	Declare native OS dependencies under system_packages (e.g., libgl1, ffmpeg, libsndfile1). requirements: only covers Python wheels; native libs must be installed at the image layer. Without system_packages, deploy may build but fail at first import on the replica.	Pass / Fail	AI Platform	high
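The three sample cases above describe checks that could run before a deploy. The sketch below is one possible pre-deploy lint over a config.yaml that has already been parsed into a plain dict. It is not part of Truss or Baseten: `lint_config`, `weights_gb`, and the `ASSUMED_VRAM_GB` table are hypothetical names, the VRAM figures are illustrative and need checking against current SKUs, and treating `==` as the marker of a pinned requirement is a simplifying assumption.

```python
# Hypothetical pre-deploy lint for a parsed Truss config.yaml (a plain dict).
# Nothing here is a Truss or Baseten API; every name is illustrative.

# Assumed per-accelerator VRAM in GB. A100 is listed at its smaller 40GB variant
# to stay conservative. Verify against current accelerator SKUs before relying on it.
ASSUMED_VRAM_GB = {"A10G": 24, "A100": 40, "H100": 80}

REQUIRED_KEYS = ("python_version", "requirements", "resources")


def weights_gb(params_billions, bytes_per_param=2):
    """Approximate weight footprint in GB (FP16 uses 2 bytes per parameter)."""
    return params_billions * bytes_per_param


def lint_config(config, params_billions=None, native_libs=(), kv_headroom_gb=0.0):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []

    # Case 01: required top-level keys and the playground example input.
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"{key} is missing")
    metadata = config.get("model_metadata") or {}
    if "example_model_input" not in metadata:
        problems.append("model_metadata.example_model_input is missing")
    for requirement in config.get("requirements") or []:
        if "==" not in requirement:  # assumption: '==' means pinned
            problems.append(f"requirement is not pinned: {requirement}")

    # Case 02: weights plus KV cache headroom must fit in accelerator VRAM.
    accelerator = (config.get("resources") or {}).get("accelerator")
    if accelerator is None:
        problems.append("resources.accelerator is missing")
    elif params_billions is not None:
        capacity = ASSUMED_VRAM_GB.get(accelerator)
        needed = weights_gb(params_billions) + kv_headroom_gb
        if capacity is not None and needed > capacity:
            problems.append(
                f"{accelerator} ({capacity}GB) cannot fit ~{needed:g}GB"
            )

    # Case 03: native libraries must be declared under system_packages.
    declared = set(config.get("system_packages") or [])
    for lib in native_libs:
        if lib not in declared:
            problems.append(f"system_packages is missing {lib}")

    return problems
```

Calling `lint_config(config, params_billions=13, native_libs=["libgl1"])` on a config that requests A10G would flag the accelerator, because `weights_gb(13)` is 26 and the assumed A10G capacity is 24. Passing a realistic `kv_headroom_gb` for the intended max_seq_len tightens the check further.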

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
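One reading of this rule is that a test case passes when the mean of its per-criterion scores is at least 4.0 and no single criterion scores below 3. The sketch below implements that reading only. The function name `case_passes`, the criterion names, and the assumption that scores are numeric on a shared scale are illustrative and do not come from Corsac's documentation.

```python
from statistics import mean


def case_passes(scores, mean_threshold=4.0, floor=3):
    """Apply the pass rule to a mapping of criterion name -> numeric score.

    Assumed interpretation: the mean across criteria must reach
    mean_threshold, and every individual criterion must reach floor.
    """
    if not scores:
        return False  # nothing was graded, so nothing can pass
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Hypothetical criterion names, for illustration only.
example = {"correctness": 5, "completeness": 4, "safety": 3}
passed = case_passes(example)
```

The two conditions are independent: a case with scores of 5, 5, and 2 has a mean of 4.0 but still fails on the floor, while a case with 4, 4, and 3 clears the floor but fails on the mean.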

Rubric criteria

Baseten
AI Platform
Truss And Model Packaging

Recommended for

Baseten customers

Works with

Baseten

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the Truss And Model Packaging eval for Baseten test?

Evaluates Baseten's Truss & Model Packaging across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's AI Model Serving eval coverage.

How is the Truss And Model Packaging eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Truss And Model Packaging pack for Baseten contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Truss And Model Packaging pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
