Training And Finetuning

AI Model Serving — Baseten

Evaluates Baseten's Training & Fine-tuning across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's AI Model Serving eval coverage.

About Baseten

Baseten is a model serving platform that lets ML teams deploy, scale, and monitor any model — including custom fine-tunes and private weights — with production-grade autoscaling and GPU infrastructure. It supports both synchronous and asynchronous inference patterns.

Employees

~100

Industry

Model Serving

Headquarters

San Francisco, CA

Website

baseten.co

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Operator POSTs to /v1/training/jobs with base_model, dataset_uri (s3://...), GPU class (H100), and hyperparameters. Response carries job_id.	Persist job_id immediately, together with a dataset hash and a hyperparameter snapshot, for reproducibility. Status transitions PENDING → RUNNING → SUCCEEDED | FAILED | CANCELED. On SUCCEEDED, fetch the checkpoint URI from the job record and validate that it deserializes before deploying.	Pass / Fail · AI Platform · high
02	Dataset is 12 GB in operator's S3 bucket. Operator passes dataset_uri=s3://my-bucket/data.jsonl in the job spec.	Baseten reads dataset_uri using the workspace's configured S3 credentials. Verify that the workspace has IAM access to the bucket BEFORE submitting the job; otherwise the job will start, fail at dataset load, and waste queue time. Use workspace-attached storage when cross-account IAM setup is a source of friction.	Pass / Fail · AI Platform · critical
03	Training succeeds; checkpoint URI points at s3://baseten-workspace-bucket/checkpoints/<job_id>/. Operator wants to deploy.	Register the checkpoint as a Truss model_metadata.weights_uri reference and push a Truss that loads from it. Validate load() against a smoke-test input in /development/predict before promoting. Persist the (job_id → deployed model version) mapping for lineage.	Pass / Fail · AI Platform · high
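
The job lifecycle described in case 01 can be sketched as a small client-side record. This is an illustrative sketch, not Baseten's API: `JobRecord`, `new_job_record`, and the `ALLOWED` table are hypothetical names, the dataset hash is assumed to be computed elsewhere and passed in, and allowing PENDING to move directly to FAILED or CANCELED is an assumption the table does not state.

```python
import json
from dataclasses import dataclass, field

# Legal status transitions, following case 01. Terminal states have no
# outgoing edges. PENDING -> FAILED/CANCELED is an assumption.
ALLOWED = {
    "PENDING": {"RUNNING", "FAILED", "CANCELED"},
    "RUNNING": {"SUCCEEDED", "FAILED", "CANCELED"},
    "SUCCEEDED": set(),
    "FAILED": set(),
    "CANCELED": set(),
}


@dataclass
class JobRecord:
    """Hypothetical local record persisted as soon as job_id is returned."""
    job_id: str
    dataset_sha256: str          # computed by the caller, not here
    hyperparameters_json: str    # frozen snapshot, immune to later edits
    status: str = "PENDING"
    history: list = field(default_factory=lambda: ["PENDING"])

    def advance(self, new_status: str) -> None:
        """Apply a status update, rejecting transitions the table forbids."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
        self.history.append(new_status)


def new_job_record(job_id: str, dataset_sha256: str, hyperparameters: dict) -> JobRecord:
    """Freeze the hyperparameters as canonical JSON at submission time."""
    frozen = json.dumps(hyperparameters, sort_keys=True)
    return JobRecord(job_id, dataset_sha256, frozen)
```

Storing the snapshot as a JSON string rather than a live dict means that later mutation of the caller's hyperparameter dict cannot silently change what was recorded for the job.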

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
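
One reading of this pass rule, sketched in Python: the mean of the per-criterion scores must be at least 4.0, and no single criterion may score below 3. The function name, the dict-of-scores input shape, and the criterion names in the test data are assumptions for illustration; the page does not specify how Corsac represents judge scores.

```python
from statistics import mean


def case_passes(scores: dict, mean_threshold: float = 4.0, floor: float = 3.0) -> bool:
    """Return True when the mean clears the threshold and no score is below the floor.

    `scores` maps a criterion name to the judge's numeric score (assumed shape).
    An empty score set is treated as a failure rather than a vacuous pass.
    """
    if not scores:
        return False
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor
```

Under this reading, a case scoring 5, 5, and 2 fails even though its mean is exactly 4.0, because one criterion falls below the floor.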

Recommended for

Baseten customers

Works with

Baseten

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the Training And Finetuning eval for Baseten test?

Evaluates Baseten's Training & Fine-tuning across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's AI Model Serving eval coverage.

How is the Training And Finetuning eval scored?

The LLM judge grades each case against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Training And Finetuning pack for Baseten contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Training And Finetuning pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
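
A CI gate over the per-case results might look like the sketch below. The Corsac REST API is not documented on this page, so the sketch starts from a results payload already in hand; the JSON shape (`{"cases": [{"id": ..., "passed": ...}]}`), the function name, and the `min_pass_rate` parameter are all hypothetical.

```python
import json


def gate_exit_code(results_json: str, min_pass_rate: float = 1.0) -> int:
    """Map eval results to a process exit code: 0 lets the release through, 1 blocks it.

    The payload shape is an assumption, not a documented Corsac format.
    """
    cases = json.loads(results_json)["cases"]
    if not cases:
        # No results is treated as a failed gate, not a silent pass.
        return 1
    passed = sum(1 for case in cases if case["passed"])
    return 0 if passed / len(cases) >= min_pass_rate else 1
```

A CI step could pass the returned value to `sys.exit` so that a failing pack stops the pipeline.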
