Datasets And Examples

LLM Observability & Evaluation Platform — LangSmith (LangChain)

Evaluates LangSmith's Datasets & Examples across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's LLM Observability & Evaluation Platform eval coverage.

About LangSmith

LangSmith is LangChain's LLM observability and evaluation platform: tracing, datasets, evaluators (LLM-as-judge, code, and human), experiments, prompt management, and online monitoring used by AI teams to measure and improve LLM apps in production.

Employees: ~200
Industry: LLM Observability
Headquarters: San Francisco, CA
Website: www.langchain.com/langsmith

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Operator uploads 5,000 examples to a new dataset via client.create_examples in a single call.	Batch upload via create_examples(dataset_id=..., inputs=[...], outputs=[...]) with parallel iterables of equal length. If the SDK / REST imposes a per-call size cap, page in chunks (e.g., 1000) and persist progress so a mid-upload failure can resume by example index. Verify final dataset row count …	Pass / Fail · AI Platform · high
02	Operator wants 80/10/10 train/test/validation splits on a 10,000-row dataset so experiments can target a specific split.	Set 'split' on each example at create time (or via update_examples) with values like ['train'], ['test'], ['validation']; splits are arrays, so a single example can carry multiple labels. Filter via evaluate(..., splits=['test']) in subsequent experiments. Deterministic split assignment (seeded has…	Pass / Fail · AI Platform · high
03	An example's expected output is corrected after a labeling error is found. An experiment run two weeks ago referenced the prior version.	Edit via client.update_example or update_examples; LangSmith versions examples, and the prior experiment retains its reference to the version that ran. New experiments default to the latest version. Document any historical version pin used for compliance reproductions.	Pass / Fail · AI Platform · high
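
The chunk-and-resume pattern that case 01 expects can be sketched roughly as below. This is a minimal illustration, not LangSmith's own implementation: `upload_in_chunks`, `upload`, and `save_progress` are hypothetical names, and the 1000-example chunk size is the illustrative figure from the table, not a documented limit. In a real run, `upload` would presumably wrap a call such as client.create_examples(dataset_id=..., inputs=..., outputs=...).

```python
from typing import Callable, Sequence


def upload_in_chunks(
    inputs: Sequence[dict],
    outputs: Sequence[dict],
    upload: Callable[[Sequence[dict], Sequence[dict]], None],
    chunk_size: int = 1000,  # illustrative value, not a documented cap
    start_index: int = 0,  # resume point recorded by a previous run
    save_progress: Callable[[int], None] = lambda next_index: None,
) -> int:
    """Upload parallel inputs/outputs in chunks; return the next index to upload."""
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have equal length")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    index = start_index
    while index < len(inputs):
        end = min(index + chunk_size, len(inputs))
        # `upload` is a hypothetical callable standing in for the SDK call.
        upload(inputs[index:end], outputs[index:end])
        index = end
        # Persist only after the chunk succeeded, so a crash resumes here.
        save_progress(index)
    return index
```

Resuming by index assumes a failed chunk left nothing behind. If a chunk can partially succeed, the final row-count check in the expected behavior is what would surface duplicates.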
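
For case 02, one way to make split assignment deterministic is to hash a stable per-example key together with a seed and bucket the result. The sketch below assumes that approach. `assign_split`, the `seed` string, and the choice of SHA-256 are illustrative assumptions, not part of the LangSmith SDK. It returns a list because the table notes that splits are arrays.

```python
import hashlib


def assign_split(example_key: str, seed: str = "v1") -> list[str]:
    """Map a stable example key to ['train'], ['test'], or ['validation'].

    The same key and seed always give the same split, so re-running the
    assignment never reshuffles examples between splits.
    """
    digest = hashlib.sha256(f"{seed}:{example_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # integer in 0..99
    if bucket < 80:
        return ["train"]
    if bucket < 90:
        return ["test"]
    return ["validation"]


# Hypothetical usage: one split label list per example, in upload order.
splits = [assign_split(f"example-{i}") for i in range(10_000)]
```

Hash bucketing yields roughly, not exactly, 80/10/10. If exact counts matter, a seeded shuffle followed by slicing is an alternative, at the cost of needing the full key list up front.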

How this eval is graded

Responses are graded against expected.ideal_behavior and expected.rubric. A pass requires a mean criterion score >= 4.0 and no criterion below 3.
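
The pass rule can be read as a two-part check, sketched below. The function name `passes` and the criterion names in the usage lines are hypothetical, and the sketch assumes the mean is taken across one case's criterion scores. The page does not state the scoring scale.

```python
from statistics import mean


def passes(
    criterion_scores: dict[str, float],
    min_mean: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Return True when the mean score is >= min_mean and no score is below floor."""
    if not criterion_scores:
        raise ValueError("at least one criterion score is required")
    values = list(criterion_scores.values())
    return mean(values) >= min_mean and min(values) >= floor


# Hypothetical criterion names, for illustration only.
print(passes({"correctness": 5, "completeness": 4, "safety": 3}))  # mean 4.0, min 3
print(passes({"correctness": 5, "completeness": 5, "safety": 2}))  # mean 4.0, min 2
```

The first call prints True and the second prints False: a strong average does not compensate for a single criterion that falls below the floor.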

Rubric criteria

LangSmith
AI Platform
Datasets And Examples

Recommended for

LangSmith customers

Works with

LangSmith

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the Datasets And Examples eval for LangSmith test?

How is the Datasets And Examples eval scored?

The judge grades against expected.ideal_behavior and expected.rubric. A pass requires a mean criterion score >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Datasets And Examples pack for LangSmith contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Datasets And Examples pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
