For LangSmith · AI Platform

Datasets And Examples


LLM Observability & Evaluation Platform — LangSmith (LangChain)

LangSmith evals — Datasets & Examples (relift v3 InfraRed)

About LangSmith

LangSmith is LangChain's LLM observability and evaluation platform: tracing, datasets, evaluators (LLM-as-judge, code, and human), experiments, prompt management, and online monitoring used by AI teams to measure and improve LLM apps in production.

Employees

~200

Industry

LLM Observability

Headquarters

San Francisco, CA

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

Operator uploads 5,000 examples to a new dataset via client.create_examples in a single call.

Batch-upload via create_examples(dataset_id=..., inputs=[...], outputs=[...]) with parallel iterables of equal length. If the SDK or REST API imposes a per-call size cap, page in chunks (e.g., 1000) and persist progress so a mid-upload failure can resume by example index. Verify final dataset row count …

Pass / Fail · AI Platform · high
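The chunk-and-resume behavior expected in test 01 could be sketched roughly as follows. This is a minimal illustration, not LangSmith's own implementation: `upload_in_chunks`, `create_batch`, and the JSON progress file are hypothetical names introduced here, and the chunk size of 1000 is just the example figure from the test description.

```python
import json
from pathlib import Path
from typing import Any, Callable, Sequence


def upload_in_chunks(
    create_batch: Callable[..., Any],
    inputs: Sequence[dict],
    outputs: Sequence[dict],
    *,
    chunk_size: int = 1000,
    progress_path: Path = Path("upload_progress.json"),
) -> int:
    """Upload parallel inputs/outputs in chunks, resuming from saved progress.

    `create_batch` is a hypothetical callable that accepts `inputs=` and
    `outputs=` keyword arguments for one chunk. Returns the number of
    examples uploaded during this call.
    """
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have equal length")

    # Resume from the last persisted example index, if any.
    start = 0
    if progress_path.exists():
        start = json.loads(progress_path.read_text())["next_index"]

    for lo in range(start, len(inputs), chunk_size):
        hi = min(lo + chunk_size, len(inputs))
        create_batch(inputs=list(inputs[lo:hi]), outputs=list(outputs[lo:hi]))
        # Persist progress only after the chunk succeeded.
        progress_path.write_text(json.dumps({"next_index": hi}))

    return len(inputs) - start
```

With the LangSmith SDK, `create_batch` might be something like `functools.partial(client.create_examples, dataset_id=dataset.id)`, assuming the call shape quoted in the test. A chunk that fails partway on the server could still leave partial rows behind, so the final row-count check remains necessary.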
02

Operator wants to run an experiment only on examples tagged with metadata.difficulty='hard' from a 50k-row dataset.

Pass metadata={'difficulty':'hard'} (or the documented structured list_examples filter shape) to filter at fetch time. The API filters server-side; client-side filtering after a full fetch is an anti-pattern. Combine with splits=['test'] to scope further.

Pass / Fail · AI Platform · medium
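As a rough sketch of the fetch-time filtering test 02 expects, the helper below forwards the metadata and split filters to `list_examples` instead of filtering a full fetch locally. `fetch_hard_test_examples` is a hypothetical helper name, and the keyword arguments follow the shape quoted in the test description; check them against the SDK version in use.

```python
from typing import Any


def fetch_hard_test_examples(client: Any, dataset_name: str) -> list:
    """Return only 'hard' examples from the 'test' split.

    `client` is assumed to expose list_examples(dataset_name=..., metadata=...,
    splits=...), as described in the test above. The filters are sent with the
    request, so the 50k-row dataset is never pulled in full.
    """
    return list(
        client.list_examples(
            dataset_name=dataset_name,
            metadata={"difficulty": "hard"},  # filter applied at fetch time
            splits=["test"],                  # narrows the scope further
        )
    )
```

The returned list can then be passed as the data for an experiment run, which keeps the experiment scoped to the filtered subset.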
03

Operator picks 200 representative production runs (from list_runs) and wants to turn them into a regression dataset.

For each selected run, build an example with inputs from run.inputs and outputs from a curated gold answer (e.g., human-edited run.outputs). Record source_run_id in example metadata for audit. Do NOT use raw run.outputs as gold without review: that would treat the LLM's own output as ground truth.

Pass / Fail · AI Platform · critical
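The run-to-example conversion in test 03 might look like the sketch below. It assumes each run object exposes `id`, `inputs`, and `outputs` attributes, and that reviewers supply their curated answers in a mapping keyed by run ID. `build_regression_examples` and `reviewed_outputs` are hypothetical names used only for illustration.

```python
from typing import Any, Iterable, Mapping


def build_regression_examples(
    runs: Iterable[Any],
    reviewed_outputs: Mapping[str, Mapping[str, Any]],
) -> dict:
    """Build parallel inputs/outputs/metadata lists from reviewed runs.

    Runs without a reviewed gold output are skipped, so raw run.outputs
    are never promoted to ground truth.
    """
    inputs, outputs, metadata = [], [], []
    for run in runs:
        gold = reviewed_outputs.get(str(run.id))
        if gold is None:
            continue  # not yet reviewed: leave it out of the dataset
        inputs.append(dict(run.inputs))
        outputs.append(dict(gold))
        # Keep a pointer back to the originating run for audit.
        metadata.append({"source_run_id": str(run.id)})
    return {"inputs": inputs, "outputs": outputs, "metadata": metadata}
```

The three lists are index-aligned, so they can be handed to a batch upload call as parallel iterables. Skipping unreviewed runs is a deliberate choice here; an alternative would be to raise an error so that gaps in review are noticed immediately.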

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. A pass requires a mean score of at least 4.0 across the rubric criteria, with no single criterion scoring below 3.
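One reading of this pass rule, with a mean threshold plus a per-criterion floor, can be written as a small check. The function name `eval_passes` and the criterion keys in the usage note are illustrative assumptions; the page does not state the scoring scale.

```python
from statistics import mean
from typing import Mapping


def eval_passes(
    scores: Mapping[str, float],
    *,
    min_mean: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Return True if the mean score meets min_mean and no score is below floor."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= min_mean and min(values) >= floor
```

For example, scores of 5, 4, and 3 pass (mean 4.0, lowest 3), while 5, 5, and 2 fail on the floor even though the mean is still 4.0.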

Rubric criteria

  • LangSmith
  • AI Platform
  • Datasets And Examples

Recommended for

LangSmith customers
