Eval Library
For LangSmith · AI Platform

Datasets

LLM observability and evaluation — LangSmith

LangSmith evals — Datasets (relift v3)

About LangSmith

LangSmith is LangChain's LLM observability and evaluation platform: tracing, datasets, evaluators (LLM-as-judge, code, and human), experiments, prompt management, and online monitoring used by AI teams to measure and improve LLM apps in production.

Employees

~200

Industry

LLM Observability

Headquarters

San Francisco, CA

Sample tests · showing 3 of 8

# · Input · Expected behavior · Check
01

A CI pipeline emits JSONL with inputs/outputs and needs a programmatic bulk append.

Use langsmith.Client.create_examples with a list of dicts (inputs, outputs, metadata), or upload_dataframe / upload_csv, per the manage-datasets-programmatically docs.

Pass / Fail · AI Platform · high
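The bulk append in test 01 could be sketched roughly as follows. This is a minimal sketch under assumptions, not the documented method: the helper names jsonl_to_examples and append_examples, the batch size, and the JSONL field names are hypothetical, and it assumes a recent SDK version in which create_examples accepts dataset_name= and an examples= list of dicts (older versions take parallel inputs=/outputs= lists instead).

```python
import json
from typing import Any, Iterable


def jsonl_to_examples(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Turn JSONL records into example dicts with inputs/outputs/metadata keys."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the CI output
        record = json.loads(line)
        examples.append(
            {
                "inputs": record["inputs"],  # required; raises KeyError if missing
                "outputs": record.get("outputs"),
                "metadata": record.get("metadata", {}),
            }
        )
    return examples


def append_examples(
    client: Any,
    dataset_name: str,
    examples: list[dict[str, Any]],
    batch_size: int = 500,  # hypothetical batch size, chosen for illustration
) -> int:
    """Append examples in batches; `client` is assumed to be a langsmith.Client."""
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        # Assumption: this keyword form is supported by the installed SDK version.
        client.create_examples(dataset_name=dataset_name, examples=batch)
    return len(examples)
```

In a CI job this might be called as append_examples(Client(), "my-dataset", jsonl_to_examples(open("ci_output.jsonl"))), where the dataset and file names are placeholders.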
02

A team maintains a single dataset and needs split discipline for evaluate() reporting.

Pass split='train' or split=['train','test'] per example in create_examples/update_examples; filter evaluate() or downstream analysis by split.

Pass / Fail · AI Platform · medium
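One way to keep split assignment stable for test 02 is to derive it from the example inputs, so re-uploads never move an example between splits. The sketch below is illustrative only: assign_split, examples_in_split, the hashing scheme, and the 20% test fraction are hypothetical conventions, not part of the SDK. It assumes the example dict's split field is accepted by create_examples and that list_examples accepts a splits= filter.

```python
import hashlib
import json
from typing import Any


def assign_split(example: dict[str, Any], test_fraction: float = 0.2) -> dict[str, Any]:
    """Return a copy of the example with a deterministic 'split' field."""
    # Hash the canonical JSON of the inputs so key order does not matter.
    key = json.dumps(example["inputs"], sort_keys=True).encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    split = "test" if bucket < round(test_fraction * 100) else "train"
    return {**example, "split": split}


def examples_in_split(client: Any, dataset_name: str, split: str) -> list[Any]:
    """Fetch only one split; `client` is assumed to be a langsmith.Client."""
    return list(client.list_examples(dataset_name=dataset_name, splits=[split]))
```

The list returned by examples_in_split can then be passed as the data argument of evaluate(), so each experiment reports on exactly one split.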
03

A multilingual dataset mixes locales; the evaluator should run only on the German subset.

Use client.list_examples with a metadata filter (or a structured filter, per the SDK) to select the subset before calling evaluate, rather than downloading the entire dataset and filtering client-side.

Pass / Fail · AI Platform · medium
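The server-side filtering in test 03 might look like the sketch below. It assumes examples carry a metadata key named locale with the value "de" for German (the key name and value are hypothetical), that list_examples accepts a metadata= dict, and that the installed SDK exposes Client.evaluate with data=, evaluators=, and experiment_prefix= parameters. The helper names are illustrative.

```python
from typing import Any, Callable, Sequence


def examples_for_locale(client: Any, dataset_name: str, locale: str = "de") -> list[Any]:
    """Ask the server for only the examples whose metadata matches the locale."""
    # Assumption: each example was uploaded with metadata={"locale": ...}.
    return list(client.list_examples(dataset_name=dataset_name, metadata={"locale": locale}))


def evaluate_locale(
    client: Any,
    target: Callable[[dict], dict],
    evaluators: Sequence[Callable],
    dataset_name: str,
    locale: str = "de",
) -> Any:
    """Run the evaluators on one locale subset; `client` is assumed to be a langsmith.Client."""
    subset = examples_for_locale(client, dataset_name, locale)
    return client.evaluate(
        target,
        data=subset,
        evaluators=list(evaluators),
        experiment_prefix=f"{dataset_name}-{locale}",  # hypothetical naming scheme
    )
```

Filtering in list_examples keeps the non-German examples off the wire entirely, which is the behavior the check looks for.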

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Penalize failure_modes.

Rubric criteria

  • LangSmith
  • AI Platform
  • Datasets

Recommended for

LangSmith customers
