Datasets
LangSmith evals — Datasets (relift v3)
About LangSmith
LangSmith is LangChain's LLM observability and evaluation platform: tracing, datasets, evaluators (LLM-as-judge, code, and human), experiments, prompt management, and online monitoring used by AI teams to measure and improve LLM apps in production.
Employees
~200
Industry
LLM Observability
Headquarters
San Francisco, CA
Website
www.langchain.com/langsmith

Sample tests · showing 3 of 8
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | A CI pipeline emits JSONL with inputs/outputs and needs a programmatic bulk append. | Use langsmith.Client.create_examples with a list of dicts (inputs, outputs, metadata), or upload_dataframe/upload_csv, per the manage-datasets-programmatically docs. | Pass / Fail · AI Platform · high |
| 02 | A team maintains a single dataset and relies on split discipline for evaluate() reporting. | Assign split='train' or split=['train','test'] to each example via create_examples/update_examples; filter evaluate() runs or analysis by split metadata. | Pass / Fail · AI Platform · medium |
| 03 | A multilingual dataset mixes locales; the evaluator should run only on the German subset. | Use client.list_examples with a metadata filter (structured filter per SDK) to select the subset before calling evaluate; do not rely on downloading the entire dataset and filtering client-side. | Pass / Fail · AI Platform · medium |
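
The three sample cases above can be sketched together in Python. This is a minimal sketch under stated assumptions, not LangSmith's documented recipe: the JSONL record shape, the `locale` metadata key, and the helper names are hypothetical, and `client` is assumed to be a `langsmith.Client`. It uses the keyword-list form of `create_examples` (`inputs`, `outputs`, `metadata`, `splits`); newer SDK versions may prefer a single list of example dicts, so check the call against the SDK reference for your installed version.

```python
import json
from typing import Any, Iterable


def jsonl_to_columns(lines: Iterable[str], default_split: str = "train") -> dict[str, list[Any]]:
    """Parse JSONL records into parallel lists for a bulk create_examples call.

    Assumed (hypothetical) record shape:
      {"inputs": {...}, "outputs": {...}, "metadata": {...}, "split": "train"}
    Only "inputs" is required; missing splits fall back to default_split.
    """
    cols: dict[str, list[Any]] = {"inputs": [], "outputs": [], "metadata": [], "splits": []}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in CI output
        rec = json.loads(line)
        cols["inputs"].append(rec["inputs"])
        cols["outputs"].append(rec.get("outputs"))
        cols["metadata"].append(rec.get("metadata") or {})
        cols["splits"].append(rec.get("split", default_split))
    return cols


def append_examples(client: Any, dataset_id: str, lines: Iterable[str]) -> int:
    """Bulk-append parsed records in one call; returns the number appended.

    `client` is assumed to be a langsmith.Client.
    """
    cols = jsonl_to_columns(lines)
    if cols["inputs"]:
        client.create_examples(dataset_id=dataset_id, **cols)
    return len(cols["inputs"])


def german_test_examples(client: Any, dataset_name: str) -> list[Any]:
    """Fetch only the German test-split examples, filtering on the server.

    The "locale" metadata key is an assumption about how examples were tagged.
    """
    return list(
        client.list_examples(
            dataset_name=dataset_name,
            splits=["test"],
            metadata={"locale": "de"},
        )
    )
```

With a real client, `append_examples` would receive an open JSONL file handle as `lines`, and the list returned by `german_test_examples` could be passed as the `data` argument to `evaluate()`, so that only the selected subset is scored.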
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Penalize failure_modes.
Rubric criteria
- LangSmith
- AI Platform
- Datasets