Datasets And Examples
LLM Observability & Evaluation Platform — LangSmith (LangChain)
LangSmith evals — Datasets & Examples (relift v3 InfraRed)
About LangSmith
LangSmith is LangChain's LLM observability and evaluation platform. It provides tracing, datasets, evaluators (LLM-as-judge, code, and human), experiments, prompt management, and online monitoring, and is used by AI teams to measure and improve LLM apps in production.
Employees
~200
Industry
LLM Observability
Headquarters
San Francisco, CA
Website
www.langchain.com/langsmith
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Operator uploads 5,000 examples to a new dataset via client.create_examples in a single call. | Batch upload via create_examples(dataset_id=..., inputs=[...], outputs=[...]) with parallel iterables of equal length. If the SDK / REST imposes a per-call size cap, page in chunks (e.g., 1000) and persist progress so a mid-upload failure can resume by example index. Verify final dataset row count … | Pass / Fail · AI Platform · high |
| 02 | Operator wants to run an experiment only on examples tagged with metadata.difficulty='hard' from a 50k-row dataset. | Pass metadata={'difficulty':'hard'} (or the documented structured list_examples filter shape) to filter at fetch time. The API filters server-side; client-side filtering after a full fetch is an anti-pattern. Combine with splits=['test'] to scope further. | Pass / Fail · AI Platform · medium |
| 03 | Operator picks 200 representative production runs (from list_runs) and wants to turn them into a regression dataset. | For each selected run, build an example with inputs from run.inputs and outputs from a curated gold (e.g., human-edited run.outputs). Record source_run_id in example metadata for audit. Do NOT use raw run.outputs as gold without review, since that captures the LLM's own output as truth. | Pass / Fail · AI Platform · critical |
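
The expected behaviors for tests 01 and 03 can be sketched in plain Python. This is a minimal illustration, not LangSmith's implementation: `upload_in_chunks`, `example_from_run`, the progress-file layout, and the `upload_batch` callable are hypothetical names introduced here, and the chunk size of 1000 is the table's example value rather than a documented limit.

```python
import json
from pathlib import Path
from typing import Any, Callable, Optional, Sequence

# Hypothetical uploader type: receives one chunk of inputs and outputs.
UploadBatch = Callable[[Sequence[dict], Sequence[dict]], None]


def upload_in_chunks(
    inputs: Sequence[dict],
    outputs: Sequence[dict],
    upload_batch: UploadBatch,
    progress_file: Path,
    chunk_size: int = 1000,  # assumed value; check the real per-call cap
) -> int:
    """Upload parallel inputs/outputs in chunks, resuming from a saved index.

    Returns the index of the next example to upload (len(inputs) when done).
    """
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have equal length")

    # Resume from the last persisted position, if any.
    done = 0
    if progress_file.exists():
        done = json.loads(progress_file.read_text())["next_index"]

    for start in range(done, len(inputs), chunk_size):
        end = min(start + chunk_size, len(inputs))
        upload_batch(inputs[start:end], outputs[start:end])
        # Persist progress only after the chunk succeeded.
        progress_file.write_text(json.dumps({"next_index": end}))
        done = end
    return done


def example_from_run(
    run_id: Any, run_inputs: dict, gold_outputs: Optional[dict]
) -> dict:
    """Build an example dict from a run, requiring reviewed gold outputs."""
    if gold_outputs is None:
        raise ValueError("gold outputs must be curated before use")
    return {
        "inputs": dict(run_inputs),
        "outputs": dict(gold_outputs),
        "metadata": {"source_run_id": str(run_id)},  # audit trail
    }


# Assumed wiring with the SDK call quoted in the table:
#   upload_batch = lambda i, o: client.create_examples(
#       dataset_id=dataset_id, inputs=list(i), outputs=list(o))
```

Progress is written only after a chunk succeeds, so a retry restarts at the failed chunk. If a chunk can partially succeed on the server, a retry could create duplicates, which is one reason to verify the final dataset row count.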
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. To pass, the mean criterion score must be >= 4.0 and no single criterion may score below 3.
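
One reading of this pass rule, written as a small check. The sketch assumes each rubric criterion receives a single numeric score and that the mean is taken across criteria; the function name `passes` and the example scores are hypothetical, and the scoring scale is not stated on this page.

```python
from statistics import mean
from typing import Mapping


def passes(
    scores: Mapping[str, float],
    mean_threshold: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Return True if the mean score meets the threshold and no score is below the floor."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Illustrative scores only, keyed by the rubric criteria names.
example_scores = {
    "LangSmith": 5,
    "AI Platform": 4,
    "Datasets And Examples": 3,
}
result = passes(example_scores)
```

Under this reading, a single criterion scored below 3 fails the eval even when the mean is 4.0 or higher.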
Rubric criteria
- LangSmith
- AI Platform
- Datasets And Examples