Experiments
LangSmith evals — Experiments (relift v3)
About LangSmith
LangSmith is LangChain's LLM observability and evaluation platform. It provides tracing, datasets, evaluators (LLM-as-judge, code, and human), experiments, prompt management, and online monitoring, which AI teams use to measure and improve LLM apps in production.
Employees: ~200
Industry: LLM Observability
Headquarters: San Francisco, CA
Website: www.langchain.com/langsmith
Sample tests · showing 3 of 8
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | PR gate calls evaluate() with target @traceable function and two evaluators. | Use langsmith.evaluate with max_concurrency tuned; sync driver acceptable for 200 rows; capture experiment name in CI artifact; fail build on regression threshold. | Pass / Fail · AI Platform · high |
| 02 | 1000-row eval on remote target; notebook needs progress polling. | Use async evaluate patterns or run evaluate in subprocess/job; poll experiment status via client; document experiment id for later compare-experiment-results. | Pass / Fail · AI Platform · medium |
| 03 | Prompt v3 and v4 commits pinned; need side-by-side evaluate runs. | Pull each with pull_prompt('name:commit_hash'); run evaluate() per variant with shared dataset; compare in Experiments UI using consistent evaluator set. | Pass / Fail · AI Platform · high |
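
As a hedged illustration of test 01, the sketch below shows one way a PR gate might be wired. The `evaluate` call and its `data`, `evaluators`, `experiment_prefix`, and `max_concurrency` parameters follow the LangSmith Python SDK as commonly documented. The helper names `run_experiment`, `regressions`, and `gate_exit_code`, the `max_drop` threshold, and the shape of the score dictionaries (evaluator key mapped to a list of per-row scores) are assumptions made for this example and are not part of the eval definition.

```python
from statistics import mean


def run_experiment(target, dataset_name, evaluators, prefix, max_concurrency=4):
    """Run one experiment and return its name so CI can store it as an artifact.

    Needs the `langsmith` package and an API key in the environment, so the
    import is kept local to this function.
    """
    from langsmith import evaluate

    results = evaluate(
        target,
        data=dataset_name,
        evaluators=evaluators,
        experiment_prefix=prefix,
        max_concurrency=max_concurrency,
    )
    return results.experiment_name


def regressions(baseline, candidate, max_drop=0.02):
    """Return evaluator keys whose mean score fell by more than `max_drop`.

    `baseline` and `candidate` are hypothetical dicts of evaluator key to a
    list of per-row scores. A key missing from `candidate` maps to None.
    """
    failed = {}
    for key, base_scores in baseline.items():
        if not base_scores:
            continue  # nothing to compare against
        cand_scores = candidate.get(key)
        if not cand_scores:
            failed[key] = None
            continue
        drop = mean(base_scores) - mean(cand_scores)
        if drop > max_drop:
            failed[key] = round(drop, 4)
    return failed


def gate_exit_code(baseline, candidate, max_drop=0.02):
    """Return 1 (fail the build) if any evaluator regressed, else 0."""
    return 1 if regressions(baseline, candidate, max_drop) else 0
```

In a CI job, the value returned by `run_experiment` could be written to a build artifact, and `gate_exit_code` passed to `sys.exit` once scores for both experiments have been collected. How those scores are collected is left out here on purpose.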
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Penalize failure_modes.
Rubric criteria
- LangSmith
- AI Platform
- Experiments
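
The grading rule stated above (grade against `expected.ideal_behavior` and `expected.rubric`, penalize `failure_modes`) could be sketched roughly as follows. Only those three field names come from this page. The `EvalCase` and `Expected` classes, the `grade` function, the fraction-of-rubric scoring, and the `penalty` value are all assumptions chosen to make the idea concrete, not the platform's actual grader.

```python
from dataclasses import dataclass, field


@dataclass
class Expected:
    ideal_behavior: str
    rubric: list = field(default_factory=list)


@dataclass
class EvalCase:
    expected: Expected
    failure_modes: list = field(default_factory=list)


def grade(case, criteria_met, failures_seen, penalty=0.25):
    """Score = fraction of rubric criteria met, minus a penalty per failure mode.

    `criteria_met` and `failures_seen` are sets produced by some upstream
    judge (human or LLM); this function only combines them. The result is
    floored at 0.0.
    """
    rubric = case.expected.rubric
    hits = sum(1 for criterion in rubric if criterion in criteria_met)
    base = hits / len(rubric) if rubric else 0.0
    penalized = sum(1 for mode in case.failure_modes if mode in failures_seen)
    return max(0.0, base - penalty * penalized)
```

A pass/fail check like the one in the sample table could then be derived by comparing the returned score to a threshold configured for the workspace.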
Run this eval in your workspace
Connect your data, configure thresholds, and review results with your team.