Eval Library
For LangSmith · AI Platform

Experiments

LLM observability and evaluation — LangSmith

LangSmith evals — Experiments (relift v3)

About LangSmith

LangSmith is LangChain's LLM observability and evaluation platform: tracing, datasets, evaluators (LLM-as-judge, code, and human), experiments, prompt management, and online monitoring used by AI teams to measure and improve LLM apps in production.

Employees: ~200

Industry: LLM Observability

Headquarters: San Francisco, CA

Sample tests · showing 3 of 8

Test 01
Input: PR gate calls evaluate() with a target @traceable function and two evaluators.
Expected behavior: Use langsmith.evaluate with a tuned max_concurrency; a synchronous driver is acceptable for 200 rows; capture the experiment name in a CI artifact; fail the build when scores regress past the threshold.
Check: Pass / Fail · AI Platform · high

Test 02
Input: 1000-row eval on a remote target; the notebook needs progress polling.
Expected behavior: Use async evaluate patterns or run evaluate in a subprocess/job; poll experiment status via the client; record the experiment ID for a later compare-experiment-results step.
Check: Pass / Fail · AI Platform · medium

Test 03
Input: Prompt v3 and v4 commits are pinned; side-by-side evaluate runs are needed.
Expected behavior: Pull each with pull_prompt('name:commit_hash'); run evaluate() once per variant against a shared dataset; compare in the Experiments UI using the same evaluator set for both.
Check: Pass / Fail · AI Platform · high
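For test 01, the PR gate could look roughly like the sketch below. It is an illustration under stated assumptions, not LangSmith's prescribed CI recipe: the helper names (mean_scores, regressions, run_pr_gate), the "pr-gate" prefix, the artifact path, the baseline dictionary, and the 0.02 threshold are all hypothetical. The row layout read from the results (evaluation_results, then results, each with key and score) reflects how the Python SDK's experiment results are commonly iterated and should be checked against the installed SDK version.

```python
import json
from statistics import mean


def mean_scores(rows):
    """Average each evaluator's score across rows.

    `rows` is an iterable of {evaluator_key: score} dicts; None scores are skipped.
    """
    buckets = {}
    for row in rows:
        for key, score in row.items():
            if score is not None:
                buckets.setdefault(key, []).append(float(score))
    return {key: mean(values) for key, values in buckets.items()}


def regressions(current, baseline, max_drop=0.02):
    """Return evaluator keys whose mean fell more than `max_drop` below baseline.

    An evaluator missing from `current` is treated as a score of 0.0.
    """
    return sorted(
        key
        for key, base in baseline.items()
        if current.get(key, 0.0) < base - max_drop
    )


def run_pr_gate(target, evaluators, dataset_name, baseline,
                artifact_path="experiment.json"):
    """Run the experiment, write a CI artifact, and return a process exit code."""
    # Imported lazily so the pure helpers above work without the SDK installed.
    from langsmith import evaluate

    results = evaluate(
        target,                       # e.g. a @traceable function
        data=dataset_name,
        evaluators=evaluators,
        experiment_prefix="pr-gate",  # hypothetical prefix
        max_concurrency=4,            # assumption: tune to the target's rate limits
    )

    # Assumed row layout: each row carries its evaluator results.
    rows = [
        {r.key: r.score for r in item["evaluation_results"]["results"]}
        for item in results
    ]
    current = mean_scores(rows)

    # Record the experiment name so the CI run can link back to it.
    with open(artifact_path, "w", encoding="utf-8") as fh:
        json.dump({"experiment": results.experiment_name, "scores": current}, fh)

    return 1 if regressions(current, baseline) else 0
```

In a CI script, the return value of run_pr_gate can be passed to sys.exit so that any regression past the threshold fails the build.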
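For test 02, one possible shape is an async run plus a generic polling loop, sketched below. The listing does not say which client call reports experiment status, so poll_until takes a caller-supplied check function; that function, the "remote-1000" prefix, and the concurrency value are assumptions made for illustration. aevaluate is the async counterpart of evaluate in the LangSmith Python SDK and expects an async target.

```python
import time


def poll_until(check, interval_s=5.0, timeout_s=3600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call `check()` until it returns a non-None value or the timeout elapses.

    `check` is a hypothetical caller-supplied function, for example one that
    asks the LangSmith client about an experiment and returns a summary once
    the run has finished, or None while it is still in progress.
    """
    deadline = clock() + timeout_s
    while True:
        value = check()
        if value is not None:
            return value
        if clock() >= deadline:
            raise TimeoutError("experiment did not finish in time")
        sleep(interval_s)


async def run_large_eval(target, dataset_name, evaluators):
    """Run a large experiment without blocking the notebook's event loop."""
    # Imported lazily so poll_until is usable without the SDK installed.
    from langsmith import aevaluate

    results = await aevaluate(
        target,                           # async callable wrapping the remote target
        data=dataset_name,
        evaluators=evaluators,
        experiment_prefix="remote-1000",  # hypothetical prefix
        max_concurrency=8,                # assumption: tune to the remote target
    )
    # Keep the experiment name so the run can be compared later.
    return results.experiment_name
```

In a notebook, run_large_eval can be awaited directly in a cell; in a script or job it can be driven with asyncio.run. poll_until suits the subprocess/job variant, where a separate process runs the experiment and the notebook only checks on it.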
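For test 03, the per-variant runs might be sketched as follows. The helper names (prompt_ref, run_variants), the make_target factory, the labels, and the prompt_commit metadata key are hypothetical; only the 'name:commit_hash' identifier format comes from the expected behavior above. The point of the sketch is that both variants share one dataset and one evaluator list, so the resulting experiments are directly comparable.

```python
def prompt_ref(name, commit_hash):
    """Build the 'name:commit_hash' identifier used to pin a prompt version."""
    if not name or not commit_hash:
        raise ValueError("both a prompt name and a commit hash are required")
    return f"{name}:{commit_hash}"


def run_variants(name, commits, make_target, dataset_name, evaluators):
    """Run one experiment per pinned prompt commit and return experiment names.

    `commits` maps a label such as "v3" to a commit hash. `make_target` is a
    hypothetical factory that turns a pulled prompt into the callable under test.
    """
    # Imported lazily so prompt_ref is usable without the SDK installed.
    from langsmith import Client, evaluate

    client = Client()
    experiments = {}
    for label, commit in commits.items():
        prompt = client.pull_prompt(prompt_ref(name, commit))
        results = evaluate(
            make_target(prompt),
            data=dataset_name,                   # shared dataset for every variant
            evaluators=evaluators,               # same evaluator set for every variant
            experiment_prefix=f"{name}-{label}",
            metadata={"prompt_commit": commit},  # hypothetical metadata key
            client=client,
        )
        experiments[label] = results.experiment_name
    return experiments
```

The returned mapping of labels to experiment names gives the two entries to select for side-by-side comparison in the Experiments UI.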

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Penalize failure_modes.

Rubric criteria

  • LangSmith
  • AI Platform
  • Experiments

Recommended for

LangSmith customers
