Eval Library
For LangSmith · AI Platform

Experiments And Comparisons

LLM Observability & Evaluation Platform — LangSmith (LangChain)

LangSmith evals — Experiments & Comparisons (relift v3 InfraRed)

About LangSmith

LangSmith is LangChain's LLM observability and evaluation platform: tracing, datasets, evaluators (LLM-as-judge, code, and human), experiments, prompt management, and online monitoring used by AI teams to measure and improve LLM apps in production.

Employees

~200

Industry

LLM Observability

Headquarters

San Francisco, CA

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

Operator runs evaluate(target=app_fn, data='qa_dataset', evaluators=[helpfulness]) over 500 examples.

evaluate() fans the target out over each example, runs the evaluators on each result, and writes an Experiment in LangSmith. Provide experiment_prefix and a client to control project routing. Set max_concurrency to bound parallelism. Persist the returned experiment_id for downstream comparisons.

Pass / Fail · AI Platform · high
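
The fan-out described in this row can be pictured with a small stdlib-only stand-in. This is a sketch under stated assumptions, not LangSmith's implementation and not its API: `run_experiment_sketch`, the example dictionaries, and the evaluator signature are all hypothetical, and nothing here is written to LangSmith. It only illustrates how a target is applied to each example, how evaluators score each result, and how max_concurrency caps the worker pool.

```python
from concurrent.futures import ThreadPoolExecutor


def run_experiment_sketch(target, examples, evaluators, max_concurrency=4):
    """Hypothetical stand-in: apply target to every example, then score each result."""

    def run_one(example):
        outputs = target(example["inputs"])
        # Each evaluator is assumed to take (outputs, reference) and return a number.
        scores = {ev.__name__: ev(outputs, example.get("reference")) for ev in evaluators}
        return {"example_id": example["id"], "outputs": outputs, "scores": scores}

    # max_workers caps how many examples are processed at the same time.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(run_one, examples))  # map() preserves input order


def app_fn(inputs):
    # Toy target standing in for the application under test.
    return {"answer": inputs["question"].upper()}


def helpfulness(outputs, reference):
    # Toy evaluator: any non-empty answer scores 1.0.
    return 1.0 if outputs["answer"] else 0.0


examples = [{"id": i, "inputs": {"question": f"q{i}"}, "reference": None} for i in range(5)]
rows = run_experiment_sketch(app_fn, examples, [helpfulness], max_concurrency=2)
```

In a real run the per-example rows would live in the LangSmith Experiment rather than in a local list; the sketch keeps them in memory so the shape of the result is easy to inspect.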
02

Operator needs concurrent target execution for an IO-bound async target calling multiple downstream APIs.

Use aevaluate() (the async variant) with an async def target, and set max_concurrency to cap the number of in-flight runs. Do NOT run sync evaluate() over an async target: synchronous wrapping serializes the runs and loses the IO win.

Pass / Fail · AI Platform · medium
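
Why a concurrency cap matters for an IO-bound async target can be shown with plain asyncio. The sketch below is an assumption-laden illustration, not how aevaluate() is implemented: `run_bounded`, `app_fn`, and the `stats` counter are hypothetical names, and the sleep stands in for downstream API calls. It shows a semaphore limiting how many target calls are in flight at once while the rest wait their turn.

```python
import asyncio

# Hypothetical bookkeeping so the peak number of in-flight calls can be observed.
stats = {"in_flight": 0, "peak": 0}


async def app_fn(inputs):
    # Toy async target; the sleep stands in for IO-bound downstream API calls.
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return {"answer": inputs["question"][::-1]}


async def run_bounded(target, inputs_list, max_concurrency):
    # The semaphore admits at most max_concurrency target calls at a time.
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(inputs):
        async with sem:
            return await target(inputs)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(i) for i in inputs_list))


inputs_list = [{"question": f"q{i}"} for i in range(10)]
results = asyncio.run(run_bounded(app_fn, inputs_list, max_concurrency=3))
```

Because the target awaits instead of blocking, several calls overlap up to the cap; a synchronous wrapper around the same coroutine would run them one after another.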
03

Operator wants to compare a candidate prompt against a stable baseline experiment_id from last week.

Run the candidate via evaluate() with the same dataset name and the same evaluator set as the baseline (and the same dataset version if reproducibility matters). Open the LangSmith Experiments view → Compare and pick baseline_id vs candidate_id to see per-example deltas. Save the comparison view URL in the PR…

Pass / Fail · AI Platform · high
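
The per-example deltas mentioned in test 03 amount to a join on example ID followed by a subtraction. The following is a minimal local sketch, assuming each experiment's scores have already been exported as a mapping from example ID to one numeric score; `per_example_deltas` and the score values are hypothetical and illustrative only, and the Compare view in LangSmith remains the source of truth.

```python
def per_example_deltas(baseline, candidate):
    """Return candidate minus baseline for every example present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {example_id: candidate[example_id] - baseline[example_id] for example_id in shared}


# Illustrative scores only; not real experiment output.
baseline_scores = {"ex1": 0.5, "ex2": 1.0, "ex3": 0.25}
candidate_scores = {"ex1": 0.75, "ex2": 0.5, "ex3": 0.25, "ex4": 1.0}

deltas = per_example_deltas(baseline_scores, candidate_scores)
# Examples where the candidate scored lower than the baseline.
regressions = [example_id for example_id, delta in deltas.items() if delta < 0]
```

Examples that appear in only one run (here `ex4`) are skipped, which is one reason to keep the dataset name and version identical across the two experiments.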

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
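
One reading of the pass rule above, sketched in code: the mean across criterion scores must be at least 4.0 and no single criterion may score below 3. The function name `passes_rubric`, the criterion names, and the scores are hypothetical, and the actual grader may aggregate differently.

```python
from statistics import mean


def passes_rubric(scores, min_mean=4.0, floor=3):
    """Pass when the mean criterion score is >= min_mean and none is below floor."""
    values = list(scores.values())
    if not values:
        return False  # nothing was graded, so treat it as a fail
    return mean(values) >= min_mean and min(values) >= floor


# Illustrative scores only.
ok = passes_rubric({"criterion_a": 5, "criterion_b": 4, "criterion_c": 3})
low_floor = passes_rubric({"criterion_a": 5, "criterion_b": 5, "criterion_c": 2})
low_mean = passes_rubric({"criterion_a": 4, "criterion_b": 4, "criterion_c": 3})
```

The second case fails even though its mean is 4.0, because one criterion falls below the floor of 3; the third fails on the mean alone.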

Rubric criteria

  • LangSmith
  • AI Platform
  • Experiments And Comparisons

Recommended for

LangSmith customers
