Experiments And Comparisons
LLM Observability & Evaluation Platform — LangSmith (LangChain)
LangSmith evals — Experiments & Comparisons (relift v3 InfraRed)
About LangSmith
LangSmith is LangChain's LLM observability and evaluation platform: tracing, datasets, evaluators (LLM-as-judge, code, and human), experiments, prompt management, and online monitoring used by AI teams to measure and improve LLM apps in production.
Employees
~200
Industry
LLM Observability
Headquarters
San Francisco, CA
Website
www.langchain.com/langsmith
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Operator runs evaluate(target=app_fn, data='qa_dataset', evaluators=[helpfulness]) over 500 examples. | evaluate() fans the target out over each example, runs the evaluators on each result, and writes an Experiment to LangSmith. Provide experiment_prefix and a client to control project routing. Set max_concurrency to bound parallelism. Persist the returned experiment_id for downstream comparisons. | Pass / Fail · AI Platform · high |
| 02 | Operator needs concurrent target execution for an IO-bound async target calling multiple downstream APIs. | Use aevaluate() (the async variant) with an async def target. Combine it with max_concurrency to bound the number of concurrent runs. Do NOT run sync evaluate() over an async target; synchronous wrapping serializes the runs and loses the IO win. | Pass / Fail · AI Platform · medium |
| 03 | Operator wants to compare a candidate prompt against a stable baseline experiment_id from last week. | Run the candidate via evaluate() with the same dataset name and the same evaluator set as the baseline (and the same dataset version if reproducibility matters). Open the LangSmith Experiments view → Compare and pick baseline_id vs candidate_id to see per-example deltas. Save the comparison view URL in the PR… | Pass / Fail · AI Platform · high |
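
The three sample tests can be sketched together in Python. This is a minimal sketch, not the platform's reference implementation. It assumes the `langsmith` SDK is installed, that `LANGSMITH_API_KEY` is set, and that a dataset named `qa_dataset` already exists. The names `app_fn`, `async_app_fn`, `exact_match`, and the prefixes `baseline` and `candidate` are hypothetical, and the simple code evaluator stands in for the `helpfulness` evaluator in the table. The sketch returns `experiment_name` from the results object rather than the `experiment_id` the table mentions, so check the SDK reference for the exact field you want to persist.

```python
import asyncio


def app_fn(inputs: dict) -> dict:
    """Hypothetical sync target: takes one example's inputs, returns outputs."""
    return {"answer": inputs["question"].strip().lower()}


async def async_app_fn(inputs: dict) -> dict:
    """Hypothetical async target standing in for IO-bound downstream calls."""
    await asyncio.sleep(0)  # placeholder for real network IO
    return app_fn(inputs)


def exact_match(run, example) -> dict:
    """Code evaluator in the (run, example) style: 1 if the answers match."""
    matched = run.outputs["answer"] == example.outputs["answer"]
    return {"key": "exact_match", "score": int(matched)}


def run_sync_experiment(dataset_name: str = "qa_dataset") -> str:
    """Test 01: sync evaluate() with a prefix and bounded parallelism."""
    # Imported lazily so the helpers above work without the SDK installed.
    from langsmith import Client
    from langsmith.evaluation import evaluate

    results = evaluate(
        app_fn,                        # target, passed positionally
        data=dataset_name,
        evaluators=[exact_match],
        experiment_prefix="baseline",  # assumed prefix, used to find the run later
        max_concurrency=4,             # assumed bound on parallel runs
        client=Client(),               # reads credentials from the environment
    )
    return results.experiment_name


async def run_async_experiment(dataset_name: str = "qa_dataset") -> str:
    """Tests 02 and 03: aevaluate() over the same dataset and evaluators."""
    from langsmith.evaluation import aevaluate

    results = await aevaluate(
        async_app_fn,                  # async def target keeps the IO concurrent
        data=dataset_name,             # same dataset as the baseline
        evaluators=[exact_match],      # same evaluator set as the baseline
        experiment_prefix="candidate",
        max_concurrency=8,
    )
    return results.experiment_name
```

Calling `run_sync_experiment()` and then `asyncio.run(run_async_experiment())` should produce two experiments on the same dataset, which can then be selected side by side in the Experiments view for comparison.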
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
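
One way to read the pass rule is sketched below. It assumes each rubric criterion receives a numeric score (a 1 to 5 scale is a guess, the page does not state one) and that the mean is taken across criteria. The function name `passes_rubric` and the criterion keys are hypothetical.

```python
from statistics import mean


def passes_rubric(
    scores: dict[str, float],
    mean_threshold: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Return True when the mean score meets the threshold and no criterion is below the floor."""
    if not scores:
        return False  # nothing graded, so nothing can pass
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Example: mean is exactly 4.0 and the lowest criterion is 3, so this passes.
ok = passes_rubric({"langsmith": 5, "ai_platform": 4, "experiments_and_comparisons": 3})
```

A score set such as 5, 5, 2 has the same mean but fails, because one criterion falls below the floor of 3.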
Rubric criteria
- LangSmith
- AI Platform
- Experiments And Comparisons