Evaluators
LLM Observability & Evaluation Platform — LangSmith (LangChain)
LangSmith evals — Evaluators (relift v3 InfraRed)
About LangSmith
LangSmith is LangChain's LLM observability and evaluation platform: tracing, datasets, evaluators (LLM-as-judge, code, and human), experiments, prompt management, and online monitoring used by AI teams to measure and improve LLM apps in production.
Employees
~200
Industry
LLM Observability
Headquarters
San Francisco, CA
Website
www.langchain.com/langsmith
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Operator writes an LLM-judge evaluator that asks Claude/GPT to score 'helpfulness' on a 1-5 scale. The evaluator returns a free-text response. | Constrain the judge to a structured output (e.g., {'score': int, 'comment': str}) via tool/function calling on the judge model. The function decorated with @run_evaluator should return an EvaluationResult with key='helpfulness' and a numeric score. Free-text-only output cannot be aggregated in the Experiments UI. | Pass / Fail · AI Platform · high |
| 02 | Operator compares two prompt variants A vs B over the same dataset and wants a per-example winner instead of independent scores. | Use evaluate_comparative (or the documented pairwise API): pass both experiment IDs and a comparator function that returns {'key':'preferred','scores':{exp_a:0|1, exp_b:0|1}}. Pairwise comparison is expected to give better-calibrated judge verdicts than scoring each variant separately on an absolute scale. Verify that ties are handled deterministically (e.g., 0.5 each). | Pass / Fail · AI Platform · medium |
| 03 | Operator wants a deterministic evaluator that returns 1.0 when the agent's JSON output passes a JSON-schema check, else 0.0. | Implement a plain Python function decorated as a code evaluator, returning {'key':'schema_ok','score':1.0|0.0,'comment':...}. Code evaluators run in-process and do not require an LLM. Bind it to the dataset via evaluate(evaluators=[schema_ok_fn]). Pure functions allow reproducible regression on the same d… | Pass / Fail · AI Platform · high |
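
For test 01, a minimal sketch of the step that turns a judge's structured reply into an aggregatable result. It uses only the standard library and assumes the judge's tool/function-call arguments arrive as a JSON string; the function name `to_helpfulness_result` is hypothetical. In LangSmith the returned dict would be wrapped in an EvaluationResult as the table describes, which is not shown here.

```python
import json


def to_helpfulness_result(judge_payload: str) -> dict:
    """Convert the judge's structured reply into a result dict.

    Assumption: `judge_payload` holds the JSON arguments of the judge
    model's tool call, e.g. '{"score": 4, "comment": "clear answer"}'.
    """
    try:
        data = json.loads(judge_payload)
    except json.JSONDecodeError as exc:
        # Free-text replies end up here: they carry no numeric score.
        raise ValueError("judge reply is not structured JSON") from exc

    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")

    return {
        "key": "helpfulness",
        "score": score,
        "comment": str(data.get("comment", "")),
    }
```

Rejecting malformed replies outright, instead of guessing a score from prose, keeps every recorded value numeric and comparable across examples.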
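
For test 02, a sketch of the comparator's return shape with deterministic tie handling. It is plain Python and does not call LangSmith; the function name `prefer` and the `verdict` values ("A", "B", "tie") are assumptions made for illustration, and the real comparator signature expected by evaluate_comparative should be taken from the LangSmith documentation.

```python
def prefer(exp_a: str, exp_b: str, verdict: str) -> dict:
    """Map a judge verdict onto per-experiment scores.

    Assumption: `verdict` is "A", "B", or "tie", already extracted from
    the judge model's structured output.
    """
    if exp_a == exp_b:
        raise ValueError("experiment ids must differ")

    if verdict == "A":
        scores = {exp_a: 1, exp_b: 0}
    elif verdict == "B":
        scores = {exp_a: 0, exp_b: 1}
    elif verdict == "tie":
        # Deterministic tie handling: split the point evenly.
        scores = {exp_a: 0.5, exp_b: 0.5}
    else:
        raise ValueError(f"unknown verdict: {verdict!r}")

    return {"key": "preferred", "scores": scores}
```

Because the scores for each example always sum to 1, averaging them over the dataset gives each variant's share of wins directly.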
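
For test 03, a sketch of a deterministic code evaluator. To stay dependency-free it checks only required fields and their types, which is a stand-in for a full JSON Schema validator; the field names in `REQUIRED_FIELDS` and the choice to pass the agent output as a raw string are assumptions. Binding the function with evaluate(evaluators=[...]) follows the table and is not shown.

```python
import json

# Hypothetical schema subset: required field name -> expected Python type.
REQUIRED_FIELDS = {"action": str, "arguments": dict}


def _result(score: float, comment: str) -> dict:
    return {"key": "schema_ok", "score": score, "comment": comment}


def schema_ok(agent_output: str) -> dict:
    """Return score 1.0 if the output parses and has the required fields."""
    try:
        data = json.loads(agent_output)
    except json.JSONDecodeError:
        return _result(0.0, "invalid JSON")

    if not isinstance(data, dict):
        return _result(0.0, "top-level value is not an object")

    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            return _result(0.0, f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            return _result(0.0, f"wrong type for field: {name}")

    return _result(1.0, "output matches schema")
```

The function has no side effects and no model calls, so the same input always yields the same score.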
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. A test passes when the mean score across rubric criteria is at least 4.0 and no single criterion scores below 3.
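
One reading of the pass rule above, sketched in Python: the mean across criterion scores must reach 4.0 and no individual criterion may fall below 3. The scoring scale is not stated on this page, and the function name `passes` and the example criterion names are hypothetical.

```python
from statistics import mean


def passes(scores: dict[str, float], min_mean: float = 4.0, floor: float = 3.0) -> bool:
    """Apply the mean threshold and the per-criterion floor together."""
    values = list(scores.values())
    if not values:
        raise ValueError("at least one criterion score is required")
    return mean(values) >= min_mean and min(values) >= floor


# Mean is 4.0 and the lowest score is 3, so both conditions hold.
print(passes({"accuracy": 5, "structure": 4, "tie_handling": 3}))  # True
# Mean is 4.0 but one criterion scores 2, so the floor fails.
print(passes({"accuracy": 5, "structure": 5, "tie_handling": 2}))  # False
```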
Rubric criteria
- LangSmith
- AI Platform
- Evaluators