Eval Library
For LangSmith · AI Platform

Evaluators


LLM Observability & Evaluation Platform — LangSmith (LangChain)

LangSmith evals — Evaluators (relift v3 InfraRed)

About LangSmith

LangSmith is LangChain's LLM observability and evaluation platform. It covers tracing, datasets, evaluators (LLM-as-judge, code, and human), experiments, prompt management, and online monitoring, which AI teams use to measure and improve LLM apps in production.

Employees

~200

Industry

LLM Observability

Headquarters

San Francisco, CA

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

Operator writes an LLM-judge evaluator that asks Claude/GPT to score 'helpfulness' on a 1-5 scale. The evaluator returns a free-text response.

Constrain the judge to a structured output (e.g., {'score': int, 'comment': str}) via tool/function-calling on the judge model. The function decorated with @run_evaluator should return an EvaluationResult with key='helpfulness' and a numeric score. A free-text-only response cannot be aggregated in the Experiments UI.

Pass / Fail · AI Platform · high
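
The structured-output approach in test 01 can be sketched with the standard library alone. `HELPFULNESS_TOOL` and `to_result` are hypothetical names, and the tool definition is a generic JSON Schema object; the exact wrapper that a given judge model's tool/function-calling API expects is an assumption left out here. The sketch covers only the step of turning the judge's tool-call arguments into the {key, score, comment} shape that would populate an EvaluationResult.

```python
import json

# Hypothetical tool definition: forces the judge to answer with a score and a comment.
HELPFULNESS_TOOL = {
    "name": "record_helpfulness",
    "description": "Record a helpfulness score from 1 to 5 with a short justification.",
    "parameters": {
        "type": "object",
        "properties": {
            "score": {"type": "integer", "minimum": 1, "maximum": 5},
            "comment": {"type": "string"},
        },
        "required": ["score", "comment"],
    },
}


def to_result(tool_arguments: str) -> dict:
    """Convert the judge's tool-call arguments (a JSON string) into an evaluator result."""
    payload = json.loads(tool_arguments)
    score = payload["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return {
        "key": "helpfulness",
        "score": score,
        "comment": str(payload.get("comment", "")),
    }
```

Rejecting out-of-range or non-integer scores at this boundary keeps malformed judge replies out of the aggregated metric instead of silently skewing it.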
02

Operator compares two prompt variants A vs B over the same dataset and wants a per-example winner instead of independent scores.

Use evaluate_comparative (or the documented pairwise API): pass both experiment ids and a comparator function that returns {'key':'preferred','scores':{exp_a:0|1, exp_b:0|1}}. Pairwise comparison tends to improve judge calibration relative to separate absolute scoring. Verify ties are handled deterministically (e.g., 0.5 each).

Pass / Fail · AI Platform · medium
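
The deterministic tie handling in test 02 might look like the sketch below. `preferred` is a hypothetical helper, and the verdict labels 'A', 'B', and 'tie' are assumptions about what the judge returns. Whether the keys of `scores` are experiment ids (as written above) or per-run ids depends on the pairwise API in use, so the ids are passed in as opaque strings.

```python
def preferred(id_a: str, id_b: str, judge_choice: str) -> dict:
    """Map a judge verdict ('A', 'B', or 'tie') to per-side scores."""
    if judge_choice == "A":
        scores = {id_a: 1, id_b: 0}
    elif judge_choice == "B":
        scores = {id_a: 0, id_b: 1}
    elif judge_choice == "tie":
        # Ties always split evenly, so repeated runs produce the same result.
        scores = {id_a: 0.5, id_b: 0.5}
    else:
        raise ValueError(f"unexpected judge verdict: {judge_choice!r}")
    return {"key": "preferred", "scores": scores}
```

Raising on an unrecognized verdict, rather than defaulting to a tie, surfaces judge-format drift early.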
03

Operator wants a deterministic evaluator that returns 1.0 when the agent's JSON output passes a JSON-schema check, else 0.0.

Implement a plain Python function decorated as a code evaluator, returning {'key':'schema_ok','score':1.0|0.0,'comment':...}. Code evaluators run in-process and do not require an LLM. Bind to dataset via evaluate(evaluators=[schema_ok_fn]). Pure functions allow reproducible regression on the same d…

Pass / Fail · AI Platform · high
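
A deterministic evaluator of the kind described in test 03 can be sketched as a pure function. To stay dependency-free, this version checks only required fields and their types; that is a stand-in for full JSON Schema validation, which a dedicated validator library would normally handle. `REQUIRED_FIELDS` and its field names are hypothetical, and the argument signature that evaluate() passes to an evaluator is not shown here, so the function takes the raw output string directly.

```python
import json

# Hypothetical expected shape: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def _result(score: float, comment: str) -> dict:
    return {"key": "schema_ok", "score": score, "comment": comment}


def schema_ok(raw_output: str) -> dict:
    """Return score 1.0 if the output parses as JSON and has the expected fields, else 0.0."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return _result(0.0, f"invalid JSON: {exc.msg}")
    if not isinstance(data, dict):
        return _result(0.0, "top-level value is not an object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            return _result(0.0, f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            return _result(0.0, f"wrong type for field: {name}")
    return _result(1.0, "ok")
```

Because the function has no side effects and calls no model, the same input always yields the same score.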

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
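
One reading of the threshold rule, as a minimal sketch: the mean is taken across the rubric criteria, and any single criterion below 3 fails the test regardless of the mean. That interpretation, the `passes` helper, and the criterion keys used in the example are all assumptions.

```python
from statistics import mean


def passes(criterion_scores: dict) -> bool:
    """True when the mean score is >= 4.0 and no criterion scores below 3."""
    values = list(criterion_scores.values())
    if not values:
        raise ValueError("at least one criterion score is required")
    return mean(values) >= 4.0 and min(values) >= 3


# Hypothetical scores for the three rubric criteria.
example = {"langsmith": 5, "ai_platform": 4, "evaluators": 3}
result = passes(example)
```

Under this reading, scores of 5, 5, and 2 fail even though their mean is 4.0, because one criterion falls below the floor.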

Rubric criteria

  • LangSmith
  • AI Platform
  • Evaluators

Recommended for

LangSmith customers

