For Patronus AI · AI Platform

Glider and LLM-as-Judge

AI Evaluation, Guardrails & Monitoring — Patronus AI

Patronus AI evals — Glider & LLM-as-Judge (relift v3 InfraRed)

About Patronus AI

Patronus AI is an evaluation, guardrails, and monitoring platform for LLM and GenAI applications. It provides automated hallucination detection (the Lynx model), LLM-as-judge evaluation (the Glider model), and built-in scorers for PII, toxicity, safety, answer relevance, and context faithfulness, plus Experiments, datasets, custom evaluators, and production logging and monitoring.

Employees

~50 [REQUIRES-VERIFICATION]

Industry

AI Evaluation & Guardrails

Headquarters

San Francisco, CA [REQUIRES-VERIFICATION]

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

Operator configures an LLM-judge (Glider-style) evaluator with the single instruction 'rate if the answer is good' on a 1-5 scale.

Give the judge a concrete, criterion-anchored rubric: define what each score means, what counts as pass/fail, and which dimensions matter (relevance, completeness, tone). A vague 'is it good' prompt yields noisy, uncalibrated, non-reproducible scores. Anchor scores to observable properties of the o…

Pass / Fail · AI Platform · high
02

The judge returns score=2 with an explanation that actually describes a strong answer. The operator stores the score and ignores the contradictory reasoning.

Require the judge to emit reasoning that supports its score and check for score/explanation coherence — a verdict whose rationale contradicts the number is unreliable and should be flagged or re-run. Use the explanation for human audit and for debugging miscalibrated rubrics, not as decoration.

Pass / Fail · AI Platform · medium
03

Operator deploys a custom Glider-style judge to production gating without ever measuring how often it agrees with human reviewers.

Before trusting a judge for gating, measure its agreement with a human-labeled gold set (accuracy / correlation / kappa) and re-measure periodically. Only promote the judge to a gate when agreement clears a stated bar for the use case; otherwise keep a human in the loop. Track agreement as the judg…

Pass / Fail · AI Platform · critical
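
As an illustration of the agreement check described in sample test 03, the sketch below computes Cohen's kappa between judge labels and human gold labels and promotes the judge to a gate only when kappa clears a stated bar. The function names (cohen_kappa, ready_to_gate) and the 0.6 bar are assumptions made for this example; they are not part of the Patronus AI API or of this eval's configuration.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Cohen's kappa between two label sequences of equal length."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(judge)
    # Observed agreement: fraction of items where judge and human match.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    judge_freq = Counter(judge)
    human_freq = Counter(human)
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def ready_to_gate(judge: Sequence[str], human: Sequence[str], kappa_bar: float = 0.6) -> bool:
    """Hypothetical promotion rule: gate only if kappa clears the bar (0.6 is an assumed value)."""
    return cohen_kappa(judge, human) >= kappa_bar


judge_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "pass", "fail", "pass"]
print(cohen_kappa(judge_labels, human_labels))    # 0.5
print(ready_to_gate(judge_labels, human_labels))  # False
```

Kappa is used here rather than raw accuracy because it discounts agreement that would occur by chance, which matters when one label dominates the gold set. The right bar depends on the use case, and the measurement should be repeated whenever the rubric or the judge changes.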

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
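
One possible reading of the grading rule above is sketched here: a response passes when the mean of its criterion scores is at least 4.0 and no single criterion scores below 3. The function name and the criterion names (taken from the dimensions mentioned in sample test 01) are assumptions for illustration; the actual grader may aggregate differently.

```python
from typing import Mapping


def passes(scores: Mapping[str, float], mean_bar: float = 4.0, floor: float = 3.0) -> bool:
    """Pass when the mean score clears mean_bar and every criterion is at or above floor."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    mean = sum(values) / len(values)
    return mean >= mean_bar and min(values) >= floor


# Mean is exactly 4.0 and the lowest criterion is 3, so this passes.
print(passes({"relevance": 5, "completeness": 4, "tone": 3}))  # True
# Mean is 4.0, but one criterion is below the floor of 3.
print(passes({"relevance": 5, "completeness": 5, "tone": 2}))  # False
```

The floor keeps a single weak criterion from being hidden by strong scores elsewhere, which a mean-only threshold would allow.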

Rubric criteria

  • Patronus AI
  • AI Platform
  • Glider and LLM-as-Judge

Recommended for

Patronus AI · Patronus AI customers
