Glider and LLM-as-Judge

AI Evaluation, Guardrails & Monitoring — Patronus AI

Patronus AI evals — Glider & LLM-as-Judge (relift v3 InfraRed)

About Patronus AI

Patronus AI is an evaluation, guardrails, and monitoring platform for LLM and GenAI applications. It provides automated hallucination detection (the Lynx model), LLM-as-judge evaluation (the Glider model), and built-in scorers for PII, toxicity, safety, answer relevance, and context faithfulness, plus Experiments, datasets, custom evaluators, and production logging and monitoring.

Employees: ~50 [REQUIRES-VERIFICATION]
Industry: AI Evaluation & Guardrails
Headquarters: San Francisco, CA [REQUIRES-VERIFICATION]
Website: www.patronus.ai

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Operator configures an LLM-judge (Glider-style) evaluator with the single instruction 'rate if the answer is good' on a 1-5 scale.	Give the judge a concrete, criterion-anchored rubric: define what each score means, what counts as pass/fail, and which dimensions matter (relevance, completeness, tone). A vague 'is it good' prompt yields noisy, uncalibrated, non-reproducible scores. Anchor scores to observable properties of the o…	Pass / Fail · AI Platform · high
02	The judge returns score=2 with an explanation that actually describes a strong answer. The operator stores the score and ignores the contradictory reasoning.	Require the judge to emit reasoning that supports its score and check for score/explanation coherence: a verdict whose rationale contradicts the number is unreliable and should be flagged or re-run. Use the explanation for human audit and for debugging miscalibrated rubrics, not as decoration.	Pass / Fail · AI Platform · medium
03	Operator deploys a custom Glider-style judge to production gating without ever measuring how often it agrees with human reviewers.	Before trusting a judge for gating, measure its agreement with a human-labeled gold set (accuracy / correlation / kappa) and re-measure periodically. Only promote the judge to a gate when agreement clears a stated bar for the use case; otherwise keep a human in the loop. Track agreement as the judg…	Pass / Fail · AI Platform · critical
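The agreement check described in test 03 can be sketched in a few lines. This is a minimal illustration in plain Python, not Patronus AI's implementation: the function names `cohens_kappa` and `ready_to_gate`, the sample labels, and the `min_kappa=0.6` bar are all assumptions chosen for the example, and the right bar depends on the use case.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Observed agreement: fraction of items where judge and human match.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def ready_to_gate(judge_labels, human_labels, min_kappa=0.6):
    """True when judge/human agreement clears the stated bar.

    min_kappa is a hypothetical threshold, not a Patronus AI default.
    """
    return cohens_kappa(judge_labels, human_labels) >= min_kappa


# Hypothetical gold set: human verdicts vs. judge verdicts on the same items.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

kappa = cohens_kappa(judge, human)
print(f"kappa={kappa:.2f}, promote to gate: {ready_to_gate(judge, human)}")
```

Kappa is used here rather than raw accuracy because it discounts agreement that would occur by chance, which matters when one label dominates the gold set. Re-running the same check on a fresh human-labeled sample is one way to do the periodic re-measurement the test calls for.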

How this eval is graded

Grade each response against expected.ideal_behavior and expected.rubric. A response passes when its mean score across the rubric criteria is at least 4.0 and no single criterion scores below 3.
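The pass rule above can be expressed as a small function. This is a sketch under one stated assumption: the mean is taken across the per-criterion scores of a single response. The function name `passes` and the criterion names (`relevance`, `completeness`, `tone`) are hypothetical and used only for illustration.

```python
from statistics import mean


def passes(criterion_scores, min_mean=4.0, min_floor=3):
    """Apply the grading rule: mean >= min_mean and no criterion below min_floor.

    criterion_scores maps a criterion name to its numeric score.
    """
    if not criterion_scores:
        raise ValueError("at least one criterion score is required")
    scores = list(criterion_scores.values())
    return mean(scores) >= min_mean and min(scores) >= min_floor


# Hypothetical criterion names and scores.
print(passes({"relevance": 5, "completeness": 4, "tone": 3}))  # mean 4.0, floor met
print(passes({"relevance": 5, "completeness": 5, "tone": 2}))  # mean 4.0, floor missed
print(passes({"relevance": 4, "completeness": 4, "tone": 3}))  # mean below 4.0
```

The floor check is what keeps a single weak criterion from being averaged away by strong scores elsewhere, as the second call shows.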

Rubric criteria

Patronus AI
AI Platform
Glider and LLM-as-Judge

Recommended for

Patronus AI, Patronus AI customers

Works with

Patronus AI

Related evals

AI Platform

Claude API