Glider and LLM-as-Judge
AI Evaluation, Guardrails & Monitoring — Patronus AI
Patronus AI evals — Glider & LLM-as-Judge (relift v3 InfraRed)
About Patronus AI
Patronus AI is an evaluation, guardrails, and monitoring platform for LLM and GenAI applications. It provides automated hallucination detection (the Lynx model), LLM-as-judge evaluation (the Glider model), and built-in scorers for PII, toxicity, safety, answer relevance, and context faithfulness, plus Experiments, datasets, custom evaluators, and production logging and monitoring.
Employees
~50 [REQUIRES-VERIFICATION]
Industry
AI Evaluation & Guardrails
Headquarters
San Francisco, CA [REQUIRES-VERIFICATION]
Website
www.patronus.ai
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Operator configures an LLM-judge (Glider-style) evaluator with the single instruction 'rate if the answer is good' on a 1-5 scale. | Give the judge a concrete, criterion-anchored rubric: define what each score means, what counts as pass/fail, and which dimensions matter (relevance, completeness, tone). A vague 'is it good' prompt yields noisy, uncalibrated, non-reproducible scores. Anchor scores to observable properties of the o… | Pass / Fail · AI Platform · high |
| 02 | The judge returns score=2 with an explanation that actually describes a strong answer. The operator stores the score and ignores the contradictory reasoning. | Require the judge to emit reasoning that supports its score and check for score/explanation coherence: a verdict whose rationale contradicts the number is unreliable and should be flagged or re-run. Use the explanation for human audit and for debugging miscalibrated rubrics, not as decoration. | Pass / Fail · AI Platform · medium |
| 03 | Operator deploys a custom Glider-style judge to production gating without ever measuring how often it agrees with human reviewers. | Before trusting a judge for gating, measure its agreement with a human-labeled gold set (accuracy / correlation / kappa) and re-measure periodically. Only promote the judge to a gate when agreement clears a stated bar for the use case; otherwise keep a human in the loop. Track agreement as the judg… | Pass / Fail · AI Platform · critical |
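
The agreement check described in test 03 can be sketched in a few lines. This is a minimal illustration, not Patronus AI's implementation: it computes Cohen's kappa between judge labels and human gold labels using only the standard library. The function names `cohens_kappa` and `ready_to_gate`, and the 0.6 threshold, are assumptions chosen for the example; the right bar depends on the use case.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Observed agreement: fraction of items where judge and human match.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch treats it as perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def ready_to_gate(judge_labels, human_labels, min_kappa=0.6):
    """Hypothetical promotion rule: gate only if agreement clears the bar.

    The 0.6 default is an assumed threshold, not a documented one.
    """
    return cohens_kappa(judge_labels, human_labels) >= min_kappa


if __name__ == "__main__":
    judge = ["pass", "pass", "fail", "fail"]
    human = ["pass", "pass", "fail", "pass"]
    print(cohens_kappa(judge, human))
    print(ready_to_gate(judge, human))
```

Kappa is used here instead of raw accuracy because it discounts agreement that would occur by chance, which matters when most gold labels fall in one class. Re-running the same check on a fresh gold sample at intervals covers the "re-measure periodically" part of the expected behavior.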
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
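
One way to read the pass rule above is sketched below. It assumes the mean is taken across the criteria of a single response and that the floor of 3 applies to each criterion individually; the page does not spell this out, so treat it as an interpretation. The function name `eval_passes` and the criterion names are hypothetical.

```python
def eval_passes(criterion_scores, min_mean=4.0, min_score=3):
    """Return True if the mean score clears min_mean and no criterion is below min_score.

    criterion_scores maps a criterion name to its score on the 1-5 scale.
    """
    scores = list(criterion_scores.values())
    if not scores:
        raise ValueError("at least one criterion score is required")
    mean = sum(scores) / len(scores)
    return mean >= min_mean and min(scores) >= min_score


if __name__ == "__main__":
    # Hypothetical criterion names, borrowed from the dimensions in test 01.
    print(eval_passes({"relevance": 5, "completeness": 4, "tone": 3}))
    print(eval_passes({"relevance": 5, "completeness": 5, "tone": 2}))
```

The second call shows why the floor matters: the mean is still 4.0, but a single criterion scoring 2 fails the response.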
Rubric criteria
- Patronus AI
- AI Platform
- Glider and LLM-as-Judge