Custom Evaluators And Criteria
AI Evaluation, Guardrails & Monitoring — Patronus AI
Patronus AI evals — Custom Evaluators & Criteria (relift v3 InfraRed)
About Patronus AI
Patronus AI is an evaluation, guardrails, and monitoring platform for LLM and GenAI applications. It provides automated hallucination detection (the Lynx model), LLM-as-judge evaluation (the Glider model), and built-in scorers for PII, toxicity, safety, answer relevance, and context faithfulness, plus Experiments, datasets, custom evaluators, and production logging and monitoring.
Employees
~50 [REQUIRES-VERIFICATION]
Industry
AI Evaluation & Guardrails
Headquarters
San Francisco, CA [REQUIRES-VERIFICATION]
Website
www.patronus.ai
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Operator defines a custom evaluator with the criterion 'the answer should be professional' and nothing else. | Write custom criteria as concrete, checkable statements with observable anchors (e.g. 'no profanity; addresses the user's question; cites at least one provided source when one exists') rather than abstract adjectives. Specify pass/fail boundaries so two reviewers (human or model) would agree on the… | Pass / Fail · AI Platform · high |
| 02 | Operator writes a custom evaluator and immediately puts it in production gating without checking it against any labeled examples. | Validate a new custom evaluator on a labeled set with known-good and known-bad examples (including edge cases) before trusting it for gating. Confirm it passes the goods and fails the bads, measure agreement with human labels, and iterate the criteria where it disagrees. Promote to gating only afte… | Pass / Fail · AI Platform · critical |
| 03 | Operator writes a single custom evaluator that simultaneously judges factual accuracy, tone, formatting, and safety, returning one blended score. | Decompose distinct concerns into separate evaluators (accuracy, tone, format, safety) so each yields an actionable, independently-thresholded verdict. A single blended score hides which dimension failed and prevents per-concern gating policy. Compose them at the policy layer, not inside one rubric. | Pass / Fail · AI Platform · medium |
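The three sample tests above can be sketched together in plain Python: criteria written as concrete checks, each check validated against labeled examples, and verdicts kept separate until a policy layer combines them. This is a minimal illustration and not the Patronus AI SDK. `Sample`, `no_profanity`, `cites_source_when_available`, `agreement`, `gate`, and the placeholder word list are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types for illustration; not part of any Patronus AI SDK.
@dataclass(frozen=True)
class Sample:
    question: str
    answer: str
    sources: tuple[str, ...] = ()

Evaluator = Callable[[Sample], bool]  # True means pass

BANNED_WORDS = {"damn", "crap"}  # placeholder word list (assumption)

def no_profanity(s: Sample) -> bool:
    # Concrete, checkable criterion: no banned word appears in the answer.
    words = {w.strip(".,!?").lower() for w in s.answer.split()}
    return not (words & BANNED_WORDS)

def cites_source_when_available(s: Sample) -> bool:
    # Passes trivially when no source was provided.
    if not s.sources:
        return True
    return any(src in s.answer for src in s.sources)

def agreement(evaluator: Evaluator, labeled: list[tuple[Sample, bool]]) -> float:
    """Fraction of labeled examples where the evaluator matches the human label."""
    if not labeled:
        raise ValueError("need at least one labeled example")
    hits = sum(evaluator(s) == label for s, label in labeled)
    return hits / len(labeled)

def gate(s: Sample, policy: dict[str, Evaluator]) -> dict[str, bool]:
    """Run each evaluator separately so the failing dimension stays visible."""
    return {name: ev(s) for name, ev in policy.items()}

# Known-good, known-bad, and one edge case (no sources provided).
labeled = [
    (Sample("q", "See doc-1 for details.", ("doc-1",)), True),
    (Sample("q", "Trust me.", ("doc-1",)), False),
    (Sample("q", "No sources here.", ()), True),
]
citation_agreement = agreement(cites_source_when_available, labeled)

policy = {"profanity": no_profanity, "citation": cites_source_when_available}
verdicts = gate(Sample("q", "damn, see doc-1", ("doc-1",)), policy)
print(citation_agreement, verdicts)
```

Keeping `gate` as a mapping of named verdicts, rather than a single blended number, lets each concern carry its own threshold and makes it obvious which dimension failed. An evaluator would be promoted to gating only once its `agreement` on the labeled set is acceptable to the team.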
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
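One reading of this rule is that the mean across rubric criteria must be at least 4.0 and no individual criterion may score below 3. The sketch below assumes that reading, and assumes numeric scores on a scale where 3 and 4 are meaningful (for example 1 to 5). The function name `passes` and its parameters are hypothetical.

```python
from statistics import mean

def passes(scores: dict[str, float],
           mean_floor: float = 4.0,
           min_floor: float = 3.0) -> bool:
    """Return True when the mean meets mean_floor and no score is below min_floor."""
    if not scores:
        raise ValueError("no criterion scores")
    values = list(scores.values())
    return mean(values) >= mean_floor and min(values) >= min_floor

# Hypothetical criterion names, for illustration only.
print(passes({"accuracy": 5, "tone": 4, "format": 3}))  # mean 4.0, min 3
print(passes({"accuracy": 5, "tone": 5, "format": 2}))  # mean 4.0, min 2
```

The second floor matters because a high mean can mask one failing criterion: in the second call the mean still reaches 4.0, but the score of 2 fails the check.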
Rubric criteria
- Patronus AI
- AI Platform
- Custom Evaluators And Criteria