Experiments And Datasets

AI Evaluation, Guardrails & Monitoring — Patronus AI

Patronus AI evals: Experiments & Datasets

About Patronus AI

Patronus AI is an evaluation, guardrails, and monitoring platform for LLM and GenAI applications. It provides automated hallucination detection (the Lynx model), LLM-as-judge evaluation (the Glider model), and built-in scorers for PII, toxicity, safety, answer relevance, and context faithfulness. It also offers Experiments, datasets, custom evaluators, and production logging and monitoring.

Employees

~50 [REQUIRES-VERIFICATION]

Industry

AI Evaluation & Guardrails

Headquarters

San Francisco, CA [REQUIRES-VERIFICATION]

Website

www.patronus.ai

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check	Category	Severity
01	Operator runs an 'experiment' comparing prompt v1 and v2 but does not record which model version, evaluator, or dataset snapshot each run used.	An experiment must capture the full triple: model/prompt version, evaluator (and its version), and the exact dataset snapshot, so that results are reproducible and comparisons are apples-to-apples. Vary one factor at a time; record the rest. A result you cannot reproduce is not a result.	Pass / Fail	AI Platform	high
02	Operator builds an evaluation dataset by copying raw production transcripts containing customer names, emails, and account numbers, then shares it broadly.	Redact or synthesize PII before a production sample becomes an evaluation dataset, especially one shared across a team or vendor. Apply the redaction policy at ingestion, track consent/retention for any real data, and prefer synthetic identities where fidelity allows. Treat the dataset as subject t…	Pass / Fail	AI Platform	critical
03	Operator upgrades the evaluator mid-quarter and then compares this quarter's average score to last quarter's, attributing the change to model improvements.	Hold the evaluator fixed when measuring model/prompt progress over time; if the evaluator must change, re-score the historical baseline with the new evaluator before comparing. Never attribute an evaluator-driven score shift to the model. Annotate the trend line at the evaluator-change boundary.	Pass / Fail	AI Platform	medium
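
The expected behaviors in the sample tests can be illustrated with a small sketch. It is not the Patronus AI SDK: `redact`, `snapshot_id`, `RunRecord`, `varied_factors`, and `comparable` are hypothetical names, and the regular expressions are a simplistic assumption that only catches emails and long digit runs (customer names would need a proper PII detector). It shows redaction applied before a dataset is snapshotted, a run record that pins the evaluator and dataset, and a check that refuses comparisons where the evaluator changed or more than one factor varied.

```python
import hashlib
import json
import re
from dataclasses import asdict, dataclass

# Assumed, deliberately simple patterns; real redaction needs a PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ACCOUNT = re.compile(r"\b\d{8,}\b")


def redact(text):
    """Mask emails and long digit runs before text enters a dataset."""
    return ACCOUNT.sub("[ACCOUNT]", EMAIL.sub("[EMAIL]", text))


def snapshot_id(rows):
    """Content hash of the dataset, so a run can name the exact rows it scored."""
    canonical = json.dumps(rows, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class RunRecord:
    prompt_version: str
    model_version: str
    evaluator: str
    evaluator_version: str
    dataset_snapshot: str
    mean_score: float


def varied_factors(a, b):
    """Names of the recorded factors (not the score) that differ between runs."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if k != "mean_score" and da[k] != db[k])


def comparable(a, b):
    """True only if at most one factor varies and evaluator and dataset are fixed."""
    diff = set(varied_factors(a, b))
    pinned = {"evaluator", "evaluator_version", "dataset_snapshot"}
    return len(diff) <= 1 and not (diff & pinned)


rows = [{"input": redact("Mail jane@example.com about account 12345678")}]
snap = snapshot_id(rows)
v1 = RunRecord("prompt-v1", "model-a", "judge", "1.0", snap, 3.8)
v2 = RunRecord("prompt-v2", "model-a", "judge", "1.0", snap, 4.2)
v2_new_judge = RunRecord("prompt-v2", "model-a", "judge", "2.0", snap, 4.5)

print(rows[0]["input"])              # Mail [EMAIL] about account [ACCOUNT]
print(varied_factors(v1, v2))        # ['prompt_version']
print(comparable(v1, v2))            # True
print(comparable(v2, v2_new_judge))  # False
```

In this sketch a changed evaluator version always blocks the comparison, which mirrors test 03: the baseline would have to be re-scored with the new evaluator, producing a new record, before the two runs could be compared.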
How this eval is graded

Grade each response against expected.ideal_behavior and expected.rubric. A pass requires a mean criterion score >= 4.0 and no individual criterion below 3.
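
One reading of this pass rule can be written as a short function. This is a sketch under stated assumptions: the page does not give the score scale or say how the mean is taken, so the code assumes one numeric score per rubric criterion and a mean across criteria. The function name `passes` and the criterion names in the usage lines are hypothetical.

```python
from statistics import mean


def passes(criterion_scores, mean_threshold=4.0, floor=3):
    """Pass if the mean across criteria meets the threshold and none is below the floor.

    criterion_scores maps a criterion name to its numeric score (assumed layout).
    """
    if not criterion_scores:
        raise ValueError("no criterion scores to grade")
    values = list(criterion_scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


print(passes({"reproducibility": 5, "pii_handling": 4, "evaluator_fixed": 3}))  # True
print(passes({"reproducibility": 5, "pii_handling": 5, "evaluator_fixed": 2}))  # False
print(passes({"reproducibility": 4, "pii_handling": 4, "evaluator_fixed": 3}))  # False
```

The second case shows why the floor matters: the mean is still 4.0, but a single criterion scored below 3 fails the response.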

Rubric criteria

Patronus AI
AI Platform
Experiments And Datasets

Recommended for

Patronus AI customers

Works with

Patronus AI

Related evals

AI Platform

