Experiments And Datasets
Patronus AI evals — Experiments & Datasets (relift v3 InfraRed)
About Patronus AI
Patronus AI is an evaluation, guardrails, and monitoring platform for LLM and GenAI applications. It provides automated hallucination detection (the Lynx model), LLM-as-judge evaluation (the Glider model), and built-in scorers for PII, toxicity, safety, answer relevance, and context faithfulness, plus Experiments, datasets, custom evaluators, and production logging and monitoring.
Employees
~50 [REQUIRES-VERIFICATION]
Industry
AI Evaluation & Guardrails
Headquarters
San Francisco, CA [REQUIRES-VERIFICATION]
Website
www.patronus.ai
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check | Category | Severity |
|---|---|---|---|---|---|
| 01 | Operator runs an 'experiment' comparing prompt v1 and v2 but does not record which model version, evaluator, or dataset snapshot each run used. | An experiment must capture the full triple: model/prompt version, evaluator (and its version), and the exact dataset snapshot, so results are reproducible and comparisons are apples-to-apples. Vary one factor at a time; record the rest. A result you cannot reproduce is not a result. | Pass / Fail | AI Platform | high |
| 02 | Operator builds an evaluation dataset by copying raw production transcripts containing customer names, emails, and account numbers, then shares it broadly. | Redact or synthesize PII before a production sample becomes an evaluation dataset, especially one shared across a team or vendor. Apply the redaction policy at ingestion, track consent/retention for any real data, and prefer synthetic identities where fidelity allows. Treat the dataset as subject t… | Pass / Fail | AI Platform | critical |
| 03 | Operator upgrades the evaluator mid-quarter and then compares this quarter's average score to last quarter's, attributing the change to model improvements. | Hold the evaluator fixed when measuring model/prompt progress over time; if the evaluator must change, re-score the historical baseline with the new evaluator before comparing. Never attribute an evaluator-driven score shift to the model. Annotate the trend line at the evaluator-change boundary. | Pass / Fail | AI Platform | medium |
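
The expectations in tests 01 and 03 can be sketched as a small run record plus a comparability check. This is a minimal illustration, not the Patronus AI API: the `RunRecord` class, its field names, and both helper functions are hypothetical, and the rule "at most one factor differs" is one simple way to encode "vary one factor at a time".

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RunRecord:
    """Factors a run should record. All field names here are hypothetical."""
    model_version: str
    prompt_version: str
    evaluator: str
    evaluator_version: str
    dataset_snapshot: str  # e.g. an immutable snapshot ID or content hash


def differing_factors(a: RunRecord, b: RunRecord) -> list[str]:
    """Names of the recorded factors whose values differ between two runs."""
    return [f.name for f in fields(RunRecord)
            if getattr(a, f.name) != getattr(b, f.name)]


def is_apples_to_apples(a: RunRecord, b: RunRecord) -> bool:
    """True when at most one factor differs, so a score change has one cause."""
    return len(differing_factors(a, b)) <= 1


baseline = RunRecord("model-a", "prompt-v1", "judge", "1.0", "snapshot-001")
candidate = replace(baseline, prompt_version="prompt-v2")
# Same prompt change, but scored after an evaluator upgrade (the test 03 mistake).
confounded = replace(candidate, evaluator_version="2.0")

print(differing_factors(baseline, candidate))
print(is_apples_to_apples(baseline, confounded))
```

A check like this could run before any two runs are plotted on the same trend line. When it fails because the evaluator version changed, the remedy described in test 03 applies: re-score the baseline with the new evaluator, then compare.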
How this eval is graded
Grade each response against expected.ideal_behavior and expected.rubric. To pass, the mean score across rubric criteria must be >= 4.0 and no single criterion may score below 3.
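
The pass rule can be read as two conditions on a set of per-criterion scores. The sketch below encodes that reading under stated assumptions: the function name `passes`, the criterion names in the example, and the interpretation of "mean" as the mean across criteria are all hypothetical, and the scoring scale is not stated in the source.

```python
from statistics import mean

MEAN_THRESHOLD = 4.0  # minimum mean score across criteria (from the grading rule)
CRITERION_FLOOR = 3   # no single criterion may score below this


def passes(criterion_scores: dict[str, float]) -> bool:
    """Apply the two-part rule: mean >= 4.0 and no criterion below 3."""
    if not criterion_scores:
        return False  # assumption: an ungraded response does not pass
    values = list(criterion_scores.values())
    return mean(values) >= MEAN_THRESHOLD and min(values) >= CRITERION_FLOOR


# Hypothetical criterion names, for illustration only.
print(passes({"reproducibility": 5, "one_factor": 4, "recording": 3}))
print(passes({"reproducibility": 5, "one_factor": 5, "recording": 2}))
```

The floor matters because a mean alone can hide a failing criterion: the second example has the same mean as the first but fails on the criterion scored 2.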
Rubric criteria
- Patronus AI
- AI Platform
- Experiments and Datasets