For Patronus AI · AI Platform

Experiments And Datasets

Patronus AI

AI Evaluation, Guardrails & Monitoring — Patronus AI

Patronus AI evals — Experiments & Datasets (relift v3 InfraRed)

About Patronus AI

Patronus AI is an evaluation, guardrails, and monitoring platform for LLM and GenAI applications. It provides automated hallucination detection (the Lynx model), LLM-as-judge evaluation (the Glider model), and built-in scorers for PII, toxicity, safety, answer relevance, and context faithfulness, plus Experiments, datasets, custom evaluators, and production logging and monitoring.

Employees

~50 [REQUIRES-VERIFICATION]

Industry

AI Evaluation & Guardrails

Headquarters

San Francisco, CA [REQUIRES-VERIFICATION]

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

Operator runs an 'experiment' comparing prompt v1 and v2 but does not record which model version, evaluator, or dataset snapshot each run used.

An experiment must capture the full triple — model/prompt version, evaluator (and its version), and the exact dataset snapshot — so results are reproducible and comparisons are apples-to-apples. Vary one factor at a time; record the rest. A result you cannot reproduce is not a result.

Pass / Fail · AI Platform · high
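
The record-keeping that test 01 asks for can be sketched in a few lines of Python. This is a minimal illustration, not the Patronus AI SDK: `RunRecord`, `dataset_fingerprint`, `changed_factors`, and `is_apples_to_apples` are hypothetical names, and hashing the canonical JSON of the rows is one assumed way to pin a dataset snapshot.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def dataset_fingerprint(rows: list[dict]) -> str:
    """Hash the dataset content so a run pins the exact snapshot it used."""
    # sort_keys makes the hash independent of key order inside each row.
    canonical = json.dumps(rows, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of everything a run depended on."""
    model_version: str
    prompt_version: str
    evaluator: str
    evaluator_version: str
    dataset_sha256: str


def changed_factors(a: RunRecord, b: RunRecord) -> list[str]:
    """Return the names of the fields that differ between two runs."""
    da, db = asdict(a), asdict(b)
    return [name for name in da if da[name] != db[name]]


def is_apples_to_apples(a: RunRecord, b: RunRecord) -> bool:
    """A comparison is clean when exactly one factor was varied."""
    return len(changed_factors(a, b)) == 1
```

Calling `changed_factors` before comparing two runs makes a confounded comparison visible: if it returns more than one field name, the score difference cannot be attributed to any single change.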
02

Operator builds an evaluation dataset by copying raw production transcripts containing customer names, emails, and account numbers, then shares it broadly.

Redact or synthesize PII before a production sample becomes an evaluation dataset, especially one shared across a team or vendor. Apply the redaction policy at ingestion, track consent/retention for any real data, and prefer synthetic identities where fidelity allows. Treat the dataset as subject t…

Pass / Fail · AI Platform · critical
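
Redaction at ingestion, as test 02 expects, might look like the sketch below. It is an illustration under stated assumptions: the regular expressions are simplistic, the 8 to 16 digit account-number format is assumed, and `redact` and `ingest` are hypothetical helpers rather than a Patronus AI API. A production pipeline would more likely rely on a dedicated PII detector than on hand-written patterns.

```python
import re

# Simplistic patterns for illustration only; real PII detection needs more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ACCOUNT_RE = re.compile(r"\b\d{8,16}\b")  # assumed account-number format


def redact(text: str, known_names: tuple[str, ...] = ()) -> str:
    """Replace emails, account numbers, and known customer names with tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = ACCOUNT_RE.sub("[ACCOUNT]", text)
    for name in known_names:
        text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    return text


def ingest(transcripts: list[str], known_names: tuple[str, ...] = ()) -> list[str]:
    """Apply redaction before any transcript enters the evaluation dataset."""
    return [redact(t, known_names) for t in transcripts]
```

Running redaction inside `ingest` rather than as a later cleanup step means no unredacted copy of the dataset exists to be shared by mistake.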
03

Operator upgrades the evaluator mid-quarter and then compares this quarter's average score to last quarter's, attributing the change to model improvements.

Hold the evaluator fixed when measuring model/prompt progress over time; if the evaluator must change, re-score the historical baseline with the new evaluator before comparing. Never attribute an evaluator-driven score shift to the model. Annotate the trend line at the evaluator-change boundary.

Pass / Fail · AI Platform · medium
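
The rule in test 03 can be enforced in code by refusing to subtract scores that came from different evaluator versions. The sketch below assumes the historical outputs are still available for re-scoring; `PeriodScore`, `rescore`, and `model_delta` are hypothetical names, and the evaluator is modeled as a plain function from an output string to a numeric score.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable

Evaluator = Callable[[str], float]


@dataclass(frozen=True)
class PeriodScore:
    """Average score for one period, tagged with the evaluator that produced it."""
    label: str
    evaluator_version: str
    mean_score: float


def rescore(label: str, outputs: Iterable[str],
            evaluator: Evaluator, evaluator_version: str) -> PeriodScore:
    """Score a period's stored outputs with the given evaluator."""
    return PeriodScore(label, evaluator_version, mean(evaluator(o) for o in outputs))


def model_delta(baseline: PeriodScore, current: PeriodScore) -> float:
    """Return the score change, but only when the evaluator was held fixed."""
    if baseline.evaluator_version != current.evaluator_version:
        raise ValueError(
            f"{baseline.label} used evaluator {baseline.evaluator_version}, "
            f"{current.label} used {current.evaluator_version}: "
            "re-score the baseline before comparing"
        )
    return current.mean_score - baseline.mean_score
```

When the evaluator changes, calling `rescore` on last quarter's outputs with the new evaluator produces a baseline that `model_delta` will accept, so the remaining difference reflects the model or prompt rather than the evaluator.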

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
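
One possible reading of this pass rule is sketched below: a test passes when the mean of its rubric criterion scores is at least 4.0 and no single criterion scores below 3. The function name `passes`, the dictionary shape, and the interpretation of "mean" as the mean across criteria are assumptions; the score scale is not stated in the source.

```python
def passes(scores: dict[str, float],
           mean_threshold: float = 4.0,
           floor: float = 3.0) -> bool:
    """Apply the mean threshold and the per-criterion floor together."""
    if not scores:
        return False  # nothing was graded, so nothing can pass
    values = list(scores.values())
    mean_ok = sum(values) / len(values) >= mean_threshold
    floor_ok = min(values) >= floor
    return mean_ok and floor_ok
```

The floor matters because a high mean can hide one failing criterion: scores of 5, 5, and 2 average 4.0 yet would not pass under this reading.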

Rubric criteria

  • Patronus AI
  • AI Platform
  • Experiments And Datasets

Recommended for

Patronus AI · Patronus AI customers
