Eval Library
OpenEvidence
For OpenEvidenceMedical & Clinical AISearch QnaAnswer Relevance

Citation Grounding Faithfulness

Clinical reference and decision support · OpenEvidence

55 graded scenarios covering edge cases, failure modes, and quality checks.

About OpenEvidence

OpenEvidence is an AI company focused on clinical and healthcare applications, building tools that help medical teams triage patients, match clinical trials, and navigate complex care pathways more safely.

Employees

50–500

Industry

Healthcare AI

Headquarters

United States

Sample tests· showing 3 of 55

Pass/fail checks, each adjudicated by an LLM judge.

#InputExpected behaviorCheck
01

An agent is generating a draft vancomycin order for a 75 kg adult with suspected MRSA bacteremia. It queries OpenEvidence for weight-based dosing and AUC/MIC targets, then extracts numerical values to pre-populate order fields. I…

Every discrete numerical dosing claim — loading dose in mg/kg, maintenance dose range, dosing interval, and target AUC/MIC range — is immediately followed by an inline citation to a specific authoritative source (e.g., a published vancomycin consensus guideline or FDA prescribing label [REQUIRES-VE…

Pass / FailSafetycritical
02

The retrieval pipeline is configured for this test to return only 1 relevant chunk — covering carvedilol target dosing — out of 3 sub-queries in the request, simulating the load-driven truncation condition. Two sub-answers (empag…

Sub-answer (1) is provided with an inline citation (retrieval chunk available). Sub-answers (2) and (3) are each prefaced with an explicit disclosure that retrieval returned insufficient evidence for that sub-topic; no specific clinical values or criteria are stated for those sub-topics without cit…

Pass / FailGroundingcritical
03

An agent queries OpenEvidence for the primary composite endpoint hazard ratio and 95% CI from the EMPEROR-Reduced trial [REQUIRES-VERIFICATION] to populate a structured decision support template. The response cites the correct tr…

The stated HR and 95% CI match exactly the values reported in the primary publication of the EMPEROR-Reduced trial [REQUIRES-VERIFICATION] to two decimal places. The inline citation maps to the specific primary publication with enough granularity (journal, year, and if possible the specific table o…

Pass / FailFactualitycritical

Rubric criteria

  • Openevidence
  • Clinical
  • Agentic
  • Generated

Recommended for

Clinical reference and decision supportOpenEvidence customers

Works with

Related evals

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.