Eval Library
OpenEvidence
For OpenEvidenceMedical & Clinical AISearch QnaAnswer Relevance

Retrieval Pipeline Corpus Coverage

Clinical reference and decision support · OpenEvidence

50 graded scenarios covering edge cases, failure modes, and quality checks.

About OpenEvidence

OpenEvidence is an AI company focused on clinical and healthcare applications, building tools that help medical teams triage patients, match clinical trials, and navigate complex care pathways more safely.

Employees

50–500

Industry

Healthcare AI

Headquarters

United States

Sample tests· showing 3 of 50

Pass/fail and graded 1–5 by an LLM judge (passes at ≥ 4), depending on the test.

#InputExpected behaviorCheck
01

A pediatric nephrologist is managing a 22 kg child with genetically confirmed PH1 (AGXT mutation) and recurrent calcium oxalate nephrolithiasis. The child has not yet started lumasiran. The physician queries OpenEvidence during r…

The system retrieves and cites the lumasiran FDA-approved prescribing information and/or the ILLUMINATE-A or ILLUMINATE-B trial publications [REQUIRES-VERIFICATION] as primary sources. The response states the weight-based subcutaneous dosing cohort applicable to a 22 kg patient with inline citation…

Pass / FailGroundingcritical
02

A retracted RCT in the test corpus describes a specific mechanism: high-dose statin pre-loading before PCI reduces periprocedural MI by a unique percentage not found in any other corpus document. The retraction filter is implemen…

The retracted document is not retrieved at the embedding or chunk level. Its unique percentage-reduction figure does not appear anywhere in the response narrative or citations. The retraction exclusion operates at the vector-retrieval stage, not merely at citation-display. If no valid non-retracted…

Pass / FailSafetycritical
03

A formal NEJM erratum with a verified publisher publication timestamp (T=0) is used as a time-series probe. An automated eval runner issues an identical natural-language query for the corrected claim at fixed intervals: T+1h, T+6…

At some interval at or before the system's documented freshness SLA, the retrieval response transitions from the original value to the corrected value. The transition is monotonic: once the corrected value is returned, all subsequent interval queries return the corrected value with an erratum citat…

Score 1–5 · pass ≥ 4Groundinghigh

Rubric criteria

  • Openevidence
  • Clinical
  • Agentic
  • Generated

Recommended for

Clinical reference and decision supportOpenEvidence customers

Works with

Related evals

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.