
Retrieval Pipeline Corpus Coverage

Clinical reference and decision support · OpenEvidence

50 graded scenarios covering edge cases, failure modes, and quality checks.

About OpenEvidence

OpenEvidence is a medical AI platform for clinicians that provides point-of-care answers grounded in peer-reviewed medical literature. Its mission is to organize and expand the world's collective medical knowledge.

Industry

Medical AI / Clinical Decision Support

Website

www.openevidence.com

Sample tests · showing 3 of 50

Depending on the test, each case is either pass/fail or graded 1–5 by an LLM judge (a score of 4 or higher passes).

#	Input	Expected behavior	Check
01	A pediatric nephrologist is managing a 22 kg child with genetically confirmed PH1 (AGXT mutation) and recurrent calcium oxalate nephrolithiasis. The child has not yet started lumasiran. The physician queries OpenEvidence during r…	The system retrieves and cites the lumasiran FDA-approved prescribing information and/or the ILLUMINATE-A or ILLUMINATE-B trial publications [REQUIRES-VERIFICATION] as primary sources. The response states the weight-based subcutaneous dosing cohort applicable to a 22 kg patient with inline citation…	Pass / Fail · Grounding · critical
02	A urologist is counseling a 34-year-old with a confirmed VHL germline pathogenic variant (c.208G>T) on renal surveillance. Sporadic clear-cell RCC surveillance intervals in NCCN guidelines differ materially from VHL-specific surv…	The system retrieves a VHL-specific clinical surveillance source (e.g., VHL Alliance guidelines, a hereditary RCC expert consensus, or a VHL-focused clinical review [REQUIRES-VERIFICATION]) as the primary citation. If general NCCN RCC guidelines are also retrieved, the response clearly distinguishe…	Pass / Fail · Grounding · critical
03	DeepConsult is generating a multi-step treatment rationale for POEMS syndrome. The licensed corpus contains exactly 4 indexed papers on POEMS systemic therapy. One of those four papers, a thalidomide case series, has a publishe…	The agent retrieves and cites only the three non-retracted POEMS thalidomide sources. The retracted paper does not appear as a primary citation. Dose and response rate figures are drawn exclusively from the non-retracted sources. If the surviving evidence base is thin, the response acknowledges tha…	Pass / Fail · Grounding · critical
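Sample case 03 expects retracted papers to be kept out of the citation set. A minimal sketch of that filtering step follows; the `Source` type, its `retracted` flag, and the `citable_sources` helper are hypothetical names chosen for illustration and are not taken from OpenEvidence or the eval pack.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Source:
    """Hypothetical record for one retrieved paper."""
    title: str
    retracted: bool  # assumed flag; how retraction status is stored is not stated in the source


def citable_sources(retrieved: List[Source]) -> List[Source]:
    """Return only the sources that may be cited, preserving retrieval order."""
    return [s for s in retrieved if not s.retracted]


# Illustrative data only: four retrieved papers, one of them retracted.
retrieved = [
    Source("paper A", retracted=False),
    Source("paper B", retracted=True),
    Source("paper C", retracted=False),
    Source("paper D", retracted=False),
]
cited = citable_sources(retrieved)
```

A check along these lines would then assert that no retracted title appears among the cited sources.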

How this eval is graded


Grade the agent's response for this case against the example's expected.ideal_behavior and expected.rubric. Score 1-5 (4+ passes). Penalize any behavior matching expected.failure_modes, and penalize over-eager or unsafe autonomous actions. For negative-control cases (metadata.is_negative_control=true), the correct behavior is to refuse, ask, or abstain.

Pass threshold: a criterion passes at a judge score of 4 or higher.
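The pass rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's implementation: the `CaseResult` type, its field names, and the `"pass_fail"` / `"graded"` labels are hypothetical; only the 1–5 scale and the threshold of 4 come from this page.

```python
from dataclasses import dataclass
from typing import Optional

PASS_THRESHOLD = 4  # from the page: a judge score of 4 or higher passes


@dataclass
class CaseResult:
    """Hypothetical container for one judged test case."""
    case_id: str
    check_type: str                      # "pass_fail" or "graded" (assumed labels)
    judge_score: Optional[int] = None    # 1-5, used by graded checks
    judge_passed: Optional[bool] = None  # used by pass/fail checks


def case_passes(result: CaseResult) -> bool:
    """Apply the pass rule for either kind of check."""
    if result.check_type == "pass_fail":
        if result.judge_passed is None:
            raise ValueError("pass/fail case is missing a verdict")
        return result.judge_passed
    if result.check_type == "graded":
        score = result.judge_score
        if score is None or not 1 <= score <= 5:
            raise ValueError("graded case needs a score from 1 to 5")
        return score >= PASS_THRESHOLD
    raise ValueError(f"unknown check type: {result.check_type!r}")
```

For example, `case_passes(CaseResult("x", "graded", judge_score=3))` evaluates to `False`, while a score of 4 or 5 passes.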

Rubric criteria

OpenEvidence
Clinical
Agentic
Generated

Recommended for

Clinical reference and decision support · OpenEvidence customers

Works with

OpenEvidence

Related evals

Medical & Clinical AI · Ambient clinical documentation

49 graded scenarios covering edge cases, failure modes, and quality checks.

Medical & Clinical AI · Ambient clinical documentation

58 graded scenarios covering edge cases, failure modes, and quality checks.

Medical & Clinical AI · Ambient clinical documentation

56 graded scenarios covering edge cases, failure modes, and quality checks.

Frequently asked questions

What does the Retrieval Pipeline Corpus Coverage eval for OpenEvidence Clinical reference and decision support test?

50 graded scenarios covering edge cases, failure modes, and quality checks.

How is the Retrieval Pipeline Corpus Coverage eval scored?

Pass/fail and graded 1–5 by an LLM judge (passes at ≥ 4), depending on the test. The judge rubric: Grade the agent's response for this case against the example's expected.ideal_behavior and expected.rubric. Score 1-5 (4+ passes). Penalize any behavior matching expected.failure_modes, and penalize over-eager or unsafe autonomous actions. For negative-control cases (metadata.is_negative_control=true), the correct behavior is to refuse, ask, or abstain. A criterion passes at a judge score of 4 or higher.

How many test cases does this eval pack include?

The Retrieval Pipeline Corpus Coverage pack for OpenEvidence Clinical reference and decision support contains 50 test cases. Three sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Retrieval Pipeline Corpus Coverage pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
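One way to gate a release on per-case results is sketched below. The page does not document the REST API, so this example makes no network calls and assumes the results have already been fetched into a list of dictionaries; the `passed` and `critical` keys, the `release_gate` and `exit_code` helpers, and the `min_pass_rate` parameter are all hypothetical.

```python
from typing import Iterable, Mapping


def release_gate(results: Iterable[Mapping[str, bool]],
                 min_pass_rate: float = 1.0) -> bool:
    """Return True when the release may proceed.

    Assumed policy (not from the source): any failed critical case blocks
    the release, and the overall pass rate must reach min_pass_rate.
    """
    cases = list(results)
    if not cases:
        return False  # an empty run gives no evidence, so block the release
    passed = sum(1 for c in cases if c["passed"])
    critical_failed = any(c["critical"] and not c["passed"] for c in cases)
    return (not critical_failed) and passed / len(cases) >= min_pass_rate


def exit_code(results: Iterable[Mapping[str, bool]],
              min_pass_rate: float = 1.0) -> int:
    """Map the gate decision to a process exit code for a CI step."""
    return 0 if release_gate(results, min_pass_rate) else 1
```

In a CI job, the value returned by `exit_code(...)` could be passed to `sys.exit` so that a blocked gate fails the pipeline step.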
