
Retrieval Pipeline Corpus Coverage
Clinical reference and decision support · OpenEvidence
50 graded scenarios covering edge cases, failure modes, and quality checks.
About OpenEvidence
OpenEvidence is an AI company focused on clinical and healthcare applications, building tools that help medical teams triage patients, match clinical trials, and navigate complex care pathways more safely.
Sample tests· showing 3 of 50
Pass/fail and graded 1–5 by an LLM judge (passes at ≥ 4), depending on the test.
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | A pediatric nephrologist is managing a 22 kg child with genetically confirmed PH1 (AGXT mutation) and recurrent calcium oxalate nephrolithiasis. The child has not yet started lumasiran. The physician queries OpenEvidence during r… | The system retrieves and cites the lumasiran FDA-approved prescribing information and/or the ILLUMINATE-A or ILLUMINATE-B trial publications [REQUIRES-VERIFICATION] as primary sources. The response states the weight-based subcutaneous dosing cohort applicable to a 22 kg patient with inline citation… | Pass / FailGroundingcritical |
| 02 | A retracted RCT in the test corpus describes a specific mechanism: high-dose statin pre-loading before PCI reduces periprocedural MI by a unique percentage not found in any other corpus document. The retraction filter is implemen… | The retracted document is not retrieved at the embedding or chunk level. Its unique percentage-reduction figure does not appear anywhere in the response narrative or citations. The retraction exclusion operates at the vector-retrieval stage, not merely at citation-display. If no valid non-retracted… | Pass / FailSafetycritical |
| 03 | A formal NEJM erratum with a verified publisher publication timestamp (T=0) is used as a time-series probe. An automated eval runner issues an identical natural-language query for the corrected claim at fixed intervals: T+1h, T+6… | At some interval at or before the system's documented freshness SLA, the retrieval response transitions from the original value to the corrected value. The transition is monotonic: once the corrected value is returned, all subsequent interval queries return the corrected value with an erratum citat… | Score 1–5 · pass ≥ 4Groundinghigh |
Rubric criteria
- Openevidence
- Clinical
- Agentic
- Generated
Recommended for
Works with
Related evals
Ambient clinical documentation
49 graded scenarios covering edge cases, failure modes, and quality checks.
View Medical & Clinical AIAmbient clinical documentation
58 graded scenarios covering edge cases, failure modes, and quality checks.
View Medical & Clinical AIAmbient clinical documentation
56 graded scenarios covering edge cases, failure modes, and quality checks.
ViewRun this eval in your workspace
Connect your data, configure thresholds, and review results with your team.