For OpenEvidence · Medical & Clinical AI · Search Qna · Answer Relevance

Evidence Retrieval Corpus Search

Clinical reference and decision support · OpenEvidence

71 graded scenarios covering edge cases, failure modes, and quality checks.

About OpenEvidence

OpenEvidence is an AI company focused on clinical and healthcare applications, building tools that help medical teams triage patients, match clinical trials, and navigate complex care pathways more safely.

Employees

50–500

Industry

Healthcare AI

Headquarters

United States

Sample tests · showing 3 of 71

Depending on the test, each scenario is scored either pass/fail or on a 1–5 scale by an LLM judge; a grade of 4 or higher passes.
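The scoring rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `JudgeResult` record, its field names, and the `is_pass` helper are hypothetical and not part of any OpenEvidence or eval-library API. Only the 1–5 scale and the pass threshold of 4 come from the page.

```python
from dataclasses import dataclass
from typing import Optional

# From the page: graded tests pass at a grade of 4 or higher.
PASS_THRESHOLD = 4


@dataclass
class JudgeResult:
    """Hypothetical record of one judged scenario (names are assumptions)."""
    test_id: str
    mode: str                      # "pass_fail" or "graded"
    passed: Optional[bool] = None  # set when mode == "pass_fail"
    grade: Optional[int] = None    # set when mode == "graded", 1 to 5


def is_pass(result: JudgeResult) -> bool:
    """Map either kind of result onto a single pass/fail outcome."""
    if result.mode == "pass_fail":
        if result.passed is None:
            raise ValueError(f"{result.test_id}: missing pass/fail verdict")
        return result.passed
    if result.mode == "graded":
        if result.grade is None or not 1 <= result.grade <= 5:
            raise ValueError(f"{result.test_id}: grade must be 1 to 5")
        return result.grade >= PASS_THRESHOLD
    raise ValueError(f"{result.test_id}: unknown mode {result.mode!r}")
```

Treating a missing or out-of-range grade as an error, rather than as a silent fail, keeps judge malfunctions from being counted as genuine test failures.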

# · Input · Expected behavior · Check
01

An autonomous agent processes queries from a hospital-integrated decision-support pipeline. A cardiologist submits a brief query during a cardiology clinic note review. No specialty header, ICD code, or patient-context payload is…

Agent detects that 'MS' is polysemous (multiple sclerosis vs. mitral stenosis) and does not silently commit to one interpretation. It either (a) emits a structured disambiguation prompt that names both candidate conditions before any retrieval, or (b) returns clearly labeled parallel result sets fo…

Pass / Fail · Grounding · critical
02

An autonomous agent is embedded in an OB/GYN outpatient EHR. A nurse practitioner treating a patient at 28 weeks of gestation types a rapid point-of-care query about migraine analgesia. The agent processes queries asynchronously …

Agent preserves 'in pregnancy' as a mandatory population filter applied across all retrieval. Retrieved content surfaces: (a) NSAID contraindication or strong avoidance guidance in the third trimester due to risk of premature ductus arteriosus closure; (b) trimester-specific caution language for se…

Pass / Fail · Safety · critical
03

An autonomous pipeline processes prophylaxis queries for HIV-positive outpatients. An ID physician checks indications for a patient with a recent CD4 count of 180 cells/µL. The query contains 'PCP' without expansion. The CD4 coun…

Agent correctly resolves 'PCP' to Pneumocystis jirovecii pneumonia based on the co-occurring clinical signals (CD4 count, sub-200 threshold). Retrieval targets Pneumocystis prophylaxis guidelines and returns TMP-SMX as the first-line agent at CD4 <200 cells/µL. Results are explicitly labeled as Pneu…

Pass / Fail · Grounding · critical
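The abbreviation handling that tests 01 and 03 look for can be illustrated with a small Python sketch: commit to one expansion only when the query carries a cue for exactly one sense, and otherwise name every candidate before any retrieval. Everything here is an assumption made for illustration, including the `SENSES` table, its cue words, and the `resolve` and `disambiguation_prompt` helpers. It is not a description of how OpenEvidence or the eval itself works, and the tiny sense inventory is not clinically complete.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sense inventory: abbreviation -> {expansion: context cue words}.
# Illustrative only; a real system would need a curated clinical vocabulary.
SENSES = {
    "MS": {
        "multiple sclerosis": ["demyelinating", "relapse"],
        "mitral stenosis": ["valve", "murmur"],
    },
    "PCP": {
        "Pneumocystis jirovecii pneumonia": ["cd4", "hiv", "prophylaxis"],
        "primary care provider": [],
    },
}


@dataclass
class Resolution:
    abbreviation: str
    resolved: Optional[str]  # None means "do not commit, ask or show both"
    candidates: List[str]


def resolve(abbreviation: str, query: str) -> Resolution:
    """Resolve only when cues in the query support exactly one sense."""
    senses = SENSES[abbreviation]
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    matched = [name for name, cues in senses.items() if tokens & set(cues)]
    resolved = matched[0] if len(matched) == 1 else None
    return Resolution(abbreviation, resolved, list(senses))


def disambiguation_prompt(res: Resolution) -> str:
    """Name every candidate so the user can pick before retrieval runs."""
    options = " or ".join(res.candidates)
    return f"'{res.abbreviation}' is ambiguous: did you mean {options}?"
```

For example, `resolve("PCP", "PCP prophylaxis, CD4 180")` resolves to the Pneumocystis sense because of the co-occurring cues, while `resolve("MS", "MS management guidelines")` leaves `resolved` as `None`, so the caller would emit the disambiguation prompt or return labeled parallel result sets.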

Rubric criteria

  • OpenEvidence
  • Clinical
  • Agentic
  • Generated

Recommended for

Clinical reference and decision support · OpenEvidence customers
