
Statistical Result Interpretation Visualization
Real-world evidence generation / clinical Q&A (Alexandria, Evidence Agent, ChatRWD) · Atropos Health
61 graded scenarios covering edge cases, failure modes, and quality checks.
About Atropos Health
Atropos Health is an AI company focused on clinical and healthcare applications, building tools that help medical teams triage patients, match clinical trials, and navigate complex care pathways more safely.
Sample tests · showing 3 of 61
Pass/fail checks, each adjudicated by an LLM judge.
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | The Evidence Agent completes a Cox proportional-hazards analysis comparing Treatment A versus placebo for all-cause mortality in a 12,000-patient EHR cohort. The model output object carries HR = 0.75 with 95% CI [0.63, 0.89]. The… | The table cell displays HR = 0.75. The prose states that Treatment A was associated with a 25% lower hazard of all-cause mortality compared to placebo, i.e., (1 − 0.75) × 100 = 25%. The numeric percent reduction implied by the prose is mathematically consistent with the HR shown in the table. No o… | Pass / Fail · Factuality · critical |
| 02 | An Evidence Agent runs a multi-endpoint Cox model returning a structured results object with three endpoint-specific HRs for SGLT2 inhibitors versus DPP-4 inhibitors in a heart failure cohort: all-cause mortality (HR = 0.75, fiel… | The agent selects the 'all_cause_mortality' endpoint (HR = 0.75) and reports it as the answer to the all-cause mortality question. The output explicitly names the endpoint as 'all-cause mortality' and cites HR = 0.75. HR = 0.62 and HR = 0.91 are not presented as the primary answer; if mentioned, th… | Pass / Fail · Grounding · critical |
| 03 | An Evidence Agent session previously computed and cached HR = 0.62 for Drug A versus Drug B (all-cause mortality, hypertension cohort). The user now submits a new query for Drug C versus Drug D in the same cohort and endpoint. Du… | The agent triggers a fresh model computation for Drug C versus Drug D. The returned HR is 1.15 (or the correct value for that comparison, not 0.62), labeled with Drug C as exposure and Drug D as reference. The response includes provenance indicating a new computation was performed (e.g., run times… | Pass / Fail · Workflow · critical |
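
The arithmetic in test 01 and the cache scoping in test 03 can be illustrated with a minimal Python sketch. This is not Atropos Health's implementation or the LLM judge's logic. The function names, the tolerance, and the cache-key layout are all hypothetical, chosen only to show the kind of consistency the checks look for.

```python
import math


def percent_hazard_reduction(hr: float) -> float:
    """Percent change in hazard implied by a hazard ratio.

    Positive values mean a lower hazard than the reference group,
    negative values mean a higher hazard (e.g. HR = 1.15).
    """
    if hr <= 0:
        raise ValueError("hazard ratio must be positive")
    return (1.0 - hr) * 100.0


def prose_matches_hr(hr: float, stated_percent: float, tol: float = 0.5) -> bool:
    """Check that a percent reduction quoted in prose agrees with the HR.

    `tol` (in percentage points) is an assumed rounding allowance.
    """
    return math.isclose(percent_hazard_reduction(hr), stated_percent, abs_tol=tol)


def cache_key(exposure: str, reference: str, endpoint: str, cohort: str) -> tuple:
    """Hypothetical cache key: a cached HR is reusable only when the
    exposure, reference, endpoint, and cohort all match."""
    return tuple(part.strip().lower() for part in (exposure, reference, endpoint, cohort))


# Test 01: HR = 0.75 implies a 25% lower hazard.
print(percent_hazard_reduction(0.75))
print(prose_matches_hr(0.75, 25))

# Test 03: a different drug pair yields a different key, so the cached
# HR = 0.62 for Drug A versus Drug B must not be returned.
old = cache_key("Drug A", "Drug B", "all-cause mortality", "hypertension")
new = cache_key("Drug C", "Drug D", "all-cause mortality", "hypertension")
print(old == new)
```

Keying the cache on the full comparison, rather than on cohort and endpoint alone, is one simple way to avoid the stale-result failure that test 03 describes.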
Rubric criteria
- Atropos Health
- Clinical
- Agentic
- Generated
Related evals
Ambient clinical documentation
49 graded scenarios covering edge cases, failure modes, and quality checks.
Medical & Clinical AI · Ambient clinical documentation
58 graded scenarios covering edge cases, failure modes, and quality checks.
Medical & Clinical AI · Ambient clinical documentation
56 graded scenarios covering edge cases, failure modes, and quality checks.