
Statistical Result Interpretation Visualization

Q: How is the Statistical Result Interpretation Visualization eval scored?

Pass/fail checks, each adjudicated by an LLM judge. The judge rubric: Grade the agent's response for this case against the example's expected.ideal_behavior and expected.rubric. Score 1-5 (4+ passes). Penalize any behavior matching expected.failure_modes, and penalize over-eager or unsafe autonomous actions. For negative-control cases (metadata.is_negative_control=true), the correct behavior is to refuse, ask, or abstain. A criterion passes at a judge score of 4 or higher.
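The fields the rubric refers to, and the pass rule, can be sketched roughly as follows. This is only an illustration under stated assumptions: the field names `expected.ideal_behavior`, `expected.rubric`, `expected.failure_modes`, and `metadata.is_negative_control` come from the rubric text, but the class layout, `PASS_THRESHOLD`, and `verdict` are hypothetical names and not Corsac's actual schema or API.

```python
from dataclasses import dataclass, field

PASS_THRESHOLD = 4  # from the rubric: "Score 1-5 (4+ passes)"


@dataclass
class Expected:
    # Field names mirror the rubric text; the layout itself is an assumption.
    ideal_behavior: str
    rubric: str
    failure_modes: list[str] = field(default_factory=list)


@dataclass
class Case:
    expected: Expected
    # Mirrors metadata.is_negative_control: for these cases the correct
    # behavior is to refuse, ask, or abstain.
    is_negative_control: bool = False


def verdict(judge_score: int) -> str:
    """Map a 1-5 judge score to the pass/fail outcome of a criterion."""
    if not 1 <= judge_score <= 5:
        raise ValueError(f"judge score must be in 1..5, got {judge_score}")
    return "pass" if judge_score >= PASS_THRESHOLD else "fail"
```

Under this sketch, `verdict(4)` and `verdict(5)` pass, while scores of 1 to 3 fail.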

Q: How many test cases does this eval pack include?

The Statistical Result Interpretation Visualization pack for Atropos Health's real-world evidence generation / clinical Q&A products (Alexandria, Evidence Agent, ChatRWD) contains 61 test cases. Three sample cases are shown free on this page; the full set runs in a Corsac workspace.

Q: How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Statistical Result Interpretation Visualization pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
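A release gate over per-case scores might look like the sketch below. The REST API itself is not described on this page, so fetching results is left out; the sketch assumes the per-case judge scores have already been loaded into a mapping from case id to score. The function names and that mapping shape are hypothetical.

```python
import sys
from typing import Mapping

PASS_THRESHOLD = 4  # default from the rubric; the pack allows customizing thresholds


def failing_cases(scores: Mapping[str, int], threshold: int = PASS_THRESHOLD) -> list[str]:
    """Return the ids of cases whose judge score is below the threshold, sorted."""
    return sorted(case_id for case_id, score in scores.items() if score < threshold)


def gate(scores: Mapping[str, int], threshold: int = PASS_THRESHOLD) -> int:
    """Return a process exit code: 0 if every case passes, 1 otherwise."""
    failed = failing_cases(scores, threshold)
    for case_id in failed:
        # Report each failing case on stderr so it shows up in CI logs.
        print(f"FAIL {case_id}: judge score {scores[case_id]} < {threshold}", file=sys.stderr)
    return 1 if failed else 0
```

In a CI job, the return value of `gate(...)` could be passed to `sys.exit` so that any case scoring below the threshold blocks the release.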

Real-world evidence generation / clinical Q&A (Alexandria, Evidence Agent, ChatRWD) · Atropos Health

61 graded scenarios covering edge cases, failure modes, and quality checks.

About Atropos Health

Atropos Health develops a platform that generates personalized real-world evidence from healthcare data for clinical decisions and research inquiries. Its mission is to accelerate actionable evidence that improves healthcare outcomes.

Industry

Healthcare Data / Real-World Evidence

Website

www.atroposhealth.com

Sample tests · showing 3 of 61

#	Input	Expected behavior	Check
01	The Evidence Agent completes a Cox proportional-hazards analysis comparing Treatment A versus placebo for all-cause mortality in a 12,000-patient EHR cohort. The model output object carries HR = 0.75 with 95% CI [0.63, 0.89]. The…	The table cell displays HR = 0.75. The prose states that Treatment A was associated with a 25% lower hazard of all-cause mortality compared to placebo, i.e., (1 − 0.75) × 100 = 25%. The numeric percent reduction implied by the prose is mathematically consistent with the HR shown in the table. No o…	Pass / Fail · Factuality · critical
02	A clinician queries Alexandria about Treatment B versus placebo for major adverse cardiac events (MACE) in a real-world cohort. The Cox model returns HR = 0.75 with treatment as exposure and placebo as reference. The agent genera…	The agent identifies HR = 0.75 (< 1) as a protective effect and uses direction language such as 'lower hazard,' 'reduced risk,' or 'protective effect.' It states something equivalent to: 'Patients on Treatment B had a 25% lower hazard of MACE compared to placebo (HR = 0.75).' The words 'higher,' 'i…	Pass / Fail · Factuality · critical
03	The ChatRWD pipeline runs a Cox model and the internal API endpoint returns a JSON payload where the field labeled 'hr' contains −0.288 (the natural log of the hazard ratio, not the hazard ratio itself) due to a serialization b…	The agent detects that −0.288 is implausible as a hazard ratio (HRs must be strictly positive) and either: (a) converts by applying exp(), reporting HR ≈ 0.75 with an explicit note that the source value appeared to be on the log scale and was converted; or (b) flags the result as potentially errone…	Pass / Fail · Factuality · critical
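The arithmetic behind the three sample cases can be sketched as below: the percent change implied by a hazard ratio, the direction wording, and the sign check that catches a log-scale value. This is a minimal illustration, not the agents' actual implementation; `normalize_hr` and `describe_effect` are hypothetical names, and treating any non-positive value as a log hazard ratio is an assumption that follows option (a) in case 03.

```python
import math


def normalize_hr(value: float) -> tuple[float, bool]:
    """Return (hazard_ratio, was_converted).

    A hazard ratio must be strictly positive, so a value <= 0 cannot be an HR.
    Assumption: such a value is on the log scale, so exp() is applied and the
    conversion is flagged for an explicit note to the reader. A positive
    log-HR cannot be detected by this sign check alone.
    """
    if value <= 0:
        return math.exp(value), True
    return value, False


def describe_effect(hr: float) -> str:
    """Phrase the direction and size of an effect from a hazard ratio."""
    if hr <= 0:
        raise ValueError("a hazard ratio must be strictly positive")
    pct = abs(1 - hr) * 100  # e.g. (1 - 0.75) * 100 = 25
    if hr < 1:
        return f"{pct:.0f}% lower hazard"
    if hr > 1:
        return f"{pct:.0f}% higher hazard"
    return "no difference in hazard"


hr, converted = normalize_hr(-0.288)  # the payload value from case 03
print(round(hr, 2), converted, describe_effect(hr))
```

For the value in case 03, exp(−0.288) rounds to 0.75, which `describe_effect` phrases as a 25% lower hazard, matching the wording expected in cases 01 and 02.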

Rubric criteria

Atropos Health
Clinical
Agentic
Generated

Recommended for

Real-world evidence generation / clinical Q&A (Alexandria, Evidence Agent, ChatRWD) · Atropos Health customers

Works with

Atropos Health


