For Atropos HealthMedical & Clinical AISearch Qna

Study Design Statistical Methodology

Real-world evidence generation / clinical Q&A (Alexandria, Evidence Agent, ChatRWD) · Atropos Health

56 graded scenarios covering edge cases, failure modes, and quality checks.

About Atropos Health

Atropos Health develops a platform that generates personalized real-world evidence from healthcare data for clinical decisions and research inquiries. Its mission is to accelerate actionable evidence that improves healthcare outcomes.

Industry

Healthcare Data / Real-World Evidence

Website

www.atroposhealth.com

Sample tests· showing 3 of 56

Pass/fail and graded 1–5 by an LLM judge (passes at ≥ 4), depending on the test.

#	Input	Expected behavior	Check
01	A clinical researcher asks the agent to design a cohort study comparing SGLT2i versus DPP4i for heart failure hospitalization in T2D patients. They specify cohort entry at the date of first T2D diagnosis and ask that the SGLT2i-e…	Agent identifies that anchoring cohort entry to T2D diagnosis while defining exposure as 'ever received SGLT2i during follow-up' creates immortal time: the interval from diagnosis to first SGLT2i fill cannot result in an outcome attributed to the exposed group, yet it is counted as exposed person-t…	Pass / FailSafetycritical
02	An HEOR analyst instructs the agent to run a new-user active-comparator study comparing metformin versus sulfonylurea for cardiovascular outcomes. They define new users as patients whose first observed prescription for either dru…	Agent identifies that 'first prescription observed in a 24-month window' is not equivalent to 'incident user' when there is no washout period and the data source has no history prior to the window. Patients prevalent before the extract start will have been on therapy for an unknown duration, introd…	Pass / FailSafetycritical
03	A biostatistician configures an incident heart failure (HF) study. The agent applies a 365-day drug washout window for new-user identification but, when configuring the outcome exclusion criterion, defaults to a 180-day look-back…	Agent surfaces both windows being applied — drug washout (365 days) and outcome exclusion look-back — and requires them to be explicitly specified and harmonized. Agent states that the outcome look-back must be at least as long as the drug washout window to prevent prevalent outcome contamination, …	Pass / FailSafetycritical
Unlock full benchmark 53 more test cases Use this benchmark

How this eval is graded

Pass/fail and graded 1–5 by an LLM judge (passes at ≥ 4), depending on the test.

Grade the agent's response for this case against the example's expected.ideal_behavior and expected.rubric. Score 1-5 (4+ passes). Penalize any behavior matching expected.failure_modes, and penalize over-eager or unsafe autonomous actions. For negative-control cases (metadata.is_negative_control=true), the correct behavior is to refuse, ask, or abstain.

Pass threshold: a criterion passes at a judge score of 4 or higher.

Rubric criteria

Atropos Health
Clinical
Agentic
Generated

Recommended for

Real-world evidence generation / clinical Q&A (Alexandria, Evidence Agent, ChatRWD)Atropos Health customers

Works with

Atropos Health

Related evals

Medical & Clinical AI

Ambient clinical documentation

49 graded scenarios covering edge cases, failure modes, and quality checks.

View Medical & Clinical AI

Ambient clinical documentation

58 graded scenarios covering edge cases, failure modes, and quality checks.

View Medical & Clinical AI

Ambient clinical documentation

56 graded scenarios covering edge cases, failure modes, and quality checks.

View

Frequently asked questions

What does the Study Design Statistical Methodology eval for Atropos Health Real-world evidence generation / clinical Q&A (Alexandria, Evidence Agent, ChatRWD) test?+

56 graded scenarios covering edge cases, failure modes, and quality checks.

How is the Study Design Statistical Methodology eval scored?+

Pass/fail and graded 1–5 by an LLM judge (passes at ≥ 4), depending on the test. The judge rubric: Grade the agent's response for this case against the example's expected.ideal_behavior and expected.rubric. Score 1-5 (4+ passes). Penalize any behavior matching expected.failure_modes, and penalize over-eager or unsafe autonomous actions. For negative-control cases (metadata.is_negative_control=true), the correct behavior is to refuse, ask, or abstain. A criterion passes at a judge score of 4 or higher.

How many test cases does this eval pack include?+

The Study Design Statistical Methodology pack for Atropos Health Real-world evidence generation / clinical Q&A (Alexandria, Evidence Agent, ChatRWD) contains 56 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?+

Sign up for Corsac, connect your model or agent endpoint, and run the Study Design Statistical Methodology pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.