
Clinical Feature Engineering Data Quality
Sepsis and clinical-deterioration prediction · Bayesian Health
60 graded scenarios covering edge cases, failure modes, and quality checks.
About Bayesian Health
Bayesian Health is an AI company focused on clinical and healthcare applications, building tools that help medical teams triage patients, match clinical trials, and navigate complex care pathways more safely.
Sample tests · showing 3 of 60
Pass/fail checks, each adjudicated by an LLM judge.
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | A code review surfaces that the 1-hour HR window is computed as [T−60min, T) (exclusive upper bound) in the real-time inference module but as [T−60min, T] (inclusive upper bound) in the legacy batch-scoring path. The training dat… | The agent (1) identifies that the training dataset was built with the inclusive boundary [T−60min, T], meaning the model was trained on that distribution; (2) explains that flipping the real-time path to inclusive would match the batch convention but model revalidation is still required because agg… | Pass / Fail · Safety · critical |
| 02 | A data scientist notices that max-HR-6h values occasionally exceed 250 bpm for patients whose nursing flowsheets document normal rates. Initial investigation suggests the pipeline deduplicates rows first (keeping the first-seen r… | The agent (1) identifies that deduplication runs before outlier rejection and that the keep-first deduplication strategy can promote sentinel or artifact values that would otherwise be rejected; (2) states the correct processing order: ingest raw rows → apply physiologically grounded outlier reject… | Pass / Fail · Workflow · critical |
| 03 | A model performance review flags suspiciously high training AUROC and low prospective AUROC, suggesting label leakage. The training code defines the label timestamp T as the time the sepsis bundle order was placed in the EHR. A d… | The agent (1) confirms that including post-treatment HR readings in the training window is a form of label leakage; (2) identifies that T=sepsis-bundle-order-time is itself a leaky anchor: by the time a sepsis bundle is ordered, initial interventions (IV fluids, empiric antibiotics, oxygen) may ha… | Pass / Fail · Factuality · critical |
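
The first two scenarios turn on small implementation choices, so a minimal Python sketch may help make them concrete. It is an illustration under stated assumptions, not the pipeline the scenarios describe: the function names, the `(timestamp, value)` row shape, and the 20 to 250 bpm plausibility bounds are all hypothetical. The table text for scenario 02 is truncated after the outlier-rejection step, so the sketch only assumes that deduplication follows rejection.

```python
from datetime import datetime, timedelta

# Hypothetical plausibility bounds for heart rate (bpm); not taken from the source.
HR_MIN_BPM, HR_MAX_BPM = 20, 250


def window_values(readings, t, minutes=60, inclusive_upper=False):
    """Return values whose timestamps fall in [t - minutes, t) or [t - minutes, t]."""
    start = t - timedelta(minutes=minutes)
    return [
        value
        for ts, value in readings
        if start <= ts and (ts <= t if inclusive_upper else ts < t)
    ]


def max_hr_reject_then_dedupe(rows):
    """Drop implausible values first, then keep the first remaining row per timestamp."""
    plausible = [(ts, v) for ts, v in rows if HR_MIN_BPM <= v <= HR_MAX_BPM]
    first_seen = {}
    for ts, v in plausible:
        first_seen.setdefault(ts, v)
    return max(first_seen.values()) if first_seen else None


def max_hr_dedupe_then_reject(rows):
    """Keep-first dedup before rejection: an artifact row can win its timestamp."""
    first_seen = {}
    for ts, v in rows:
        first_seen.setdefault(ts, v)
    kept = [v for v in first_seen.values() if HR_MIN_BPM <= v <= HR_MAX_BPM]
    return max(kept) if kept else None


if __name__ == "__main__":
    t = datetime(2024, 1, 1, 12, 0)
    readings = [(t - timedelta(minutes=60), 80), (t - timedelta(minutes=30), 90), (t, 130)]
    # A reading stamped exactly at T is only seen by the inclusive window.
    print(max(window_values(readings, t)))
    print(max(window_values(readings, t, inclusive_upper=True)))

    # Two rows share a timestamp; the artifact (999) happens to arrive first.
    rows = [(t, 72), (t + timedelta(minutes=5), 999), (t + timedelta(minutes=5), 88)]
    print(max_hr_reject_then_dedupe(rows))
    print(max_hr_dedupe_then_reject(rows))
```

In this toy data the two window conventions disagree whenever a reading lands exactly on T, and the dedupe-first order discards the valid duplicate reading because the artifact row claimed the timestamp before rejection ran. The sketch does not reproduce the exact symptom in scenario 02; it only shows that the order of the two steps changes the aggregate.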
Rubric criteria
- Bayesian Health
- Clinical
- Agentic
- Generated