Evals for Poolside

Two evaluation packs covering adversarial robustness, safety gates, workflow quality, and operator-level checks for Poolside AI products.

About Poolside

Poolside is a frontier AI lab building foundation models purpose-built for software engineering, trained from scratch using Reinforcement Learning from Code Execution Feedback (RLCEF). Its models — Malibu (complex agentic tasks), Point (low-latency completion), and the open-weight Laguna family (M.1 scores 72.5% on SWE-bench Verified, Apache 2.0 on Hugging Face) — are paired with an enterprise platform: IDE assistants (VS Code, Visual Studio, JetBrains), the pool agentic CLI (ACP-compatible), a governance Console with agent-trajectory audit trails, and sandboxed execution. Poolside deploys entirely within customer boundaries — on-premises, in VPCs, or air-gapped — with full model-weight ownership and no data egress. Founded in early 2023 by Jason Warner (former GitHub CTO) and Eiso Kant.

Employees: ~200 (est.)
Industry: AI Foundation Models / Code Assistant
Headquarters: San Francisco, CA

Available eval packs for Poolside

2 packs ready to run.

Why eval Poolside AI

Poolside's AI features ship behind brand promises about accuracy, safety, and reliability. Buyers and integrators need to know those promises hold up under adversarial prompts, edge-case workflows, and the long tail of real customer inputs — not just the demo path.

The Corsac eval library for Poolside measures the four dimensions teams care about most when deploying AI coding agents:

  • Adversarial robustness — does the agent resist prompt injection, jailbreaks, and social-engineering attempts?
  • Workflow quality — does it complete the task buyers were shown in the demo, on inputs that don't look like the demo?
  • Safety gates — does it escalate or refuse when it should, and only then?
  • Operator quality — does it preserve analyst trust by surfacing the right context at the right time?

Every eval pack above is hand-authored against Poolside's public product surface and runnable in Corsac with your own data.