Evals for Sail Research
5 evaluation packs covering adversarial robustness, safety gates, workflow quality, and operator-level checks for Sail Research AI products.
About Sail Research
Sail Research is an infrastructure company purpose-built for long-horizon AI agents: agents that run for hours or days rather than dying between calls. Its platform pairs an inference stack rebuilt for throughput and efficiency with Sailboxes, which are persistent, stateful sandboxes billed only for the CPU, memory, and disk an agent actually uses. The inference stack offers OpenAI- and Anthropic-compatible Responses, Chat Completions, and Messages APIs; open models such as DeepSeek, GLM, Kimi, and Nemotron; LoRA fine-tune serving; and per-request "completion windows" that trade latency for up to 10x lower cost per token. Sail raised $80M in Seed and Series A funding (Sequoia, Kleiner Perkins) and was founded by Neil Movva and Samir Menon. [REQUIRES-VERIFICATION] Employee count, headquarters location, and compliance posture are unverified; no SOC 2 or HIPAA claims were found in the public corpus.
Employees
[REQUIRES-VERIFICATION]
Industry
AI Agent Infrastructure / Inference & Sandboxes
Headquarters
[REQUIRES-VERIFICATION]
Website
www.sailresearch.com
Available eval packs for Sail Research
5 packs ready to run.
Sail API Wire Compatibility Responses Chat Completions Messages
24 graded scenarios covering edge cases, failure modes, and quality checks.
Sail Authentication API Key Lifecycle
14 graded scenarios covering edge cases, failure modes, and quality checks.
Sail Authentication Credential Handling Sail API Key
15 graded scenarios covering edge cases, failure modes, and quality checks.
Sail Completion Window Tier Semantics Priority Flex Etc
23 graded scenarios covering edge cases, failure modes, and quality checks.
Sail Model Catalog Open Model Serving
5 graded scenarios covering edge cases, failure modes, and quality checks.
Why evaluate Sail Research AI
Sail Research's AI features ship behind brand promises about accuracy, safety, and reliability. Buyers and integrators need to know those promises hold up under adversarial prompts, edge-case workflows, and the long tail of real customer inputs — not just the demo path.
The Corsac eval library for Sail Research measures the four dimensions teams care about most when deploying AI agent infrastructure:
- Adversarial robustness: does the agent resist prompt injection, jailbreaks, and social-engineering attempts?
- Workflow quality: does it complete the task buyers were shown in the demo, on inputs that don't look like the demo?
- Safety gates: does it escalate or refuse when it should, and only then?
- Operator quality: does it preserve analyst trust by surfacing the right context at the right time?
Every eval pack above is hand-authored against Sail Research's public product surface and runnable in Corsac with your own data.