Evals for Applied Compute
3 evaluation packs covering adversarial robustness, safety gates, workflow quality, and operator-level checks for Applied Compute AI products.
About Applied Compute
Applied Compute builds 'Specific Intelligence' for enterprises: proprietary agents trained on a company's own data and workflows via reinforcement learning, that learn continuously in production and run in the customer's environment. Its platform trains custom models on customer data, builds automated graders (reward models) that encode a customer's quality bar, and runs an AC Context Engine (Remember/Refine/Retrieve) that distills tasks and procedures into a live Contextbase agents query at runtime — all from a single control plane, deployable serverless or fully in-VPC with RBAC and audit logs on every dispatch. Founded in early 2025 by ex-OpenAI researchers Rhythm Garg, Linden Li, and Yash Patil; early customers include DoorDash, Mercor, and Harvey.
Employees
~30 (est.)
Industry
Enterprise AI / Custom-Model & Agent Platform
Headquarters
San Francisco, CA
Website
www.appliedcompute.com
Available eval packs for Applied Compute
3 packs ready to run.
Agentic Router Model Selection Cost Performance
7 graded scenarios covering edge cases, failure modes, and quality checks.
Automated Grader Reward Model Calibration Reward Hacking
18 graded scenarios covering edge cases, failure modes, and quality checks.
Context Engine Retrieve Runtime Memory Provenance
Knowledge Retention
21 graded scenarios covering edge cases, failure modes, and quality checks.
Why eval Applied Compute AI
Applied Compute's AI features ship behind brand promises about accuracy, safety, and reliability. Buyers and integrators need to know those promises hold up under adversarial prompts, edge-case workflows, and the long tail of real customer inputs — not just the demo path.
The Corsac eval library for Applied Compute measures the four dimensions teams care about most when deploying enterprise AI agents:
- Adversarial robustness — does the agent resist prompt injection, jailbreaks, and social-engineering attempts?
- Workflow quality — does it complete the task buyers were shown in the demo, on inputs that don't look like the demo?
- Safety gates — does it escalate or refuse when it should, and only then?
- Operator quality — does it preserve analyst trust by surfacing the right context at the right time?
Every eval pack above is hand-authored against Applied Compute's public product surface and runnable in Corsac with your own data.