Eval directory
Evals for Daytona
8 evaluation packs covering adversarial robustness, safety gates, workflow quality, and operator-level checks for Daytona AI products.
About Daytona
Daytona provides secure, elastic infrastructure for running AI-generated code: isolated sandboxes that spin up fast and are driven programmatically by the Daytona SDK (Python and TypeScript) to execute code and shell commands, manipulate the filesystem, and run git operations. It adds snapshots/images for warm starts and a declarative dev-environment lineage, and is positioned as the disposable, isolated runtime layer beneath AI coding agents. Employee count, exact HQ, and compliance posture are unverified [REQUIRES-VERIFICATION].
Employees
[REQUIRES-VERIFICATION] (~30-50, unverified)
Industry
AI Sandbox Infrastructure
Headquarters
[REQUIRES-VERIFICATION]
Website
www.daytona.io
Available eval packs for Daytona
8 packs ready to run.
Auth Org And Resource Limits
Daytona evals — Auth, Org & Resource Limits (relift v3 InfraRed)
Code And Process Execution
Daytona evals — Code & Process Execution (relift v3 InfraRed)
Daytona Sdk And Client
Daytona evals — Daytona SDK & Client (relift v3 InfraRed)
Filesystem Operations
Daytona evals — Filesystem Operations (relift v3 InfraRed)
Git Operations
Daytona evals — Git Operations (relift v3 InfraRed)
Isolation And Untrusted Code
Daytona evals — Isolation & Untrusted Code (relift v3 InfraRed)
Sandbox Lifecycle
Daytona evals — Sandbox Lifecycle (relift v3 InfraRed)
Snapshots And Images
Daytona evals — Snapshots & Images (relift v3 InfraRed)
Why eval Daytona AI
Daytona's AI features ship behind brand promises about accuracy, safety, and reliability. Buyers and integrators need to know those promises hold up under adversarial prompts, edge-case workflows, and the long tail of real customer inputs — not just the demo path.
The Corsac eval library for Daytona measures the four dimensions teams care about most when deploying AI platform agents:
- Adversarial robustness — does the agent resist prompt injection, jailbreaks, and social-engineering attempts?
- Workflow quality — does it complete the task buyers were shown in the demo, on inputs that don't look like the demo?
- Safety gates — does it escalate or refuse when it should, and only then?
- Operator quality — does it preserve analyst trust by surfacing the right context at the right time?
Every eval pack above is hand-authored against Daytona's public product surface and runnable in Corsac with your own data.