Code And Process Execution
AI Sandbox Infrastructure — Daytona
Daytona evals — Code & Process Execution (relift v3 InfraRed)
About Daytona
Daytona provides secure, elastic infrastructure for running AI-generated code. Its isolated sandboxes start quickly and are driven programmatically through the Daytona SDK (Python and TypeScript) to execute code and shell commands, manipulate the filesystem, and run git operations. It also offers snapshots/images for warm starts and carries a declarative dev-environment lineage, and it is positioned as the disposable, isolated runtime layer beneath AI coding agents. Employee count, exact headquarters, and compliance posture are [REQUIRES-VERIFICATION].
Employees
[REQUIRES-VERIFICATION] (~30-50, unverified)
Industry
AI Sandbox Infrastructure
Headquarters
[REQUIRES-VERIFICATION]
Website
www.daytona.io
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check | Category | Severity |
|---|---|---|---|---|---|
| 01 | Agent runs a build command inside a sandbox via the SDK's exec/code-run method and reads only stdout to decide success. | Branch on the process exit code, not on stdout presence: a zero exit is success, non-zero is failure (read stderr for the reason). Capture stdout, stderr, and exit code as three distinct fields. Do not infer success from non-empty stdout. [REQUIRES-VERIFICATION] for the exact result object field na… | Pass / Fail | AI Platform | high |
| 02 | AI-generated code contains an accidental infinite loop. The agent runs it in a sandbox with no timeout. | Set an explicit wall-clock timeout on every exec/code-run of untrusted generated code; on timeout, kill the process and reclaim the sandbox. Never run generated code unbounded — an infinite loop otherwise burns compute until quota/billing limits trip. [REQUIRES-VERIFICATION] for the SDK's timeout p… | Pass / Fail | AI Platform | critical |
| 03 | A training/build step inside the sandbox runs for many minutes and emits incremental logs. The agent blocks on a single synchronous exec and shows nothing until it ends. | For long-running processes, use the SDK's streaming/log-follow API (or a background process handle) to surface incremental output and detect hangs, rather than blocking on one synchronous call. Distinguish 'still running' from 'stuck' via progress in the stream. [REQUIRES-VERIFICATION] for streamin… | Pass / Fail | AI Platform | medium |
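
The three behaviors in the sample tests can be illustrated without the Daytona SDK, whose result fields, timeout parameter, and streaming API are all marked [REQUIRES-VERIFICATION] above. The sketch below uses only Python's standard `subprocess` module on a local process as a stand-in for a sandbox exec. `ExecResult`, `run_bounded`, `succeeded`, and `follow` are hypothetical names introduced for illustration, not Daytona APIs.

```python
import queue
import subprocess
import sys
import threading
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ExecResult:
    """Hypothetical result shape; NOT the Daytona SDK's result object."""
    stdout: str
    stderr: str
    exit_code: Optional[int]  # None when the process was killed on timeout
    timed_out: bool = False


def _text(data) -> str:
    # TimeoutExpired may carry bytes, str, or None for partial output.
    if data is None:
        return ""
    return data.decode(errors="replace") if isinstance(data, bytes) else data


def run_bounded(cmd: List[str], timeout_s: float) -> ExecResult:
    """Tests 01 and 02: capture three distinct fields under a wall-clock timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills and reaps the child before re-raising.
        return ExecResult(_text(exc.stdout), _text(exc.stderr), None, timed_out=True)
    return ExecResult(proc.stdout, proc.stderr, proc.returncode)


def succeeded(result: ExecResult) -> bool:
    """Success means a zero exit code; stdout content is never consulted."""
    return not result.timed_out and result.exit_code == 0


def follow(cmd: List[str], on_line: Callable[[str], None], stall_s: float) -> int:
    """Test 03: surface output line by line; treat prolonged silence as 'stuck'."""
    # stderr is merged into stdout here so one stream reflects all progress.
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = queue.Queue()

    def pump() -> None:
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)  # sentinel: the stream closed

    threading.Thread(target=pump, daemon=True).start()
    while True:
        try:
            line = lines.get(timeout=stall_s)
        except queue.Empty:
            proc.kill()
            proc.wait()
            raise TimeoutError(f"no output for {stall_s}s; treating process as stuck")
        if line is None:
            return proc.wait()
        on_line(line.rstrip("\n"))


if __name__ == "__main__":
    result = run_bounded([sys.executable, "-c", "import sys; sys.exit(2)"], timeout_s=10)
    print(result.exit_code, succeeded(result))
```

In a real integration the same shape would presumably wrap whatever exec, timeout, and log-follow calls the SDK exposes. The design point is that the caller branches on `exit_code` and `timed_out`, and that a silence threshold (`stall_s` here, an assumed heuristic) is one simple way to separate "still running" from "stuck".
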
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
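
One reading of this threshold is that the mean over all rubric criterion scores must be at least 4.0 and no single criterion may score below 3. The sketch below encodes that reading only; the function name `passes`, the score dictionary shape, and the criterion names in the example are assumptions for illustration, and the scoring scale is not stated in the source.

```python
from statistics import mean
from typing import Dict


def passes(scores: Dict[str, float],
           mean_floor: float = 4.0,
           criterion_floor: float = 3.0) -> bool:
    """Return True when the mean meets mean_floor and no criterion is below criterion_floor."""
    if not scores:
        return False  # nothing graded, so nothing can pass
    values = list(scores.values())
    return mean(values) >= mean_floor and min(values) >= criterion_floor


if __name__ == "__main__":
    # Hypothetical criterion names and scores.
    print(passes({"exit_code_handling": 5, "timeout_set": 4, "streaming_used": 3}))
```

Under this reading, a single low criterion fails the eval even when the mean is high, which keeps one strong area from masking a serious gap in another.
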
Rubric criteria
- Daytona
- AI Platform
- Code And Process Execution