
Docker Engine Containers Runtime

Docker (Engine, Hub, Build Cloud, Scout, Desktop, Model Runner) · Docker

Container Platform — Docker

Evaluates Docker Engine, containers, and runtime behavior across 9 scenario-based test cases. An LLM judge grades each case against an expected-behavior rubric. Part of Corsac's Container Platform eval coverage.

About Docker

Docker is a container platform comprising Docker Engine, Docker Desktop, the Docker Hub registry, Docker Build Cloud for managed cloud builders, Docker Scout for image vulnerability scanning and supply-chain policy, Docker Compose for multi-container development, and Docker Model Runner for local LLM inference. Millions of developers and tens of thousands of enterprises ship containerized software with Docker.

Employees

~600

Industry

Developer Infrastructure

Headquarters

Palo Alto, CA

Website

www.docker.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Container reports 'Up' but the app inside has crashed and is no longer serving HTTP. The load balancer keeps sending traffic.	Add HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 CMD curl -fsS http://localhost:8080/health || exit 1 to the Dockerfile (or a healthcheck block in Compose). The 'docker ps' status then includes 'healthy' or 'unhealthy'. Load balancers and depends_on consumers (with condition: service_healthy) can gate on that status.	Pass / Fail · AI Platform · high
02	DB container uses '-v /home/user/db:/var/lib/postgres', a host path (bind mount). After a host upgrade, the host path changes and the DB data appears empty.	Use a named volume: '-v pgdata:/var/lib/postgres', created with 'docker volume create pgdata'. The Engine manages the storage path, so the data stays in place across host upgrades. Bind mounts remain appropriate for source-code development, where editing files from the host matters.	Pass / Fail · AI Platform · critical
03	Single-host dev uses the default bridge network; a multi-host swarm needs services to talk across hosts.	The default bridge connects containers on a single host only. For multi-host traffic, use an overlay network in Swarm mode (or an external orchestrator). Overlay networks use VXLAN; application traffic is encrypted only when the network is created with --opt encrypted. Don't pick the host network for multi-tenant workloads, because it bypasses network isolation.	Pass / Fail · AI Platform · high
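The health check in sample test 01 relies on curl being present in the image. As a minimal sketch for Python-based images that lack curl, the probe below does the same job with the standard library only. The URL comes from the sample test; the function name `probe`, the file name `healthcheck.py`, and the choice to bypass proxy settings are illustrative assumptions, not part of the eval.

```python
"""Health probe sketch: 0 means healthy, 1 means unhealthy."""
import http.client
import urllib.error
import urllib.request

# URL taken from sample test 01; adjust the port and path for your app.
DEFAULT_URL = "http://localhost:8080/health"

# Assumption: a local probe should ignore HTTP(S)_PROXY environment variables.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str = DEFAULT_URL, timeout: float = 5.0) -> int:
    """Return 0 if `url` answers with a 2xx status within `timeout`, else 1."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return 0 if 200 <= resp.status < 300 else 1
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Refused connection, timeout, HTTP 4xx/5xx, broken response,
        # or malformed URL: all count as unhealthy.
        return 1
```

To use it as a container health check, save it as a script (hypothetically `/app/healthcheck.py`), end the file with `import sys; sys.exit(probe())`, and point the Dockerfile at it with the same timing flags as above: `HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 CMD ["python", "/app/healthcheck.py"]`. Docker treats exit code 0 as healthy and 1 as unhealthy.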

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
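The pass rule can be read as two conditions checked together. The sketch below is one interpretation, assuming the mean is taken over the criterion scores of a single test case and that the judge scores each criterion on a numeric scale (the scale itself is not stated on this page). The criterion names in the example are hypothetical.

```python
from statistics import mean

MEAN_THRESHOLD = 4.0  # "mean >= 4.0" from the rubric text
SCORE_FLOOR = 3       # "no criterion below 3" from the rubric text


def case_passes(scores: dict[str, float]) -> bool:
    """True when the mean criterion score is >= 4.0 and no criterion is below 3."""
    if not scores:
        return False  # assumption: a case with no graded criteria does not pass
    values = list(scores.values())
    return mean(values) >= MEAN_THRESHOLD and min(values) >= SCORE_FLOOR


# Hypothetical criterion names, for illustration only.
print(case_passes({"correctness": 5, "completeness": 4, "safety": 3}))  # True
print(case_passes({"correctness": 5, "completeness": 5, "safety": 2}))  # False
```

The second example has the same mean as the first (4.0) but fails because one criterion falls below the floor, which is the situation the second condition exists to catch.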

Rubric criteria

Docker
AI Platform
Docker Engine Containers Runtime

Recommended for

Docker (Engine, Hub, Build Cloud, Scout, Desktop, Model Runner) · Docker customers

Works with

Docker

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.


Frequently asked questions

What does the Docker Engine Containers Runtime eval for Docker (Engine, Hub, Build Cloud, Scout, Desktop, Model Runner) test?

How is the Docker Engine Containers Runtime eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Docker Engine Containers Runtime pack for Docker (Engine, Hub, Build Cloud, Scout, Desktop, Model Runner) contains 9 test cases. Three sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Docker Engine Containers Runtime pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
