For Docker · AI Platform

Docker Scout

Docker (Engine, Hub, Build Cloud, Scout, Desktop, Model Runner) · Docker

Container Platform — Docker

Evaluates Docker's Docker Scout across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Container Platform eval coverage.

About Docker

Docker is the container platform, comprising Docker Engine, Docker Desktop, the Docker Hub registry, Docker Build Cloud for managed cloud builders, Docker Scout for image vulnerability scanning and supply-chain policy, Docker Compose for multi-container development, and Docker Model Runner for local LLM inference. Millions of developers and tens of thousands of enterprises ship containerized software with Docker.

Employees

~600

Industry

Developer Infrastructure

Headquarters

Palo Alto, CA

Website

www.docker.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Agent wants the vulnerability list for myorg/api:v1.2.3 to triage before a release.	Run 'docker scout cves myorg/api:v1.2.3'. Output lists CVEs by severity with affected package, fixed-in version, and base image. Add the '--only-severity critical,high' flag to filter. CVE counts and timing are point-in-time because vulnerability databases update daily.	Pass / Fail · AI Platform · high
02	Org has policy 'no-fixable-critical-cves' enabled. New image fails the policy with 3 fixable critical CVEs in libssl.	Run 'docker scout policy myorg/api:v1.2.3 --org myorg' to see policy results. A failing policy means the image violates the bar (e.g., fixable criticals exist). Resolve by bumping the base image (per 'scout recommendations') or updating affected packages and re-scanning. Do not merge until policy pas…	Pass / Fail · AI Platform · critical
03	SRE needs an SBOM for every deployed image for SOC 2 evidence.	Run 'docker scout sbom --format spdx myorg/api:v1.2.3 -o sbom.json' (or --format cyclonedx). Persist the SBOM alongside the deployment manifest with the image digest. Verify the SBOM contains direct and transitive dependencies. Format choice depends on the downstream consumer.	Pass / Fail · AI Platform · high
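
The three sample cases each hinge on one Docker Scout invocation. As a rough sketch, the commands quoted in the table can be assembled as argument lists and printed for review before an agent or script runs them. The function name `scout_commands` and the dictionary keys are hypothetical; the image tag, org name, and flags are taken from the sample rows above.

```python
import shlex

# Image and org are the placeholder values used in the sample test cases.
IMAGE = "myorg/api:v1.2.3"
ORG = "myorg"


def scout_commands(image: str, org: str) -> dict[str, list[str]]:
    """Build the Docker Scout commands from the sample cases (hypothetical helper).

    Nothing is executed here; the argument lists are only constructed.
    """
    return {
        # Case 01: CVE list filtered to critical and high severities.
        "cves": ["docker", "scout", "cves", "--only-severity", "critical,high", image],
        # Case 02: policy evaluation results for the organization.
        "policy": ["docker", "scout", "policy", image, "--org", org],
        # Case 03: SBOM in SPDX format written to sbom.json.
        "sbom": ["docker", "scout", "sbom", "--format", "spdx", image, "-o", "sbom.json"],
    }


if __name__ == "__main__":
    for name, argv in scout_commands(IMAGE, ORG).items():
        print(f"{name}: {shlex.join(argv)}")
```

To actually run one of these, the argument list could be passed to `subprocess.run(argv, check=True)` on a machine where the Docker CLI with Scout is installed and authenticated.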

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
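
One way to read the pass rule is sketched below. It assumes each rubric criterion receives a single numeric score from the judge (a 1 to 5 scale is an assumption, as are the function name `case_passes` and the criterion names); the thresholds 4.0 and 3 come from the rule above. This illustrates the arithmetic and is not Corsac's implementation.

```python
from statistics import mean


def case_passes(scores: dict[str, float],
                mean_floor: float = 4.0,
                criterion_floor: float = 3.0) -> bool:
    """Return True when the mean score is >= mean_floor and no criterion is below criterion_floor.

    `scores` maps a criterion name to the judge's score for that criterion
    (hypothetical shape; the real result format is not shown in the source).
    """
    if not scores:
        return False  # nothing was graded, so treat the case as not passing
    values = list(scores.values())
    return mean(values) >= mean_floor and min(values) >= criterion_floor


if __name__ == "__main__":
    # Hypothetical criterion names, for illustration only.
    print(case_passes({"correct_command": 5, "hedges_counts": 4, "filters_severity": 3}))
    print(case_passes({"correct_command": 5, "hedges_counts": 5, "filters_severity": 2}))
```

Under this reading, a single criterion scored below 3 fails the case even when the mean clears 4.0.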

Rubric criteria

Docker
AI Platform
Docker Scout

Recommended for

Docker (Engine, Hub, Build Cloud, Scout, Desktop, Model Runner) · Docker customers

Works with

Docker

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the Docker Scout eval for Docker (Engine, Hub, Build Cloud, Scout, Desktop, Model Runner) test?

Evaluates Docker's Docker Scout across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Container Platform eval coverage.

How is the Docker Scout eval scored?

The LLM judge grades each case against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Docker Scout pack for Docker (Engine, Hub, Build Cloud, Scout, Desktop, Model Runner) contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Docker Scout pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
