For Microsoft AutoGen · AI Platform

Autogen Code Execution

Evaluates Microsoft AutoGen's Code Execution (Docker / Local) across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Multi-agent Framework eval coverage.

About Microsoft AutoGen

Microsoft is a global technology company and a leading cloud and AI provider. Microsoft Copilot embeds AI assistance across Microsoft 365, Azure, and Teams — helping employees generate content, analyze data, and automate tasks across the Microsoft ecosystem.

Employees: ~221,000

Industry: Enterprise Software & Cloud

Headquarters: Redmond, WA

Website: microsoft.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check	Category	Severity
01	Operator constructs DockerCommandLineCodeExecutor inline in team.run() without calling executor.start() / stop().	DockerCommandLineCodeExecutor requires an explicit start() before first use and stop() at teardown (or use it as an async context manager). Without start(), the first execute_code_blocks() call either lazily starts a container per call (slow + leaky) or errors. Wrap with 'async with' so the contain…	Pass / Fail	AI Platform	high
02	Operator picks LocalCommandLineCodeExecutor because Docker is unavailable in their CI environment, then runs an AssistantAgent that processes untrusted user prompts.	Per docs, LocalCommandLineCodeExecutor is explicitly NOT sandboxed: it runs code directly on the host. Only use it for fully trusted inputs in trusted environments. For any untrusted prompt path, switch to DockerCommandLineCodeExecutor or another sandbox; do not deploy Local executor as a 'tempora…	Pass / Fail	AI Platform	critical
03	DockerCommandLineCodeExecutor is configured with work_dir=Path('/'). Operator expects 'the container has access to the workspace'.	work_dir bind-mounts a host directory into the container. Choose a narrowly-scoped directory (e.g. a per-session tempdir), never the host root. Validate the work_dir is created by the executor, owned by an unprivileged user, and cleaned up at stop().	Pass / Fail	AI Platform	critical
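Cases 01 and 03 can be illustrated together. The sketch below is a minimal example, not the eval's reference answer: `validate_work_dir` and `run_snippet` are hypothetical helper names, and the AutoGen imports assume the 0.4-style packages (`autogen-core`, `autogen-ext` with its Docker extra) are installed. It creates a per-session temporary directory, refuses a filesystem root, and uses `async with` so the container is started and stopped around the call.

```python
import tempfile
from pathlib import Path


def validate_work_dir(work_dir: Path) -> Path:
    """Hypothetical guard: reject work directories that are too broad to mount."""
    resolved = work_dir.resolve()
    # A path that is its own parent is a filesystem root ("/" or a drive root).
    if resolved.parent == resolved:
        raise ValueError(f"refusing to bind-mount filesystem root: {resolved}")
    if not resolved.is_dir():
        raise ValueError(f"work_dir is not an existing directory: {resolved}")
    return resolved


async def run_snippet(code: str) -> str:
    """Run one Python snippet in a Docker-backed executor and return its output."""
    # Imported lazily so the guard above is usable without AutoGen installed.
    # Assumption: AutoGen 0.4-style module layout.
    from autogen_core import CancellationToken
    from autogen_core.code_executor import CodeBlock
    from autogen_ext.code_executors.docker import DockerCommandLineCodeExecutor

    # Per-session scratch directory; removed when the outer block exits.
    with tempfile.TemporaryDirectory(prefix="autogen-session-") as tmp:
        work_dir = validate_work_dir(Path(tmp))
        # 'async with' starts the container on entry and stops it on exit,
        # so there is no separate start()/stop() call to forget.
        async with DockerCommandLineCodeExecutor(work_dir=work_dir) as executor:
            result = await executor.execute_code_blocks(
                code_blocks=[CodeBlock(language="python", code=code)],
                cancellation_token=CancellationToken(),
            )
        return result.output
```

With a Docker daemon available, `asyncio.run(run_snippet("print('hello')"))` would exercise the full path. If the container writes files as a different user than the host process, cleanup of the temporary directory may need extra care.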
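The expected behavior in case 02 amounts to a policy: never fall back to the local executor when the input is untrusted. A minimal sketch of that policy follows; `ExecutorKind` and `choose_executor_kind` are hypothetical names introduced for illustration and are not part of AutoGen.

```python
from enum import Enum


class ExecutorKind(Enum):
    DOCKER = "docker"
    LOCAL = "local"


def choose_executor_kind(*, docker_available: bool, input_trusted: bool) -> ExecutorKind:
    """Pick an executor kind, failing closed when no sandbox is available."""
    if docker_available:
        # Prefer the sandboxed executor whenever it can be used.
        return ExecutorKind.DOCKER
    if input_trusted:
        # Unsandboxed execution is acceptable only for fully trusted input.
        return ExecutorKind.LOCAL
    # Untrusted input and no sandbox: refuse rather than run on the host.
    raise RuntimeError(
        "no sandbox available for untrusted input; "
        "refusing to fall back to the local executor"
    )
```

Raising instead of silently downgrading means a CI environment without Docker fails loudly rather than running untrusted code on the host.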

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
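One reading of this rule is that a case passes when the mean of its rubric criterion scores is at least 4.0 and no individual criterion falls below 3. The sketch below implements that reading; the function name and the criterion names are hypothetical, and the page does not state how the judge aggregates scores internally.

```python
from statistics import mean


def case_passes(
    scores: dict[str, float],
    *,
    min_mean: float = 4.0,
    min_score: float = 3.0,
) -> bool:
    """Apply the threshold rule to one case's per-criterion scores."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    # Both conditions must hold: a high enough average and no weak criterion.
    return mean(values) >= min_mean and min(values) >= min_score


# Hypothetical criterion names, for illustration only.
print(case_passes({"lifecycle": 5, "safety": 4, "clarity": 3}))  # True: mean 4.0, min 3
print(case_passes({"lifecycle": 5, "safety": 5, "clarity": 2}))  # False: one criterion below 3
print(case_passes({"lifecycle": 4, "safety": 4, "clarity": 3}))  # False: mean below 4.0
```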

Rubric criteria

Autogen
AI Platform
Code Execution

Recommended for

Microsoft AutoGen customers

Works with

Microsoft AutoGen

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the Autogen Code Execution eval for Microsoft AutoGen test?

How is the Autogen Code Execution eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Autogen Code Execution pack for Microsoft AutoGen contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Autogen Code Execution pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
