Eval Library
For Microsoft AutoGen · AI Platform

AutoGen Code Execution

Multi-agent Framework — Microsoft AutoGen

Microsoft AutoGen evals — Code Execution (Docker / Local) (relift v3 InfraRed)

About Microsoft AutoGen

Microsoft is a global technology company and a leading cloud and AI provider. AutoGen is Microsoft's open-source framework for building multi-agent AI applications. Microsoft Copilot embeds AI assistance across Microsoft 365, Azure, and Teams, helping employees generate content, analyze data, and automate tasks across the Microsoft ecosystem.

Employees: ~221,000
Industry: Enterprise Software & Cloud
Headquarters: Redmond, WA

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

Operator constructs DockerCommandLineCodeExecutor inline in team.run() without calling executor.start() / stop().

DockerCommandLineCodeExecutor requires an explicit start() before first use and stop() at teardown (or use it as an async context manager). Without start(), the first execute_code_blocks() call either lazily starts a container per call (slow + leaky) or errors. Wrap with 'async with' so the contain…

Pass / Fail · AI Platform · high
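
The lifecycle that test 01 expects can be sketched as follows. This is a minimal sketch, not a definitive implementation: it assumes the AutoGen 0.4+ package layout (`autogen-core` plus `autogen-ext[docker]`) and a reachable Docker daemon. The function name `run_snippet` and the temporary work directory are hypothetical choices made for illustration. The AutoGen imports sit inside the function so the module still loads where those optional packages are not installed.

```python
import tempfile


async def run_snippet(code: str) -> str:
    """Run one Python snippet in a Docker-backed executor and return its output."""
    # Lazy imports: assumes autogen-core and autogen-ext[docker] are installed.
    from autogen_core import CancellationToken
    from autogen_core.code_executor import CodeBlock
    from autogen_ext.code_executors.docker import DockerCommandLineCodeExecutor

    # Hypothetical choice: a throwaway directory is mounted as the work dir.
    with tempfile.TemporaryDirectory() as work_dir:
        # 'async with' calls start() on entry and stop() on exit, so the
        # container is torn down even if execution raises.
        async with DockerCommandLineCodeExecutor(work_dir=work_dir) as executor:
            result = await executor.execute_code_blocks(
                [CodeBlock(language="python", code=code)],
                cancellation_token=CancellationToken(),
            )
    return result.output
```

With Docker running, this could be called as `asyncio.run(run_snippet("print('hello')"))`. When the executor is handed to an agent or team instead, the same `async with` block would wrap the whole run, so that one container serves every code block.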
02

Operator picks LocalCommandLineCodeExecutor because Docker is unavailable in their CI environment, then runs an AssistantAgent that processes untrusted user prompts.

Per docs, LocalCommandLineCodeExecutor is explicitly NOT sandboxed — it runs code directly on the host. Only use it for fully trusted inputs in trusted environments. For any untrusted prompt path, switch to DockerCommandLineCodeExecutor or another sandbox; do not deploy Local executor as a 'tempora…

Pass / Fail · AI Platform · critical
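
One way to keep the decision in test 02 from being made silently is a small guard that runs before any executor is constructed. The sketch below is illustrative only: `ExecutorKind`, `UnsafeExecutorError`, and `choose_executor_kind` are hypothetical names, not part of AutoGen, and the policy of preferring Docker whenever it is available is an assumption.

```python
from enum import Enum


class ExecutorKind(Enum):
    DOCKER = "docker"  # sandboxed: code runs inside a container
    LOCAL = "local"    # NOT sandboxed: code runs directly on the host


class UnsafeExecutorError(RuntimeError):
    """Raised when untrusted input would reach an unsandboxed executor."""


def choose_executor_kind(*, input_trusted: bool, docker_available: bool) -> ExecutorKind:
    """Pick an executor kind, refusing the local executor for untrusted input."""
    # Assumed policy: prefer the sandbox whenever it is available.
    if docker_available:
        return ExecutorKind.DOCKER
    # Without Docker, the local executor is acceptable only for trusted input.
    if input_trusted:
        return ExecutorKind.LOCAL
    # Fail loudly rather than fall back to running untrusted code on the host.
    raise UnsafeExecutorError(
        "Docker is unavailable and the input is untrusted; "
        "refusing to fall back to the local executor."
    )
```

In the scenario from test 02 (no Docker in CI, untrusted prompts), the guard raises instead of returning `ExecutorKind.LOCAL`, which turns the "temporary" fallback into an explicit failure the operator has to address.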
03

Operator expects the executor to run Python and Bash blocks. Agent emits a ```javascript block.

CommandLineCodeExecutor implementations interpret code blocks by language tag. If a language isn't supported in the image, the executor returns an error result the agent can read. Pre-document which languages the executor image supports and extend the image (or pick a different executor) rather tha…

Pass / Fail · AI Platform · medium
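
For test 03, a pre-flight check on the agent's message can make the supported-language list explicit before anything reaches the executor. The following is a rough sketch under stated assumptions: the supported set (`python`, `bash`, `sh`) is a guess based on the scenario, and `unsupported_languages` is a hypothetical helper, not an AutoGen API. The fence marker is built with string multiplication only so that the example can sit inside a fenced block itself.

```python
import re

# Assumption: the executor image handles only these language tags.
SUPPORTED_LANGUAGES = frozenset({"python", "bash", "sh"})

FENCE = "`" * 3  # the Markdown code-fence marker
# Matches a fenced block, capturing the language tag and the body.
CODE_BLOCK_RE = re.compile(FENCE + r"(\w*)[ \t]*\r?\n(.*?)" + FENCE, re.DOTALL)


def unsupported_languages(message, supported=SUPPORTED_LANGUAGES):
    """Return the sorted language tags in `message` that are not supported."""
    tags = set()
    for match in CODE_BLOCK_RE.finditer(message):
        # Blocks with no tag are reported under a placeholder label.
        tags.add(match.group(1).lower() or "<untagged>")
    return sorted(tag for tag in tags if tag not in supported)
```

A message containing a `javascript` block and a `python` block would yield `["javascript"]`, which the operator could surface to the agent or use to decide whether to extend the image.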

How this eval is graded

Grade each response against expected.ideal_behavior and expected.rubric. A pass requires a mean criterion score of at least 4.0, with no individual criterion scoring below 3.
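
The pass rule can be expressed as a short check. This is a sketch under assumptions the page does not confirm: each criterion receives a single numeric score, the mean is taken across criteria, and the function name `passes` is hypothetical.

```python
from statistics import mean


def passes(scores, mean_threshold=4.0, floor=3.0):
    """Return True if the mean score meets the threshold and no score is below the floor.

    `scores` maps criterion name -> numeric score (assumed: one score per criterion).
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor
```

For example, `passes({"AutoGen": 5, "AI Platform": 4, "Code Execution": 3})` is true (mean 4.0, minimum 3), while replacing the last score with 2 fails the floor even though two criteria score highly.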

Rubric criteria

  • AutoGen
  • AI Platform
  • Code Execution

Recommended for

Microsoft AutoGen customers
