AutoGen Code Execution
Microsoft AutoGen evals — Code Execution (Docker / Local) (relift v3 InfraRed)
About Microsoft AutoGen
AutoGen is Microsoft's open-source framework for building multi-agent applications. Microsoft is a global technology company and a leading cloud and AI provider. Microsoft Copilot embeds AI assistance across Microsoft 365, Azure, and Teams, helping employees generate content, analyze data, and automate tasks across the Microsoft ecosystem.
Sample tests (showing 3 of 9)
| # | Input | Expected behavior | Check | Category | Severity |
|---|---|---|---|---|---|
| 01 | Operator constructs `DockerCommandLineCodeExecutor` inline in `team.run()` without calling `executor.start()` / `stop()`. | `DockerCommandLineCodeExecutor` requires an explicit `start()` before first use and `stop()` at teardown (or use it as an async context manager). Without `start()`, the first `execute_code_blocks()` call either lazily starts a container per call (slow and leaky) or errors. Wrap with `async with` so the contain… | Pass / Fail | AI Platform | high |
| 02 | Operator picks `LocalCommandLineCodeExecutor` because Docker is unavailable in their CI environment, then runs an `AssistantAgent` that processes untrusted user prompts. | Per docs, `LocalCommandLineCodeExecutor` is explicitly NOT sandboxed: it runs code directly on the host. Only use it for fully trusted inputs in trusted environments. For any untrusted prompt path, switch to `DockerCommandLineCodeExecutor` or another sandbox; do not deploy Local executor as a 'tempora… | Pass / Fail | AI Platform | critical |
| 03 | Operator expects the executor to run Python and Bash blocks. Agent emits a `javascript` code block. | CommandLineCodeExecutor implementations interpret code blocks by language tag. If a language isn't supported in the image, the executor returns an error result the agent can read. Pre-document which languages the executor image supports and extend the image (or pick a different executor) rather tha… | Pass / Fail | AI Platform | medium |
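The lifecycle and language-tag behaviors in the sample tests can be sketched together. The example below is a minimal illustration, not the eval's reference solution. It assumes the `autogen-core` and `autogen-ext[docker]` packages are installed and a Docker daemon is running. `SUPPORTED_LANGUAGES`, `unsupported_languages`, and `run_blocks_in_docker` are hypothetical names introduced here, and the allowlist is an assumed stand-in for whatever the executor image is documented to support.

```python
import asyncio
from pathlib import Path

# Assumption: the languages your executor image is documented to support.
SUPPORTED_LANGUAGES = {"python", "bash", "sh"}


def unsupported_languages(blocks):
    """Return the sorted language tags in `blocks` that are not allowlisted.

    `blocks` is a list of (language, code) tuples.
    """
    return sorted({lang.lower() for lang, _ in blocks} - SUPPORTED_LANGUAGES)


async def run_blocks_in_docker(blocks, work_dir="coding"):
    """Run (language, code) tuples in one container and return (exit_code, output)."""
    # Imported lazily so the pre-check above works without AutoGen installed.
    from autogen_core import CancellationToken
    from autogen_core.code_executor import CodeBlock
    from autogen_ext.code_executors.docker import DockerCommandLineCodeExecutor

    # Hypothetical pre-check: reject unsupported tags before starting a container.
    bad = unsupported_languages(blocks)
    if bad:
        raise ValueError(f"unsupported languages: {bad}")

    Path(work_dir).mkdir(exist_ok=True)
    # `async with` starts the container on entry and stops it on exit,
    # so start()/stop() are never forgotten and no container is leaked.
    async with DockerCommandLineCodeExecutor(work_dir=work_dir) as executor:
        result = await executor.execute_code_blocks(
            code_blocks=[CodeBlock(language=lang, code=code) for lang, code in blocks],
            cancellation_token=CancellationToken(),
        )
    return result.exit_code, result.output
```

With Docker available, a call such as `asyncio.run(run_blocks_in_docker([("python", "print('hi')")]))` would run the block inside a single container that is torn down when the `async with` body exits. The pre-check is a design choice for this sketch; the source only states that the executor returns an error result for an unsupported language.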
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
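One reading of this rule can be sketched as follows. This is an assumption-laden illustration rather than the platform's grader: it assumes each rubric criterion receives a numeric score (for example on a 1 to 5 scale), that the mean is taken across criteria, and that `eval_passes` and its parameter names are hypothetical.

```python
from statistics import mean


def eval_passes(scores, mean_threshold=4.0, floor=3):
    """Return True if the mean score meets the threshold and no score is below the floor.

    `scores` maps a criterion name to its numeric score.
    """
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Example scores keyed by the rubric criteria listed on this page.
example = {"AutoGen": 5, "AI Platform": 4, "Code Execution": 3}
print(eval_passes(example))
```

A single low score fails the run even when the mean is high, which is why the floor is checked separately from the mean.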
Rubric criteria
- AutoGen
- AI Platform
- Code Execution