AutoGen Agent Definitions

Evaluates Microsoft AutoGen's Agent Definitions (AssistantAgent / UserProxyAgent) across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Multi-agent Framework eval coverage.

About Microsoft AutoGen

Microsoft is a global technology company and a leading cloud and AI provider. Microsoft Copilot embeds AI assistance across Microsoft 365, Azure, and Teams — helping employees generate content, analyze data, and automate tasks across the Microsoft ecosystem.

Employees: ~221,000

Industry: Enterprise Software & Cloud

Headquarters: Redmond, WA

Website: microsoft.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Operator builds an AssistantAgent and passes the task instructions only via the first user message in team.run(task=...), leaving system_message at its default.	Put persistent role/persona/policy instructions in AssistantAgent(system_message=...) so they appear as the system turn on every model_client call. Use the task argument only for the per-run user instruction. Without system_message, the agent loses role anchoring on multi-turn loops because the mod…	Pass / Fail · AI Platform · high
02	Two AssistantAgents share a single OpenAIChatCompletionClient instance and run inside a RoundRobinGroupChat. One agent calls await model_client.close() at end of its first turn.	Either construct a model_client per agent OR treat the shared client as team-scoped and close it ONCE at team teardown, not from within an agent. Closing mid-run kills the other agent's next call. Document the lifecycle (per-agent vs per-team) in the integration.	Pass / Fail · AI Platform · high
03	AssistantAgent is configured with reflect_on_tool_use=True. The model emits a tool call, the tool returns a 4KB JSON blob, and the operator expects the next message to be a natural-language summary.	With reflect_on_tool_use=True, after the tool returns AutoGen re-invokes the model with the tool result appended to the context so the model produces a natural-language reply. Verify the second model call actually happens (extra latency + tokens) and do not assume the raw tool output IS the assista…	Pass / Fail · AI Platform · high
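
Sample case 02 comes down to who owns a shared client. The sketch below illustrates the team-scoped option using plain asyncio stand-ins rather than AutoGen itself: `StubModelClient`, `StubAgent`, and `run_round_robin` are hypothetical names introduced for illustration, and the stub only mimics the "calls fail after close" behavior the test case describes. The point it tries to show is that agents use the client but never close it, while the team runner closes it exactly once at teardown.

```python
import asyncio


class StubModelClient:
    """Hypothetical stand-in for a shared model client (not AutoGen's API)."""

    def __init__(self) -> None:
        self.closed = False
        self.close_calls = 0

    async def create(self, prompt: str) -> str:
        # Mimic the failure mode from the test case: calls after close() fail.
        if self.closed:
            raise RuntimeError("client is closed")
        return f"reply to: {prompt}"

    async def close(self) -> None:
        self.closed = True
        self.close_calls += 1


class StubAgent:
    """Hypothetical agent that borrows the client and never closes it."""

    def __init__(self, name: str, client: StubModelClient) -> None:
        self.name = name
        self.client = client

    async def take_turn(self, message: str) -> str:
        return await self.client.create(f"[{self.name}] {message}")


async def run_round_robin(agents, client, task, max_turns=4):
    """Alternate agents; the team runner owns the client and closes it once."""
    transcript = []
    message = task
    try:
        for turn in range(max_turns):
            agent = agents[turn % len(agents)]
            message = await agent.take_turn(message)
            transcript.append(message)
    finally:
        # Team-scoped teardown: the only place close() is called.
        await client.close()
    return transcript


def demo():
    client = StubModelClient()
    agents = [StubAgent("writer", client), StubAgent("reviewer", client)]
    transcript = asyncio.run(run_round_robin(agents, client, "draft a summary"))
    return transcript, client
```

In a real integration the same shape would apply to the actual client and team objects; the per-agent alternative is to construct one client per agent and let each owner close its own.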
How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
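
As a rough illustration of the threshold rule above, the following sketch checks a set of judge scores. It assumes one reading of the rule: the mean is taken across a case's criterion scores, and every individual criterion must also score at least 3. The function name `case_passes` and the criterion names in the usage note are hypothetical, and the score scale is not stated in the source.

```python
from statistics import mean


def case_passes(
    scores: dict[str, float],
    mean_threshold: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Return True when the mean score meets the threshold and no
    single criterion falls below the floor (assumed interpretation)."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor
```

For example, `case_passes({"role_anchoring": 5, "lifecycle": 4, "clarity": 3})` passes under this reading (mean 4.0, lowest score 3), while scores of 5, 5, and 2 fail on the floor even though the mean is still 4.0.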

Rubric criteria

AutoGen
AI Platform
Agent Definitions

Recommended for

Microsoft AutoGen customers

Works with

Microsoft AutoGen

Related evals

AI Platform · Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform · Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform · Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the AutoGen Agent Definitions eval for Microsoft AutoGen test?

How is the AutoGen Agent Definitions eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The AutoGen Agent Definitions pack for Microsoft AutoGen contains 9 test cases. Three sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the AutoGen Agent Definitions pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.