AutoGen Agent Definitions
Microsoft AutoGen evals — Agent Definitions (AssistantAgent / UserProxyAgent) (relift v3 InfraRed)
About Microsoft AutoGen
Microsoft is a global technology company and a leading cloud and AI provider. Microsoft Copilot embeds AI assistance across Microsoft 365, Azure, and Teams — helping employees generate content, analyze data, and automate tasks across the Microsoft ecosystem.
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Operator builds an AssistantAgent and passes the task instructions only via the first user message in team.run(task=...), leaving system_message at its default. | Put persistent role/persona/policy instructions in AssistantAgent(system_message=...) so they appear as the system turn on every model_client call. Use the task argument only for the per-run user instruction. Without system_message, the agent loses role anchoring on multi-turn loops because the mod… | Pass / Fail · AI Platform · high |
| 02 | UserProxyAgent runs with default max_consecutive_auto_reply and the AssistantAgent is in a tight tool-call/error/retry loop that never produces a terminating message. | Set max_consecutive_auto_reply to a concrete bound (e.g. 10) so the proxy stops auto-replying past the cap and surfaces control. Pair with a team-level termination condition (MaxMessageTermination) so neither agent runs forever even if the cap is raised. | Pass / Fail · AI Platform · critical |
| 03 | Operator builds tools list once and mutates it (tools.append(new_tool)) between team.run() invocations, expecting the AssistantAgent to pick up the new tool. | AssistantAgent reads its tools at construction time; the live tools= list is not automatically refreshed mid-session. To add or remove tools, construct a new AssistantAgent (or use the documented dynamic-tool path). Mutating the input list after construction yields undefined behavior across version… | Pass / Fail · AI Platform · medium |
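The two-bound pattern in sample test 02 can be illustrated without AutoGen installed. The sketch below is a plain-Python simulation, not AutoGen code: `run_bounded_loop`, `LoopResult`, `max_auto_replies`, and `max_messages` are hypothetical names that stand in for the proxy-level cap and the team-level termination condition. It assumes an assistant that never emits a terminating message, and shows that whichever bound is reached first ends the exchange.

```python
from dataclasses import dataclass


@dataclass
class LoopResult:
    messages: int      # total messages exchanged before stopping
    stop_reason: str   # which bound ended the loop


def run_bounded_loop(max_auto_replies: int, max_messages: int) -> LoopResult:
    """Simulate an assistant/proxy retry loop that never terminates on its own.

    Hypothetical stand-in: max_auto_replies plays the role of the proxy-level
    cap, max_messages the role of a team-level message limit.
    """
    messages = 0
    auto_replies = 0
    while True:
        # Team-level bound, checked before the assistant speaks.
        if messages >= max_messages:
            return LoopResult(messages, "max_messages")
        messages += 1  # assistant emits a failing tool call

        # Proxy-level bound: stop auto-replying and surface control.
        if auto_replies >= max_auto_replies:
            return LoopResult(messages, "max_auto_replies")

        # Team-level bound, checked again before the proxy replies.
        if messages >= max_messages:
            return LoopResult(messages, "max_messages")
        messages += 1  # proxy auto-replies with the error
        auto_replies += 1


if __name__ == "__main__":
    # Proxy cap of 10 is reached first.
    print(run_bounded_loop(max_auto_replies=10, max_messages=50))
    # Proxy cap raised very high: the team-level limit still stops the run.
    print(run_bounded_loop(max_auto_replies=1000, max_messages=50))
```

Keeping the two limits independent means that raising one of them later cannot remove the only guard against a runaway loop.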
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
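One possible reading of the pass rule above is sketched here: the mean of the criterion scores must be at least 4.0, and no individual criterion may score below 3. The function name `eval_passes`, the dictionary keys, and the idea that each criterion receives a single numeric score are assumptions made for illustration; the platform's actual grader may aggregate differently.

```python
from statistics import mean


def eval_passes(scores: dict[str, float],
                mean_threshold: float = 4.0,
                floor: float = 3.0) -> bool:
    """Return True if the scores satisfy both the mean and the floor rule.

    Assumed interpretation: `scores` maps each rubric criterion to one
    numeric score; thresholds default to the values quoted in the text.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    # Both conditions must hold: high enough on average, and no weak criterion.
    return mean(values) >= mean_threshold and min(values) >= floor


if __name__ == "__main__":
    ok = {"autogen": 5, "ai_platform": 4, "agent_definitions": 3}
    weak = {"autogen": 5, "ai_platform": 5, "agent_definitions": 2}
    print(eval_passes(ok))    # mean is 4.0 and the minimum is 3
    print(eval_passes(weak))  # mean is 4.0 but one criterion is below 3
```

The floor check matters because a mean alone can hide a single failing criterion behind high scores elsewhere.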
Rubric criteria
- AutoGen
- AI Platform
- Agent Definitions