For Microsoft AutoGen · AI Platform

AutoGen Agent Definitions

Multi-agent Framework — Microsoft AutoGen

Microsoft AutoGen evals — Agent Definitions (AssistantAgent / UserProxyAgent) (relift v3 InfraRed)

About Microsoft AutoGen

Microsoft is a global technology company and a leading cloud and AI provider. Microsoft Copilot embeds AI assistance across Microsoft 365, Azure, and Teams — helping employees generate content, analyze data, and automate tasks across the Microsoft ecosystem.

Employees: ~221,000

Industry: Enterprise Software & Cloud

Headquarters: Redmond, WA

Sample tests · showing 3 of 9

Columns: # · Input · Expected behavior · Check

01

Input: Operator builds an AssistantAgent and passes the task instructions only via the first user message in team.run(task=...), leaving system_message at its default.

Expected behavior: Put persistent role/persona/policy instructions in AssistantAgent(system_message=...) so they appear as the system turn on every model_client call. Use the task argument only for the per-run user instruction. Without system_message, the agent loses role anchoring on multi-turn loops because the mod…

Check: Pass / Fail · AI Platform · high

02

Input: UserProxyAgent runs with default max_consecutive_auto_reply and the AssistantAgent is in a tight tool-call/error/retry loop that never produces a terminating message.

Expected behavior: Set max_consecutive_auto_reply to a concrete bound (e.g. 10) so the proxy stops auto-replying past the cap and surfaces control. Pair with a team-level termination condition (MaxMessageTermination) so neither agent runs forever even if the cap is raised.

Check: Pass / Fail · AI Platform · critical

03

Input: Operator builds tools list once and mutates it (tools.append(new_tool)) between team.run() invocations, expecting the AssistantAgent to pick up the new tool.

Expected behavior: AssistantAgent reads its tools at construction time; the live tools= list is not automatically refreshed mid-session. To add or remove tools, construct a new AssistantAgent (or use the documented dynamic-tool path). Mutating the input list after construction yields undefined behavior across version…

Check: Pass / Fail · AI Platform · medium
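
The two-layer bound described in test 02 can be pictured with a toy simulation. This is a minimal sketch in plain Python and does not call AutoGen: the function `run_bounded_loop`, the `LoopResult` type, and the stop-reason strings are hypothetical names invented for illustration. The parameter names only mirror the settings the test mentions. It assumes an assistant that never emits a terminating message, and shows that whichever cap is reached first ends the run.

```python
from dataclasses import dataclass


@dataclass
class LoopResult:
    messages: int      # total messages exchanged, including the initial task
    stop_reason: str   # "auto_reply_cap" or "max_messages" (hypothetical labels)


def run_bounded_loop(max_consecutive_auto_reply: int, max_messages: int) -> LoopResult:
    """Simulate an assistant stuck in a tool-call/error/retry loop.

    Toy model only: one assistant turn and one proxy auto-reply per cycle.
    The proxy-level cap and the team-level message cap are checked
    independently, so either one can stop the run.
    """
    messages = 1       # the initial task message
    auto_replies = 0
    while True:
        # Team-level bound (the role MaxMessageTermination plays in the test).
        if messages >= max_messages:
            return LoopResult(messages, "max_messages")
        messages += 1  # assistant turn: a tool call that errors again
        if messages >= max_messages:
            return LoopResult(messages, "max_messages")
        # Proxy-level bound: stop auto-replying and surface control.
        if auto_replies >= max_consecutive_auto_reply:
            return LoopResult(messages, "auto_reply_cap")
        auto_replies += 1
        messages += 1  # proxy auto-reply feeding the error back


if __name__ == "__main__":
    print(run_bounded_loop(max_consecutive_auto_reply=10, max_messages=100))
    print(run_bounded_loop(max_consecutive_auto_reply=1000, max_messages=25))
```

In the second call the proxy cap has been raised far beyond any useful value, yet the run still stops, which is the reason the test asks for both bounds rather than one.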

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
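
The pass rule above can be sketched as a small function. This is one reading of the rule, not the platform's grader: it assumes each rubric criterion receives a single numeric score, and that the mean is taken across criteria. The function name `eval_passes` and the dictionary shape are hypothetical.

```python
from statistics import mean


def eval_passes(scores: dict[str, float],
                min_mean: float = 4.0,
                min_score: float = 3.0) -> bool:
    """Return True when the mean score is at least min_mean
    and no individual criterion scores below min_score.

    Assumption: one score per criterion; thresholds taken from the
    grading text (mean >= 4.0, no criterion below 3).
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= min_mean and min(values) >= min_score


if __name__ == "__main__":
    example = {"AutoGen": 5, "AI Platform": 4, "Agent Definitions": 3}
    print(eval_passes(example))
```

Note that the two conditions are independent: scores of 5, 5, and 2 average 4.0 but still fail because one criterion falls below 3.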

Rubric criteria

  • AutoGen
  • AI Platform
  • Agent Definitions

Recommended for

Microsoft AutoGen customers
