Eval Library
For Microsoft AutoGen · AI Platform

AutoGen Multi-Agent Teams

Multi-agent Framework — Microsoft AutoGen

Microsoft AutoGen evals — Multi-agent Teams (RoundRobin / Selector / Swarm / MagenticOne) (relift v3 InfraRed)

About Microsoft AutoGen

Microsoft is a global technology company and a leading cloud and AI provider. Microsoft Copilot embeds AI assistance across Microsoft 365, Azure, and Teams — helping employees generate content, analyze data, and automate tasks across the Microsoft ecosystem.

Employees

~221,000

Industry

Enterprise Software & Cloud

Headquarters

Redmond, WA

Sample tests · showing 3 of 9

# · Input · Expected behavior · Check
01

RoundRobinGroupChat with participants=[planner, coder, reviewer]. Operator expects planner→coder→reviewer→planner alternation but observes planner speaking twice in a row when coder emits an empty response.

RoundRobinGroupChat advances strictly through participants in declared order — empty agent responses still count as that agent's turn. If a 'speak again' is needed, either re-prompt within the same turn or switch to SelectorGroupChat with allow_repeated_speaker. Do not rely on side effects to skip …

Pass / Fail · AI Platform · medium
02

SelectorGroupChat with selector_prompt that references participant names but each participant's description= is left blank. Selector LLM picks the same agent every turn.

Selector picks the next speaker using participant.description — populate each agent's description with a one-sentence capability statement (e.g. 'researcher: web search and fact extraction'). Without descriptions the selector has nothing to disambiguate on and degenerates to a single speaker.

Pass / Fail · AI Platform · high
03

MagenticOneGroupChat run on an open-ended task ('research the most recent NIST AI RMF revision and write a summary'). Operator wraps it with no termination condition and a single model_client.

MagenticOneGroupChat must be paired with explicit max_turns or a termination condition — the Orchestrator's plan/ledger loop is designed to run until success OR until externally bounded. Configure the orchestrator and worker agents (e.g. WebSurfer, FileSurfer, Coder, ComputerTerminal) per docs; do …

Pass / Fail · AI Platform · critical
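
The three expected behaviors above can be sketched as a small pre-flight check. This is a minimal illustration in plain Python, not AutoGen's implementation: AgentSpec, TeamSpec, round_robin_speaker, and lint_team are hypothetical names introduced here, and the real RoundRobinGroupChat, SelectorGroupChat, and MagenticOneGroupChat classes are neither imported nor called.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentSpec:
    """Hypothetical stand-in for an agent: a name and a description."""
    name: str
    description: str = ""


@dataclass
class TeamSpec:
    """Hypothetical stand-in for a team configuration (not an AutoGen class)."""
    kind: str  # "round_robin", "selector", or "magentic_one"
    participants: List[AgentSpec] = field(default_factory=list)
    max_turns: Optional[int] = None
    has_termination_condition: bool = False


def round_robin_speaker(spec: TeamSpec, turn_index: int) -> str:
    """Speaker for a 0-based turn under strict declared-order rotation.

    The turn index advances on every turn, so an empty response still
    consumes that agent's slot (test 01).
    """
    return spec.participants[turn_index % len(spec.participants)].name


def lint_team(spec: TeamSpec) -> List[str]:
    """Return warnings for the misconfigurations described in tests 02 and 03."""
    warnings = []
    if spec.kind == "selector":
        # Test 02: a selector needs a description per agent to tell them apart.
        blank = [a.name for a in spec.participants if not a.description.strip()]
        if blank:
            warnings.append("blank description: " + ", ".join(blank))
    if spec.kind == "magentic_one":
        # Test 03: the run must be bounded by max_turns or a termination condition.
        if spec.max_turns is None and not spec.has_termination_condition:
            warnings.append("unbounded run: set max_turns or a termination condition")
    return warnings


if __name__ == "__main__":
    team = TeamSpec(
        "round_robin",
        [AgentSpec("planner"), AgentSpec("coder"), AgentSpec("reviewer")],
    )
    # Prints ['planner', 'coder', 'reviewer', 'planner']
    print([round_robin_speaker(team, i) for i in range(4)])
```

A check like this would run before the team is constructed, so a blank description or an unbounded run is reported as a configuration problem and not discovered mid-conversation.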

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
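
One possible reading of the pass rule, sketched in Python. The function name passes, its parameters, and the idea that each rubric criterion receives a single numeric score are assumptions made for illustration; the page does not state the scoring scale or how scores are collected.

```python
from statistics import mean
from typing import Mapping


def passes(
    scores: Mapping[str, float],
    mean_floor: float = 4.0,
    criterion_floor: float = 3.0,
) -> bool:
    """Return True when the mean score is >= mean_floor and no score is below criterion_floor.

    Assumes one numeric score per rubric criterion (hypothetical layout).
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_floor and min(values) >= criterion_floor


if __name__ == "__main__":
    # Mean is 4.0 and the lowest score is 3, so both conditions hold.
    print(passes({"AutoGen": 5, "AI Platform": 4, "Multi Agent Teams": 3}))  # True
    # Mean is 4.0, but one criterion falls below 3.
    print(passes({"AutoGen": 5, "AI Platform": 5, "Multi Agent Teams": 2}))  # False
```

The second case shows why both conditions matter: a high average alone does not compensate for a single criterion scoring below 3.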

Rubric criteria

  • AutoGen
  • AI Platform
  • Multi Agent Teams

Recommended for

AutoGen · Microsoft AutoGen customers
