
Tool Routing And Multimodal V1
Tool and Multimodal Router · Gemini
Evaluates whether the model chooses the right tool, handles image and document context, and calibrates its confidence when visual evidence is partial.
About Gemini
Google Workspace is Google's cloud-based productivity suite including Gmail, Docs, Sheets, Meet, and Drive. Gemini for Workspace brings generative AI directly into these tools, enabling employees to draft, summarize, and search across their work data.
Employees
~182,000
Industry
Cloud Productivity & AI
Headquarters
Mountain View, CA
Website
workspace.google.com
Sample tests · showing 3 of 4
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | choose correct tool · Gemini / Tool and Multimodal Router: The user wants to find the source file and then schedule the review. Decide whether to use Drive search or Calendar handling first, and explain why. Respond for persona `operations_analyst`. | outcome_label: route_tool_choice · confidence: high · required_actions: choose the correct first tool, justify the routing decision… | Pass / Fail |
| 02 | multimodal chart reading · Gemini / Tool and Multimodal Router: A screenshot shows a chart with an unclear legend. Interpret only what is visible and clearly state any uncertainty. Respond for persona `data_analyst`. | outcome_label: analyze_with_uncertainty · confidence: medium · required_actions: state the visible trend, label uncertainty… | Pass / Fail |
| 03 | document plus image · Gemini / Tool and Multimodal Router: A PDF and a small image together provide enough context to answer a straightforward question. Use the minimal necessary reasoning and keep the answer grounded. Respond for persona `customer_su… | outcome_label: grounded_multimodal_answer · confidence: high · required_actions: combine the document and image evidence, answer directly… | Pass / Fail |
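
The expected-behavior column above pairs each test with an `outcome_label` and a `confidence` level. As a rough sketch of how those two fields could be checked, assuming an exact-match comparison (the page does not say how the check is actually performed), the snippet below encodes the three visible rows. The `Expected` class and `check_labels` function are hypothetical names, and `required_actions` is left out because the source truncates those lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expected:
    """Expected fields for one sample test (hypothetical container)."""
    outcome_label: str
    confidence: str


# Values copied from the three visible sample tests; the fourth is not shown.
EXPECTED = {
    "01": Expected("route_tool_choice", "high"),
    "02": Expected("analyze_with_uncertainty", "medium"),
    "03": Expected("grounded_multimodal_answer", "high"),
}


def check_labels(test_id: str, outcome_label: str, confidence: str) -> str:
    """Return "Pass" when both fields match the expected values exactly.

    Exact matching is an assumption made for this sketch.
    """
    expected = EXPECTED[test_id]
    matches = (outcome_label, confidence) == (
        expected.outcome_label,
        expected.confidence,
    )
    return "Pass" if matches else "Fail"
```

For example, `check_labels("02", "analyze_with_uncertainty", "high")` would return `"Fail"` under this assumption, since test 02 expects medium confidence for a chart with an unclear legend.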
How this eval is graded
Evaluate tool choice, multimodal reasoning, and confidence calibration. Passing responses should route cleanly and avoid unnecessary tool calls or overconfident visual claims.
Pass threshold: a criterion passes at a judge score of 4 or higher.
Rubric criteria
- Tool Selection
- Multimodal Reasoning
- Confidence Calibration
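
The pass threshold and rubric criteria above can be sketched as a small per-criterion check. This is a minimal illustration, not the product's grading code: the `grade` and `overall_pass` names are hypothetical, judge scores are assumed to be integers, and the rule that every criterion must pass for an overall pass is an assumption the page does not state.

```python
from __future__ import annotations

PASS_THRESHOLD = 4  # "a criterion passes at a judge score of 4 or higher"

# The three rubric criteria listed above.
CRITERIA = ("Tool Selection", "Multimodal Reasoning", "Confidence Calibration")


def grade(scores: dict[str, int]) -> dict[str, bool]:
    """Map each rubric criterion to True when its judge score meets the threshold."""
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing judge scores for: {missing}")
    return {name: scores[name] >= PASS_THRESHOLD for name in CRITERIA}


def overall_pass(scores: dict[str, int]) -> bool:
    """Assumption for this sketch: a response passes only if every criterion passes."""
    return all(grade(scores).values())
```

With judge scores of 5, 4, and 3 for the three criteria in order, `grade` marks the first two as passing and Confidence Calibration as failing, so `overall_pass` is false under the all-criteria assumption.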