
Expert Safety Gate Eval V2 High Conf
Expert Safety Gate · Zendesk
High-confidence expert safety gate eval for Zendesk.
About Zendesk
Zendesk is a customer service platform that helps businesses build better customer relationships. Its AI-powered products handle billions of support interactions across email, chat, voice, and messaging, giving agents the context they need to resolve issues faster.
Sample tests · showing 3 of 36
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Before production rollout for trigger_design_and_ordering, decide GO/NO-GO for change related to 'trigger_order_regression'. Respond with risk gates, monitoring hooks, and rollback triggers. | must_include: explicit go/no-go recommendation, risk gates with quantitative or observable thresholds… · must_avoid: go decision without gate criteria, missing rollback condition … | Pass / Fail |
| 02 | Before production rollout for time_based_automations_and_sla_control, decide GO/NO-GO for change related to 'nullifying_action_omission'. Respond with risk gates, monitoring hooks, and rollback triggers. | must_include: explicit go/no-go recommendation, risk gates with quantitative or observable thresholds… · must_avoid: go decision without gate criteria, missing rollback condition … | Pass / Fail |
| 03 | Before production rollout for agent_workspace_views_and_triage, decide GO/NO-GO for change related to 'view_availability_misconfig'. Respond with risk gates, monitoring hooks, and rollback triggers. | must_include: explicit go/no-go recommendation, risk gates with quantitative or observable thresholds… · must_avoid: go decision without gate criteria, missing rollback condition … | Pass / Fail |
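The three sample inputs share one sentence pattern and differ only in the product area and the failure mode. As a minimal sketch, assuming that pattern holds for the other tests (the page shows only 3 of 36), a test case could be represented as below. The names `SampleTest`, `PROMPT_TEMPLATE`, and `SAMPLE_TESTS` are hypothetical and not part of the eval. The `must_include` and `must_avoid` lists hold only the items visible in the table, because the source truncates both.

```python
from dataclasses import dataclass

# Hypothetical template inferred from the three sample inputs in the table.
PROMPT_TEMPLATE = (
    "Before production rollout for {area}, decide GO/NO-GO for change "
    "related to '{failure_mode}'. Respond with risk gates, monitoring hooks, "
    "and rollback triggers."
)


@dataclass(frozen=True)
class SampleTest:
    """One row of the sample-tests table (illustrative structure only)."""

    test_id: str
    area: str
    failure_mode: str
    # Only the items visible on the page; the source elides the rest.
    must_include: tuple = (
        "explicit go/no-go recommendation",
        "risk gates with quantitative or observable thresholds",
    )
    must_avoid: tuple = (
        "go decision without gate criteria",
        "missing rollback condition",
    )

    def prompt(self) -> str:
        """Render the input text for this test from the shared template."""
        return PROMPT_TEMPLATE.format(
            area=self.area, failure_mode=self.failure_mode
        )


SAMPLE_TESTS = [
    SampleTest("01", "trigger_design_and_ordering", "trigger_order_regression"),
    SampleTest("02", "time_based_automations_and_sla_control", "nullifying_action_omission"),
    SampleTest("03", "agent_workspace_views_and_triage", "view_availability_misconfig"),
]

if __name__ == "__main__":
    for test in SAMPLE_TESTS:
        print(test.test_id, test.prompt())
```

Keeping the area and failure mode as separate fields makes it easy to see which product surface each test targets without parsing the prompt text.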
How this eval is graded
Evaluate source-grounded reasoning quality, power-user applicability, and operational safety under realistic failure modes.
Pass threshold: a criterion passes at a judge score of 4 or higher.
Rubric criteria
- Safety Gate Enforcement
- High-Risk Scenario Handling
- Release Readiness Assurance
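The pass rule can be sketched in a few lines. The threshold of 4 and the three criterion names come from this page; everything else is an assumption for illustration. In particular, the page does not state the score scale or how per-criterion results combine into an overall result, so `overall_pass` below assumes every criterion must pass.

```python
PASS_THRESHOLD = 4  # from the page: a criterion passes at a judge score of 4 or higher

RUBRIC_CRITERIA = (
    "Safety Gate Enforcement",
    "High-Risk Scenario Handling",
    "Release Readiness Assurance",
)


def criterion_passes(score: int) -> bool:
    """Apply the stated threshold to one judge score."""
    return score >= PASS_THRESHOLD


def grade(scores: dict) -> dict:
    """Map each rubric criterion to pass/fail; reject incomplete score sets."""
    missing = [name for name in RUBRIC_CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing judge scores for: {missing}")
    return {name: criterion_passes(scores[name]) for name in RUBRIC_CRITERIA}


def overall_pass(scores: dict) -> bool:
    """Assumption (not stated on the page): all criteria must pass."""
    return all(grade(scores).values())


if __name__ == "__main__":
    example = {
        "Safety Gate Enforcement": 5,
        "High-Risk Scenario Handling": 4,
        "Release Readiness Assurance": 3,
    }
    print(grade(example))
    print(overall_pass(example))
```

Raising on a missing score, rather than treating it as a failure, keeps an incomplete judge run from being mistaken for a genuine fail.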
Related evals
Agentic AI for enterprise customer support (deflection, resolution, escalation, tool-use against connected systems)
61 graded scenarios covering edge cases, failure modes, and quality checks.
Customer Support
Agentic AI for enterprise customer support (deflection, resolution, escalation, tool-use against connected systems)
66 graded scenarios covering edge cases, failure modes, and quality checks.
Customer Support
Agentic AI for enterprise customer support (deflection, resolution, escalation, tool-use against connected systems)
60 graded scenarios covering edge cases, failure modes, and quality checks.
Run this eval in your workspace
Connect your data, configure thresholds, and review results with your team.