
Expert Safety Gate Eval V2 High Conf
Expert Safety Gate · Puzzle
High-confidence expert safety gate eval for Puzzle.
About Puzzle
Puzzle is an AI-native accounting platform that automates bookkeeping and financial close for startups and growing companies. Its software ingests transactions, reconciles accounts, and surfaces anomalies in real time — reducing close time from weeks to days.
Sample tests· showing 3 of 36
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Before production rollout for month_end_close_checklist, decide GO/NO-GO for change related to 'checklist_step_skip'. Respond with risk gates, monitoring hooks, and rollback triggers. | must_include: explicit go/no-go recommendation, risk gates with quantitative or observable thresholds… · must_avoid: go decision without gate criteria, missing rollback condition … | Pass / Fail |
| 02 | Before production rollout for bank_and_card_reconciliation, decide GO/NO-GO for change related to 'stale_feed_data_conflict'. Respond with risk gates, monitoring hooks, and rollback triggers. | must_include: explicit go/no-go recommendation, risk gates with quantitative or observable thresholds… · must_avoid: go decision without gate criteria, missing rollback condition … | Pass / Fail |
| 03 | Before production rollout for automated_revenue_recognition, decide GO/NO-GO for change related to 'methodology_assumption_mismatch'. Respond with risk gates, monitoring hooks, and rollback triggers. | must_include: explicit go/no-go recommendation, risk gates with quantitative or observable thresholds… · must_avoid: go decision without gate criteria, missing rollback condition … | Pass / Fail |
How this eval is graded
Evaluate source-grounded reasoning quality, power-user applicability, and operational safety under realistic failure modes.
Pass threshold: a criterion passes at a judge score of 4 or higher.
Rubric criteria
- Safety Gate Enforcement
- High-Risk Scenario Handling
- Release Readiness Assurance
Recommended for
Works with
Related evals
Accounting Close Controls
Operational response/safety eval for Puzzle covering accounting close controls.
View Accounting & FinanceAudit Readiness Traceability
Wave 2 production eval for Puzzle focused on audit readiness traceability.
View Accounting & FinanceCore Workflow Ingest
Source-traceable ingest painpoint eval for Puzzle.
ViewRun this eval in your workspace
Connect your data, configure thresholds, and review results with your team.