
Evals for Puzzle
6 evaluation packs covering adversarial robustness, safety gates, workflow quality, and operator-level checks for Puzzle AI products.
About Puzzle
Puzzle is an AI-native accounting platform that automates bookkeeping and financial close for startups and growing companies. Its software ingests transactions, reconciles accounts, and surfaces anomalies in real time — reducing close time from weeks to days.
Available eval packs for Puzzle
6 packs ready to run.
Accounting Close Controls V1
Operational response/safety eval for Puzzle covering accounting close controls.
Audit Readiness Traceability V1
Wave 2 production eval for Puzzle focused on audit readiness traceability.
Ingest Painpoint Eval V1
Source-traceable ingest painpoint eval for Puzzle.
Expert Safety Gate Eval V2 High Conf
Task Completion: High-confidence expert safety gate eval for Puzzle.
Power User Ops Eval V2 High Conf
High-confidence power-user operations eval for Puzzle.
Workflow Painpoint Eval V2 High Conf
Task Completion: High-confidence workflow painpoint eval for Puzzle.
Why evaluate Puzzle AI
Puzzle's AI features ship with brand promises about accuracy, safety, and reliability. Buyers and integrators need to know whether those promises hold up under adversarial prompts, edge-case workflows, and the long tail of real customer inputs, not just on the demo path.
The Corsac eval library for Puzzle measures the four dimensions that teams care about most when deploying accounting and finance agents:
- Adversarial robustness: does the agent resist prompt injection, jailbreaks, and social-engineering attempts?
- Workflow quality: does it complete the task buyers were shown in the demo, on inputs that don't look like the demo?
- Safety gates: does it escalate or refuse when it should, and only then?
- Operator quality: does it preserve analyst trust by surfacing the right context at the right time?
Every eval pack above is hand-authored against Puzzle's public product surface and runnable in Corsac with your own data.