Evals for Puzzle

Six evaluation packs covering adversarial robustness, safety gates, workflow quality, and operator-level checks for Puzzle's AI products.

Accounting & Finance

About Puzzle

Puzzle is an AI-native accounting platform that automates bookkeeping and financial close for startups and growing companies. Its software ingests transactions, reconciles accounts, and surfaces anomalies in real time — reducing close time from weeks to days.

Employees: ~60
Industry: Accounting Software
Headquarters: San Francisco, CA
Website: puzzle.io

Available eval packs for Puzzle

6 packs ready to run.

Why eval Puzzle AI

Puzzle's AI features ship behind brand promises about accuracy, safety, and reliability. Buyers and integrators need to know those promises hold up under adversarial prompts, edge-case workflows, and the long tail of real customer inputs — not just the demo path.

The Corsac eval library for Puzzle measures the four dimensions that teams care about most when deploying accounting and finance agents:

Adversarial robustness — does the agent resist prompt injection, jailbreaks, and social-engineering attempts?
Workflow quality — does it complete the task buyers were shown in the demo, on inputs that don't look like the demo?
Safety gates — does it escalate or refuse when it should, and only then?
Operator quality — does it preserve analyst trust by surfacing the right context at the right time?

Every eval pack above is hand-authored against Puzzle's public product surface and runnable in Corsac with your own data.