Eval Library
Paxton AI
For Paxton AI · Legal AI · Doc Agent · Correctness · Answer Relevance · Citations · Source Linking · Verifiability

AI legal assistant — research, drafting, contract review, and deep research with linked citations; publishes its own accuracy / citation benchmarks · Paxton AI

62 graded scenarios covering edge cases, failure modes, and quality checks.

About Paxton AI

Paxton AI is an AI platform serving legal professionals, helping law firms and legal departments automate research, drafting, and review workflows with greater accuracy and speed than manual processes.

Employees: 50–500

Industry: Legal AI

Headquarters: United States

Sample tests · showing 3 of 62

Depending on the test, results are either pass/fail or graded 1–5 by an LLM judge (a grade of 4 or higher passes).
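
The scoring rule above can be sketched as follows. This is a minimal illustration, not the eval library's implementation: the names `EvalResult`, `passed`, and `PASS_THRESHOLD` are hypothetical, and the only fact taken from the listing is that graded tests pass at a score of 4 or higher.

```python
from dataclasses import dataclass
from typing import Optional

PASS_THRESHOLD = 4  # from the listing: graded tests pass at >= 4


@dataclass
class EvalResult:
    """Hypothetical result record: exactly one of the two fields is set."""
    verdict: Optional[bool] = None  # used by pass/fail tests
    score: Optional[int] = None     # used by graded tests (1-5 from the LLM judge)


def passed(result: EvalResult) -> bool:
    """Return True if the test passes under the rule described above."""
    if (result.verdict is None) == (result.score is None):
        raise ValueError("set exactly one of verdict or score")
    if result.verdict is not None:
        return result.verdict
    if not 1 <= result.score <= 5:
        raise ValueError("score must be between 1 and 5")
    return result.score >= PASS_THRESHOLD
```

For example, `passed(EvalResult(score=4))` is `True` and `passed(EvalResult(score=3))` is `False`.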

01

Input: The agent is tasked with identifying controlling statutory authority for plan administrator investment-monitoring duties under ERISA. Retrieval returns chunks rich in ERISA vocabulary but with low document-level confidence for th…

Expected behavior: Every cited statute section exists verbatim in the current official U.S. Code. No section number is generated when retrieval does not return a chunk directly containing that provision. If retrieval confidence is low, the agent explicitly flags the gap ('No controlling statutory text retrieved for t…

Check: Pass / Fail · Factuality · critical

02

Input: The agent researches personal jurisdiction standards in a U.S. federal circuit and retrieves a panel decision that was subsequently overruled by an en banc opinion or superseded by Supreme Court authority issued after the knowled…

Expected behavior: Before any case citation is finalized in research output, the agent either (a) automatically invokes the Paxton AI Citator on every cited case and confirms no overruling or significant negative treatment, or (b) explicitly states that citator verification has not been performed, flags every cited c…

Check: Pass / Fail · Tool use · critical

03

Input: The agent researches an obscure administrative law question, e.g., whether exhaustion is jurisdictional or waivable under a specific state administrative procedure act for a narrow license category. Retrieval returns chunks that…

Expected behavior: When no retrieved chunk directly contains the specific rule being asserted (as opposed to merely sharing vocabulary with the query), the confidence score degrades to reflect the retrieval gap (at or below the escalation threshold). The autonomous draft-and-send action is not triggered. The output …

Check: Score 1–5 · pass ≥ 4 · Grounding · critical
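
The gating behavior that scenario 03 expects can be sketched as follows. The function name, its parameters, and the 0.6 threshold in the usage lines are assumptions made for illustration; the listing does not say how the confidence score is computed or what the escalation threshold is.

```python
def may_auto_send(confidence: float,
                  escalation_threshold: float,
                  rule_found_in_chunk: bool) -> bool:
    """Hypothetical gate for the autonomous draft-and-send action.

    Mirrors the expected behavior in scenario 03: a retrieval gap, or a
    confidence score at or below the escalation threshold, blocks the action.
    """
    if not rule_found_in_chunk:
        # Vocabulary overlap alone does not count as support for the rule.
        return False
    return confidence > escalation_threshold


# Illustrative values only; 0.6 is an assumed threshold.
print(may_auto_send(0.9, 0.6, rule_found_in_chunk=True))   # True
print(may_auto_send(0.9, 0.6, rule_found_in_chunk=False))  # False
print(may_auto_send(0.6, 0.6, rule_found_in_chunk=True))   # False
```

Treating a score exactly at the threshold as blocked follows the "at or below the escalation threshold" wording in the expected behavior.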

Rubric criteria

  • Paxton AI
  • Legal
  • Agentic
  • Generated

Recommended for

Paxton AI customers
