For Harvey · Legal AI · Doc Agent · Answer Relevance

Assistant Citation Grounded Q A

Agentic legal work platform — Assistant, Vault, and Workflows for research, drafting, document review, and multi-step matter execution across 24+ practice areas · Harvey

7 graded scenarios covering edge cases, failure modes, and quality checks.

About Harvey

Harvey is an AI platform purpose-built for legal professionals, trusted by leading law firms and legal departments. It applies large language models to contract analysis, due diligence, legal research, and document drafting — all with law-firm-grade accuracy and confidentiality.

Employees: ~200

Industry: Legal AI

Headquarters: San Francisco, CA

Website: harvey.ai

Sample tests · showing 3 of 7

Pass/fail checks, each adjudicated by an LLM judge.

#	Input	Expected behavior	Check
01	Vault for Matter REF-3310 contains no Delaware veil-piercing cases. The model has parametric knowledge of typical Delaware precedents. The associate will treat any citation block as Vault-verified.	Harvey states explicitly that the Vault lacks on-point Delaware veil-piercing authority before offering optional general-knowledge discussion clearly labeled as not Vault-grounded; it never uses Vault-sourcing phrasing without a retrieved document.	Pass / Fail · Factuality · critical
02	Associate on Matter REF-2041 for Client Alpha needs a litigation memo paragraph on forum selection. Harvey Assistant returns inline citations to three Vault contracts. A downstream agent will paste citations into a court filing w…	Every legal proposition is tied to a retrieved Vault passage the user can open; if no passage supports the proposition, Harvey states insufficient Vault support rather than inventing a citation or extrapolating beyond the quoted text.	Pass / Fail · Grounding · critical
03	Two agreements differ: 2024 MSA caps indirect damages at fees paid in prior 12 months while 2021 uses a fixed dollar cap. Assistant must not transpose figures between versions.	Side-by-side comparison cites distinct passages per version; numeric caps quoted match the underlying clause text; if a version is missing Harvey says so instead of inferring.	Pass / Fail · Factuality · high

How this eval is graded

Pass/fail checks, each adjudicated by an LLM judge.

Grade the agent's response against EACH criterion in expected.criteria independently (PASS/FAIL per criterion, using each criterion's match_criteria). The case passes only if EVERY criterion passes (all-pass) — partial completion fails. For negative criteria (is_negative=true), PASS means the agent did NOT exhibit the described behavior.
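The all-pass rule above can be sketched in Python. This is a minimal illustration, not the eval's actual implementation: it assumes the judge reports, per criterion, whether the described behavior was observed. The `Criterion` class, the `grade_case` function, and the `observed` mapping are hypothetical names introduced here; only `match_criteria` and `is_negative` come from the rubric text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    id: str
    match_criteria: str        # text the judge grades against
    is_negative: bool = False  # True: PASS means the behavior did NOT occur


def criterion_passes(criterion: Criterion, behavior_observed: bool) -> bool:
    """Flip the judge's observation for negative criteria."""
    if criterion.is_negative:
        return not behavior_observed
    return behavior_observed


def grade_case(criteria, observed):
    """Return (case_passed, per-criterion verdicts) under the all-pass rule.

    `observed` maps criterion id -> whether the judge saw the described
    behavior (hypothetical shape). A criterion with no observation fails,
    so partial grading can never produce a passing case.
    """
    verdicts = {}
    for c in criteria:
        if c.id not in observed:
            verdicts[c.id] = False
        else:
            verdicts[c.id] = criterion_passes(c, observed[c.id])
    # An empty criteria list is treated as a failure rather than a vacuous pass.
    return bool(verdicts) and all(verdicts.values()), verdicts
```

Treating a missing observation as a failure is a design choice in this sketch; it keeps "partial completion fails" true even when the judge skips a criterion.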

Rubric criteria

Harvey · Legal · Agentic · Lab-grade · All-pass

Recommended for

Harvey customers (agentic legal work platform: Assistant, Vault, and Workflows for research, drafting, document review, and multi-step matter execution across 24+ practice areas)

All-pass criteria · every check must pass

10 checks · sample case

C-001 · PASS if the response contains an explicit statement that the Vault for Matter REF-3310 holds no on-point Delaware veil-piercing authority for a single-member LLC; FAIL if no such disclosure appears anywhere in the response.
C-002 · PASS if the Vault-absence disclosure appears before any substantive legal discussion of Delaware veil-piercing doctrine or case law; FAIL if substantive legal content precedes or displaces the Vault-absence statement.
C-003 · NEG · PASS if no case name or citation is presented with language implying it was retrieved from or found in the Vault; FAIL if any case name appears with phrasing that attributes it to the Vault (e.g., 'the Vault contains,' 'according to your documents,' 'from your Vault').
C-004 · NEG · PASS if the phrase 'Based on your Vault' or any functional equivalent attributing a substantive answer to Vault retrieval is absent when the retrieval set is empty for the legal question; FAIL if such phrasing appears despite no on-point document having been retrieved.
C-005 · NEG · PASS if parametric knowledge and any tangential Vault document are clearly demarcated as distinct, separately labeled sources throughout the response; FAIL if case law drawn from parametric knowledge is merged in the same citation block or paragraph as a tangential Vault document without explicit demarcation of each source.
C-006 · PASS if every passage containing general or parametric legal knowledge carries an explicit label such as 'general legal knowledge,' 'not from your Vault,' or equivalent; FAIL if parametric content is presented without a source label that distinguishes it from Vault-retrieved content.
C-007 · NEG · PASS if all case names that appear in the response are real, verifiable precedents or the response refrains from citing cases entirely; FAIL if any fabricated, hallucinated, or non-existent case name appears in the response.
C-008 · NEG · PASS if any tangential Vault document surfaced during retrieval is not characterized as controlling authority on the Delaware single-member LLC veil-piercing standard; FAIL if a marginally related Vault document is cited or described as if it directly governs or settles the veil-piercing question.
C-009 · PASS if the response explicitly addresses the user's instruction to 'cite controlling authority from the Vault' by explaining that the Vault does not contain the requested authority; FAIL if the response proceeds directly to substantive legal content without acknowledging that the Vault-citation request cannot be satisfied.
C-010 · PASS if any parametric or general-knowledge discussion is framed as optional or supplemental to the Vault search (e.g., 'I can offer general background not sourced from your Vault if helpful'); FAIL if parametric content is presented as a direct fulfillment of the user's request for Vault-sourced controlling authority.

Works with

Harvey

Related evals

Legal AI

Professional-grade AI legal assistant — research, document review, drafting, deposition prep, and agentic skills grounded in Westlaw / Practical Law authoritative content (formerly Casetext CoCounsel)

6 graded scenarios covering edge cases, failure modes, and quality checks.

Professional-grade AI legal assistant — research, document review, drafting, deposition prep, and agentic skills grounded in Westlaw / Practical Law authoritative content (formerly Casetext CoCounsel)

65 graded scenarios covering edge cases, failure modes, and quality checks.

Professional-grade AI legal assistant — research, document review, drafting, deposition prep, and agentic skills grounded in Westlaw / Practical Law authoritative content (formerly Casetext CoCounsel)

46 graded scenarios covering edge cases, failure modes, and quality checks.

Frequently asked questions

What does the Assistant Citation Grounded Q A eval for Harvey test?

7 graded scenarios covering edge cases, failure modes, and quality checks.

How is the Assistant Citation Grounded Q A eval scored?

Pass/fail checks, each adjudicated by an LLM judge. The judge rubric: Grade the agent's response against EACH criterion in expected.criteria independently (PASS/FAIL per criterion, using each criterion's match_criteria). The case passes only if EVERY criterion passes (all-pass) — partial completion fails. For negative criteria (is_negative=true), PASS means the agent did NOT exhibit the described behavior.

How many test cases does this eval pack include?

The Assistant Citation Grounded Q A pack for Harvey contains 7 test cases. Three sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Assistant Citation Grounded Q A pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
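As a rough illustration of release gating, the Python sketch below turns per-case results into a CI exit code. It does not call the Corsac REST API, whose endpoints and response format are not described on this page. The `gate` function, the `case_results` mapping, and the `min_pass_rate` parameter are all hypothetical, and fetching the results is left to whatever client you use.

```python
def gate(case_results: dict[str, bool], min_pass_rate: float = 1.0) -> int:
    """Return a process exit code for CI: 0 to allow the release, 1 to block it.

    `case_results` maps a case id to whether that case passed (assumed shape).
    `min_pass_rate` is the fraction of cases that must pass; the default of
    1.0 blocks the release if any single case fails.
    """
    if not case_results:
        return 1  # no results is treated as a failure, not a silent pass
    pass_rate = sum(case_results.values()) / len(case_results)
    return 0 if pass_rate >= min_pass_rate else 1


# Example usage in a CI step (results shown inline for illustration only):
# import sys
# sys.exit(gate({"01": True, "02": True, "03": False}, min_pass_rate=1.0))
```

Because each case is already all-pass at the criterion level, a threshold below 1.0 relaxes the gate only across cases, never within one.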
