For Spellbook · Legal AI · Doc Agent

Review Redline Risk Spotting And Tracked Changes Generation

Generative AI for transactional lawyers in Microsoft Word — contract drafting, review, redlining, and the agentic Spellbook Associate workflow · Spellbook

51 graded scenarios covering edge cases, failure modes, and quality checks.

About Spellbook

Spellbook is a legal-specific AI assistant for contract review, drafting, redlining, and document questions. Its workflows are delivered through a Microsoft Word add-in.

Industry

Legal Technology / Contract AI

Website

www.spellbook.legal

Sample tests · showing 3 of 51

Pass/fail checks, each adjudicated by an LLM judge.

#	Input	Expected behavior	Check
01	Section 12 contains an aggregate LOL cap: 'Vendor's liability shall not exceed $50,000 in aggregate for all claims.' Section 9 (indemnification) reads: 'Client shall indemnify, defend, and hold harmless Vendor from and against an…	Agent: (1) flags the $50,000 LOL cap as a risk item; (2) separately flags Section 9 as a first-party indemnification (covering Vendor's own direct losses rather than third-party claims asserted against Vendor), citing the absence of any 'third-party claim' qualifier; (3) explicitly states that the fir…	Pass / Fail · Factuality · critical
02	The 40-page MSA contains two numerically phrased liability provisions: Section 11.2 (liquidated damages for late delivery: 'not to exceed 5% of the affected monthly fees per week of delay') and Section 12.1 (LOL cap: 'Vendor's ag…	Agent produces a risk item explicitly labeled 'Limitation of Liability' (or equivalent) that: (1) quotes or paraphrases the 12-month trailing-fees formula from Section 12.1, (2) identifies it as an LOL cap applied to Vendor's aggregate liability, (3) surfaces it as a distinct risk item separate fro…	Pass / Fail · Factuality · critical
03	Section 14.1: 'Either party may terminate this Agreement upon written notice if the other party materially breaches this Agreement and fails to cure such breach within thirty (30) days of written notice.' Section 14.2: 'Vendor ma…	Agent flags Section 14.2 as a unilateral termination-for-convenience right held by Vendor only. States that Client has no corresponding termination-for-convenience right and can only terminate for cause under Section 14.1. Names Vendor as the party holding the Section 14.2 right and Client as the p…	Pass / Fail · Factuality · critical

How this eval is graded

Grade the agent's response against EACH criterion in expected.criteria independently (PASS/FAIL per criterion, using each criterion's match_criteria). The case passes only if EVERY criterion passes (all-pass) — partial completion fails. For negative criteria (is_negative=true), PASS means the agent did NOT exhibit the described behavior.
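The all-pass rule above can be sketched in a few lines of Python. This is a minimal illustration, not the judge's actual implementation: `is_negative` comes from the rubric text, while `CriterionResult`, `behavior_observed`, and the two helper functions are hypothetical names introduced here. It also assumes that a case with no criteria counts as a failure, which the rubric does not specify.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CriterionResult:
    """One judged criterion (hypothetical structure for illustration)."""
    criterion_id: str
    is_negative: bool        # from the rubric: negative criteria invert the check
    behavior_observed: bool  # judge's finding: did the response exhibit the behavior?


def criterion_passes(result: CriterionResult) -> bool:
    """Positive criteria pass when the behavior is observed;
    negative criteria pass when it is NOT observed."""
    if result.is_negative:
        return not result.behavior_observed
    return result.behavior_observed


def case_passes(results: Iterable[CriterionResult]) -> bool:
    """All-pass: every criterion must pass; partial completion fails.
    Assumption: an empty criteria list is treated as a failure."""
    results = list(results)
    return bool(results) and all(criterion_passes(r) for r in results)


if __name__ == "__main__":
    judged = [
        CriterionResult("C-001", is_negative=False, behavior_observed=True),
        CriterionResult("C-005", is_negative=True, behavior_observed=False),
    ]
    print(case_passes(judged))  # both criteria pass, so the case passes
```

Under this rule a single failed criterion fails the whole case, which is why the sample case's ten checks are not averaged into a partial score.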

Rubric criteria

Spellbook
Legal
Agentic
Lab-grade
All-pass

Recommended for

Generative AI for transactional lawyers in Microsoft Word: contract drafting, review, redlining, and the agentic Spellbook Associate workflow · Spellbook customers

All-pass criteria · every check must pass

10 checks · sample case

C-001: PASS if the response flags the Section 12 aggregate limitation-of-liability cap ($50,000 aggregate for all claims) as a risk or adverse provision for the Client. FAIL if the $50,000 LOL cap is not flagged as a risk item.
C-002: PASS if the response characterizes Section 9 as a first-party indemnification, i.e., covering Vendor's own direct losses arising from Client's conduct rather than (or in addition to) third-party claims asserted against Vendor. FAIL if Section 9 is not identified as first-party in scope.
C-003: PASS if the response explicitly notes that Section 9 contains no 'third-party claim' qualifier (or equivalent language limiting the indemnity to claims brought by third parties) as the textual basis for the first-party reading. FAIL if the missing third-party limiter is never cited as support.
C-004: PASS if the response explicitly states that the first-party indemnification in Section 9 undermines or effectively defeats the Section 12 LOL cap by giving Vendor an unlimited direct-recovery path against Client outside the cap. FAIL if this interaction between the two sections is not explicitly articulated.
C-005 · NEG: PASS if the response never categorizes Section 9 as a standard, customary, or routine third-party indemnification clause. FAIL if Section 9 is described as 'standard indemnification — third-party claims' or any equivalent characterization implying it only covers third-party claims.
C-006 · NEG: PASS if the response cross-references Section 9 and Section 12 as interacting risk items. FAIL if the LOL cap and the indemnification clause are flagged only as separate, standalone risk items with no mention of how one affects the other.
C-007 · NEG: PASS if the response identifies the structural risk (first-party framing creating an uncapped recovery path) rather than only complaining that Section 9's triggers are broad. FAIL if the response characterizes the Section 9 risk solely as scope over-reach or breadth of triggers without identifying the LOL-bypass effect.
C-008 · NEG: PASS if any redline or recommendation for Section 9 identifies the first-party framing (lack of a third-party-claim limiter) as the defect to fix, even if it also narrows triggers. FAIL if the response recommends narrowing the (a)-(c) triggers by subject matter while never identifying the first-party framing as the structural problem.
C-009 · NEG: PASS if the response avoids asserting that the limitation-of-liability issues are 'flagged and addressed', 'resolved', or fully covered when the cap-defeating interaction has not been remediated. FAIL if the response marks the LOL section as addressed or the review as complete in a way that papers over the unresolved cap-bypass exposure.
C-010: PASS if the risk analysis is framed from the Client's position (Vendor-favorable provisions flagged as adverse to Client). FAIL if any provision is assessed from the Vendor's perspective or a Vendor-favorable provision is described as favorable to the Client.

Works with

Spellbook

Related evals

Legal AI

Professional-grade AI legal assistant — research, document review, drafting, deposition prep, and agentic skills grounded in Westlaw / Practical Law authoritative content (formerly Casetext CoCounsel)

6 graded scenarios covering edge cases, failure modes, and quality checks.

Legal AI

Professional-grade AI legal assistant — research, document review, drafting, deposition prep, and agentic skills grounded in Westlaw / Practical Law authoritative content (formerly Casetext CoCounsel)

65 graded scenarios covering edge cases, failure modes, and quality checks.

Legal AI

Professional-grade AI legal assistant — research, document review, drafting, deposition prep, and agentic skills grounded in Westlaw / Practical Law authoritative content (formerly Casetext CoCounsel)

46 graded scenarios covering edge cases, failure modes, and quality checks.

Frequently asked questions

What does the Review Redline Risk Spotting And Tracked Changes Generation eval for Spellbook test?

51 graded scenarios covering edge cases, failure modes, and quality checks.

How is the Review Redline Risk Spotting And Tracked Changes Generation eval scored?

Pass/fail checks, each adjudicated by an LLM judge. The judge rubric: Grade the agent's response against EACH criterion in expected.criteria independently (PASS/FAIL per criterion, using each criterion's match_criteria). The case passes only if EVERY criterion passes (all-pass) — partial completion fails. For negative criteria (is_negative=true), PASS means the agent did NOT exhibit the described behavior.

How many test cases does this eval pack include?

The Review Redline Risk Spotting And Tracked Changes Generation pack for Spellbook contains 51 test cases. Three sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Review Redline Risk Spotting And Tracked Changes Generation pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
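As a rough illustration of what gating a release on the pack could look like, the sketch below turns per-case pass/fail results into a process exit code that a CI job can act on. It does not call the Corsac REST API, whose endpoints and response format are not described on this page; the `case_results` mapping, the `min_pass_rate` threshold, and both function names are assumptions made for this example.

```python
from typing import Mapping


def pass_rate(case_results: Mapping[str, bool]) -> float:
    """Fraction of cases that passed. `case_results` maps a case id to
    its pass/fail outcome (hypothetical shape, not a Corsac format)."""
    if not case_results:
        raise ValueError("no case results to evaluate")
    return sum(case_results.values()) / len(case_results)


def gate_exit_code(case_results: Mapping[str, bool],
                   min_pass_rate: float = 1.0) -> int:
    """Return 0 (allow the release) when the pass rate meets the
    threshold, otherwise 1 (block it). The default of 1.0 is an
    assumption; a team would pick its own threshold."""
    return 0 if pass_rate(case_results) >= min_pass_rate else 1


if __name__ == "__main__":
    # Hypothetical outcomes for the three sample cases.
    sample = {"01": True, "02": True, "03": False}
    raise SystemExit(gate_exit_code(sample, min_pass_rate=0.9))
```

A CI step would run a script like this after fetching results by whatever means the workspace provides, and the non-zero exit code would fail the pipeline.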

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.