For LlamaIndex · AI Platform

Structured Outputs And Extraction

LlamaIndex (+ LlamaCloud) · LlamaIndex

RAG / Data Framework — LlamaIndex

Evaluates LlamaIndex's Structured Outputs & Extraction across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's RAG / Data Framework eval coverage.

About LlamaIndex

LlamaIndex is a data framework for building RAG and agent applications over private data. Its core pieces are documents and nodes, indexes (such as VectorStoreIndex), retrievers and query engines, and the IngestionPipeline, with LlamaParse and LlamaCloud providing managed document parsing and retrieval.

Employees

~50

Industry

RAG Framework

Headquarters

San Francisco, CA

Website

www.llamaindex.ai

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check	Category	Severity
01	A query engine is built with output_cls=InvoiceSummary (a Pydantic model). When the LLM returns JSON missing a required field, the integrator catches the ValidationError and returns an empty object as success.	On Pydantic validation failure from output_cls/structured_predict, do not silently substitute an empty/partial object as success. Retry with a corrective prompt, surface the validation error, or route to review. The schema contract must hold or the failure must be visible.	Pass / Fail	AI Platform	high
02	Instead of llm.structured_predict / as_structured_llm with a Pydantic model, the integrator prompts for JSON in free text and json.loads the completion, which breaks when the model wraps JSON in prose or code fences.	Use LlamaIndex's structured prediction (structured_predict / as_structured_llm / Pydantic program) so parsing is schema-driven and robust to formatting, rather than json.loads on free-form text. This leverages native structured output / tool schemas where the model supports them.	Pass / Fail	AI Platform	medium
03	An extraction pipeline pulls 'effective_date' from contracts. For a contract that has no effective date, the model confidently returns a plausible date not present in the document.	Constrain extraction to values present in the source: allow null/'not found' for absent fields, and verify extracted values against the source text (e.g. via source nodes / citations) rather than accepting hallucinated-but-plausible values. An unfound field must be null, not invented.	Pass / Fail	AI Platform	critical
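
The expected behavior in case 01 can be sketched as a retry loop that never turns a validation failure into a success value. This is a minimal, library-agnostic sketch, not LlamaIndex's own implementation: `call_model`, `validate`, `require_total`, and `ExtractionFailed` are hypothetical names. In a real integration, `validate` would stand in for the Pydantic model's validation and `call_model` for the structured prediction call.

```python
from typing import Callable, Optional


class ExtractionFailed(Exception):
    """Raised when no attempt produced output that satisfies the schema."""


def extract_with_retry(
    call_model: Callable[[Optional[str]], dict],
    validate: Callable[[dict], dict],
    max_attempts: int = 2,
) -> dict:
    """Return validated output, or raise; never return an empty object as success.

    call_model(feedback) is a hypothetical wrapper around the LLM call; feedback
    is None on the first attempt and the previous validation error afterwards.
    validate(raw) is a hypothetical schema check that raises ValueError on failure.
    """
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        raw = call_model(feedback)
        try:
            return validate(raw)
        except ValueError as err:
            # Keep the error so the next prompt can ask for the missing field.
            feedback = str(err)
    # All attempts failed: make the failure visible to the caller.
    raise ExtractionFailed(feedback)


def require_total(raw: dict) -> dict:
    """Toy stand-in for schema validation: 'total' is a required field."""
    if "total" not in raw:
        raise ValueError("missing required field: total")
    return raw
```

The caller either receives an object that passed validation or sees `ExtractionFailed`, which it can log, surface to the user, or route to a review queue.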
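
The failure mode in case 02 can be reproduced with the standard library alone. The sketch below uses a made-up completion string (an assumption, not real model output) to show `json.loads` rejecting valid JSON that arrives wrapped in prose and a code fence. `naive_parse` and `parses_cleanly` are hypothetical names.

```python
import json

FENCE = "`" * 3  # built this way so the example itself can sit inside a code fence

# Made-up completion: the JSON is valid, but wrapped in prose and a fence.
completion = "Here is the summary:\n" + FENCE + 'json\n{"total": 42}\n' + FENCE


def naive_parse(text: str) -> dict:
    """Hypothetical integrator code: trusts the completion to be bare JSON."""
    return json.loads(text)


def parses_cleanly(text: str) -> bool:
    """Report whether naive_parse accepts the text."""
    try:
        naive_parse(text)
        return True
    except json.JSONDecodeError:
        return False
```

Here `parses_cleanly('{"total": 42}')` succeeds while `parses_cleanly(completion)` does not, even though both carry the same payload. Schema-driven structured prediction, as the expected behavior recommends, avoids depending on the exact text layout of the completion.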

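One way to approach case 03 is a grounding check applied after extraction. This is a minimal sketch under a stated assumption: the extractor returns, alongside each value, the verbatim snippet it claims to have read the value from. `ground_field` is a hypothetical helper, and substring matching after whitespace normalization is a simplification. It confirms that the evidence exists in the source, not that the value follows from it.

```python
from typing import Optional


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks do not defeat matching."""
    return " ".join(text.split()).lower()


def ground_field(
    value: Optional[str], evidence: Optional[str], source: str
) -> Optional[str]:
    """Return value only if its supporting evidence appears in the source text.

    value is the extracted field, evidence is the verbatim snippet the
    extractor claims to have read it from (hypothetical contract), and source
    is the document or source-node text. Anything unsupported becomes None.
    """
    if not value or not evidence:
        return None
    if _normalize(evidence) not in _normalize(source):
        return None
    return value
```

With this check in place, a plausible but invented effective date comes back as `None` because its quoted evidence cannot be found in the contract text.
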
How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
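
Read literally, the pass rule combines a mean threshold with a per-criterion floor. A minimal sketch, assuming the judge assigns one numeric score per rubric criterion (the page does not state the scale, and `case_passes` is a hypothetical name):

```python
from statistics import mean
from typing import Sequence


def case_passes(
    scores: Sequence[float], min_mean: float = 4.0, floor: float = 3.0
) -> bool:
    """Pass when the mean is at least min_mean and no score is below floor."""
    if not scores:
        return False  # assumption: a case with no criterion scores cannot pass
    return mean(scores) >= min_mean and min(scores) >= floor
```

For example, scores of 5, 5, and 2 average 4.0 but still fail, because one criterion falls below 3.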

Rubric criteria

LlamaIndex
AI Platform
Structured Outputs And Extraction

Recommended for

LlamaIndex (+ LlamaCloud) · LlamaIndex customers

Works with

LlamaIndex

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the Structured Outputs And Extraction eval for LlamaIndex (+ LlamaCloud) test?

How is the Structured Outputs And Extraction eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Structured Outputs And Extraction pack for LlamaIndex (+ LlamaCloud) contains 9 test cases. Three sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Structured Outputs And Extraction pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
