
Agents And Workflows

LlamaIndex (+ LlamaCloud) · LlamaIndex

RAG / Data Framework — LlamaIndex

Evaluates LlamaIndex's Agents & Workflows across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's RAG / Data Framework eval coverage.

About LlamaIndex

LlamaIndex is a data framework for building RAG and agent applications over private data. Its core pieces are documents and nodes, indexes (such as VectorStoreIndex), retrievers and query engines, and the IngestionPipeline, with LlamaParse and LlamaCloud providing managed document parsing and retrieval.

Employees

~50

Industry

RAG Framework

Headquarters

San Francisco, CA

Website

www.llamaindex.ai

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	A Python function is wrapped as a FunctionTool but its parameters lack type hints and the docstring is empty, so the generated tool schema is untyped and the agent calls it with wrong argument types.	Give tool functions precise type hints and a clear docstring (or an explicit Pydantic schema / fn_schema) so FunctionTool generates a correct JSON schema the agent can call reliably. Validate arguments against the schema before executing; untyped tools produce malformed calls.	Pass / Fail · AI Platform · high
02	An integrator uses ReActAgent with an LLM that has strong native tool/function calling, then fights with the ReAct text format being mis-parsed.	Use FunctionAgent (native tool calling) with models that support it for reliable structured tool calls; reserve ReActAgent for models without native function calling where reasoning is expressed as text. Match the agent type to the model's capabilities rather than defaulting blindly.	Pass / Fail · AI Platform · medium
03	A FunctionTool calls an external API that raises a 503. The wrapper lets the exception propagate and crash the agent.run() loop instead of returning a tool result.	Catch tool execution failures and return them to the agent as a tool output describing the error (so the model can retry, choose another tool, or report), rather than letting the exception abort the agent loop. Cap retries to avoid a tool-call storm.	Pass / Fail · AI Platform · high
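
To make case 01 concrete, here is a minimal standard-library sketch of why type hints and a docstring matter for schema generation. It does not use LlamaIndex itself: `build_tool_schema`, `validate_args`, and the two `convert_*` functions are hypothetical names invented for illustration, and the type mapping is deliberately tiny. FunctionTool's real schema generation is richer, but the failure mode is the same: with no hints there is nothing to put in the schema.

```python
import inspect
from typing import get_type_hints

# Hypothetical, deliberately small mapping from Python types to JSON Schema types.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}
_PY_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def build_tool_schema(fn):
    """Derive a minimal JSON-schema-like dict from a function signature."""
    hints = get_type_hints(fn)
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        hint = hints.get(name)
        # An unhinted parameter yields an empty (unconstrained) property.
        properties[name] = {"type": _JSON_TYPES[hint]} if hint in _JSON_TYPES else {}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def validate_args(schema, args):
    """Return a list of problems; an empty list means the call looks valid."""
    params = schema["parameters"]
    problems = [f"missing argument: {n}" for n in params["required"] if n not in args]
    for name, value in args.items():
        if name not in params["properties"]:
            problems.append(f"unexpected argument: {name}")
            continue
        expected = params["properties"][name].get("type")
        if expected and not _PY_CHECKS[expected](value):
            problems.append(f"{name}: expected {expected}, got {type(value).__name__}")
    return problems


def convert_untyped(amount, currency):  # no hints, no docstring
    return f"{amount} {currency}"


def convert_typed(amount: float, currency: str) -> str:
    """Format a monetary amount with its ISO currency code."""
    return f"{amount:.2f} {currency}"


untyped_schema = build_tool_schema(convert_untyped)
typed_schema = build_tool_schema(convert_typed)
bad_call = validate_args(typed_schema, {"amount": "12", "currency": "EUR"})
good_call = validate_args(typed_schema, {"amount": 12.5, "currency": "EUR"})
```

The untyped function produces empty property entries and an empty description, so a model has nothing to guide its arguments, while the typed version lets `validate_args` reject the string `"12"` before the tool body runs.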
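
The behavior expected in case 03 can be sketched as a plain Python decorator applied to the function before it is registered as a tool. This is one possible approach under stated assumptions, not LlamaIndex's own mechanism: `safe_tool`, `TransientAPIError`, and `fetch_status` are hypothetical names, the retry cap is placed inside the wrapper, and backoff between attempts is omitted for brevity.

```python
import functools


class TransientAPIError(Exception):
    """Hypothetical stand-in for a retryable upstream failure such as a 503."""


def safe_tool(max_attempts=3):
    """Wrap a tool function so failures come back as text instead of raising."""

    def decorator(fn):
        @functools.wraps(fn)  # keeps the name, docstring, and signature visible
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except TransientAPIError as exc:
                    last_error = exc  # retryable: try again until the cap
                except Exception as exc:
                    # Non-retryable: report immediately as a tool output.
                    return f"Tool '{fn.__name__}' failed: {exc!r}. Do not retry."
            return (
                f"Tool '{fn.__name__}' failed after {max_attempts} attempts: "
                f"{last_error}. Try another tool or report the outage."
            )

        return wrapper

    return decorator


calls = {"count": 0}


@safe_tool(max_attempts=2)
def fetch_status(service: str) -> str:
    """Return the current status of a named service."""
    calls["count"] += 1
    raise TransientAPIError("503 Service Unavailable")


result = fetch_status("billing")
```

Here `result` is an ordinary string describing the failure, so the agent loop receives a tool output it can reason about, and the upstream API is called at most `max_attempts` times.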

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
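
As a rough illustration of the pass rule, the sketch below assumes each rubric criterion receives one numeric judge score and that the mean is taken across criteria. The scoring scale, the criterion names, and the function name `case_passes` are assumptions; the page only states the two thresholds.

```python
from statistics import mean


def case_passes(scores, min_mean=4.0, floor=3.0):
    """Apply the stated rule: mean >= 4.0 and no criterion below 3."""
    if not scores:
        return False  # nothing was graded, so nothing can pass
    values = list(scores.values())
    return mean(values) >= min_mean and min(values) >= floor


# Hypothetical criterion names and scores, for illustration only.
borderline_pass = case_passes({"schema": 5, "error_handling": 4, "agent_choice": 3})
fails_on_floor = case_passes({"schema": 5, "error_handling": 5, "agent_choice": 2})
fails_on_mean = case_passes({"schema": 4, "error_handling": 4, "agent_choice": 3})
```

The second example shows why both conditions matter: its mean is exactly 4.0, yet one criterion scoring 2 still fails the case.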

Rubric criteria

LlamaIndex
AI Platform
Agents And Workflows

Recommended for

LlamaIndex (+ LlamaCloud) · LlamaIndex customers

Works with

LlamaIndex

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.


Frequently asked questions

What does the Agents And Workflows eval for LlamaIndex (+ LlamaCloud) test?

Evaluates LlamaIndex's Agents & Workflows across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's RAG / Data Framework eval coverage.

How is the Agents And Workflows eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Agents And Workflows pack for LlamaIndex (+ LlamaCloud) contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Agents And Workflows pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.