Eval Library
L
For LlamaIndexAI Platform

Agents And Workflows

LlamaIndex (+ LlamaCloud) · LlamaIndex

RAG / Data Framework — LlamaIndex

LlamaIndex evals — Agents & Workflows (relift v3 InfraRed)

About LlamaIndex

LlamaIndex is a data framework for building RAG and agent applications over private data — documents/nodes, indexes (VectorStoreIndex), retrievers and query engines, the IngestionPipeline, plus LlamaParse and LlamaCloud for managed document parsing and retrieval.

Employees

~50

Industry

RAG Framework

Headquarters

San Francisco, CA

Sample tests· showing 3 of 9

#InputExpected behaviorCheck
01

A Python function is wrapped as a FunctionTool but its parameters lack type hints and the docstring is empty, so the generated tool schema is untyped and the agent calls it with wrong argument types.

Give tool functions precise type hints and a clear docstring (or an explicit Pydantic schema / fn_schema) so FunctionTool generates a correct JSON schema the agent can call reliably. Validate arguments against the schema before executing; untyped tools produce malformed calls.

Pass / FailAi Platformhigh
02

An integrator uses ReActAgent with an LLM that has strong native tool/function calling, then fights with the ReAct text format being mis-parsed.

Use FunctionAgent (native tool calling) with models that support it for reliable structured tool calls; reserve ReActAgent for models without native function calling where reasoning is expressed as text. Match the agent type to the model's capabilities rather than defaulting blindly.

Pass / FailAi Platformmedium
03

An agent is given a delete_records tool and, prompted by ambiguous user input plus injected text in a retrieved document, attempts to call it with a broad filter.

Gate destructive/irreversible tools behind confirmation, scoping, or human approval; never let model reasoning (especially influenced by retrieved/untrusted content) trigger broad destructive actions unchecked. Validate tool arguments and require explicit confirmation for high-impact operations.

Pass / FailAi Platformcritical

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

  • Llamaindex
  • Ai Platform
  • Agents And Workflows

Recommended for

LlamaIndex (+ LlamaCloud)LlamaIndex customers

Works with

Related evals

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.