Eval Library
For LlamaIndex · AI Platform

Retrievers And Query Engines

LlamaIndex (+ LlamaCloud) · LlamaIndex

RAG / Data Framework — LlamaIndex

LlamaIndex evals — Retrievers & Query Engines (relift v3 InfraRed)

About LlamaIndex

LlamaIndex is a data framework for building RAG and agent applications over private data — documents/nodes, indexes (VectorStoreIndex), retrievers and query engines, the IngestionPipeline, plus LlamaParse and LlamaCloud for managed document parsing and retrieval.

Employees

~50

Industry

RAG Framework

Headquarters

San Francisco, CA

Sample tests · showing 3 of 9

# | Input | Expected behavior | Check
01

A RetrieverQueryEngine is left at similarity_top_k=2 for a corpus where relevant evidence is routinely spread across 5-8 chunks; answers are incomplete.

Tune similarity_top_k to the evidence-spread of the corpus and the LLM's context budget — raise k (and optionally add a reranker to trim) when answers need more chunks, while watching context overflow and cost. Treat k as a measured parameter, not a default.

Pass / Fail · AI Platform · medium
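The expected behavior for test 01 treats k as something you measure. A minimal sketch of that idea follows, assuming you have already counted, for a sample of queries, how many chunks each one needed to be answered fully. The helper name `choose_top_k`, the 90% coverage target, and the fixed per-chunk token size are illustrative assumptions, not part of LlamaIndex.

```python
import math


def choose_top_k(chunks_needed_per_query, chunk_tokens, context_budget_tokens, coverage=0.9):
    """Pick a retrieval k from measured evidence spread, capped by the context budget.

    chunks_needed_per_query: measured count of relevant chunks per sampled query.
    chunk_tokens: assumed (roughly constant) token size of one chunk.
    context_budget_tokens: tokens you are willing to spend on retrieved context.
    coverage: fraction of sampled queries whose evidence should fit inside k.
    """
    if not chunks_needed_per_query:
        raise ValueError("need at least one measurement")
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")

    ordered = sorted(chunks_needed_per_query)
    # Nearest-rank percentile: smallest k that covers `coverage` of the sample.
    rank = max(1, math.ceil(coverage * len(ordered)))
    k_for_coverage = ordered[rank - 1]

    # Largest k that still fits in the context budget (never below 1).
    k_for_budget = max(1, context_budget_tokens // chunk_tokens)

    return min(k_for_coverage, k_for_budget)


# Hypothetical measurements for a corpus whose evidence spans 5-8 chunks.
measured = [5, 6, 6, 7, 8, 5, 7, 8]
k = choose_top_k(measured, chunk_tokens=512, context_budget_tokens=8192)
# The result would then be passed to the engine, e.g.
# index.as_query_engine(similarity_top_k=k)
```

When the budget is the binding constraint, the function returns the smaller budget-limited k, which is the signal to add a reranker or shrink chunks instead of raising k further.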
02

A query engine returns a Response. The application renders response.response text to users as cited answers but never inspects response.source_nodes.

Use response.source_nodes (NodeWithScore) to attribute each answer to the retrieved nodes — render citations from node metadata (source, page) and verify the answer's claims are supported by those nodes. An answer with no/low-score source nodes should be flagged as ungrounded, not presented as cite…

Pass / Fail · AI Platform · critical
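The attribution step in test 02 can be sketched as follows. To keep the example self-contained it uses small stand-in classes (`StubNode`, `StubNodeWithScore`) that only mimic the `node`, `metadata`, and `score` attributes of the real objects in `response.source_nodes`. The 0.5 score threshold and the `source` / `page` metadata keys are assumptions to adapt to your own ingestion pipeline. The sketch covers citation rendering and the ungrounded flag only; checking that each claim is actually supported by the node text is a separate step.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StubNode:
    """Stand-in for a retrieved node; only metadata is modelled here."""
    metadata: dict = field(default_factory=dict)


@dataclass
class StubNodeWithScore:
    """Stand-in for NodeWithScore: a node plus an optional similarity score."""
    node: StubNode
    score: Optional[float] = None


def cite_or_flag(source_nodes, min_score=0.5):
    """Build citations from sufficiently scored nodes; flag the answer if none remain."""
    citations = []
    for item in source_nodes:
        # A missing score is treated as insufficient evidence (an assumption).
        if item.score is None or item.score < min_score:
            continue
        meta = item.node.metadata
        source = meta.get("source", "unknown source")
        page = meta.get("page", "?")
        citations.append(f"{source}, p. {page}")
    return {"grounded": bool(citations), "citations": citations}


nodes = [
    StubNodeWithScore(StubNode({"source": "handbook.pdf", "page": 12}), score=0.82),
    StubNodeWithScore(StubNode({"source": "faq.md"}), score=0.20),
]
result = cite_or_flag(nodes)
# With a real query engine the input would be response.source_nodes.
```

An answer whose result has `grounded` set to `False` should be shown with a warning or withheld, not rendered as a cited answer.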
03

The integrator wants higher precision, so they set similarity_top_k=3 and add an LLMRerank node postprocessor expecting it to improve recall.

Retrieve a wider candidate set (higher similarity_top_k) and let the reranker (LLMRerank / SentenceTransformerRerank / Cohere rerank) cut to a smaller top_n. Reranking cannot recover documents the retriever never fetched — set retrieval k > rerank top_n.

Pass / Fail · AI Platform · high
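The point of test 03, that a reranker can only reorder what the retriever returned, can be shown with a toy pipeline. This is a plain-Python sketch, not LlamaIndex code: `ranked_ids` stands for the retriever's similarity ordering and `rerank_scores` for hypothetical reranker scores. In LlamaIndex the same shape would typically be a higher `similarity_top_k` on the engine plus a rerank postprocessor configured with a smaller `top_n`.

```python
def retrieve_then_rerank(ranked_ids, rerank_scores, retrieval_k, rerank_top_n):
    """Take the top retrieval_k candidates, then keep the rerank_top_n best by rerank score."""
    if retrieval_k <= rerank_top_n:
        raise ValueError("retrieval_k must exceed rerank_top_n for reranking to help")

    # Stage 1: the retriever's cut. Anything past retrieval_k is gone for good.
    candidates = ranked_ids[:retrieval_k]

    # Stage 2: the reranker reorders only those candidates.
    reranked = sorted(candidates, key=lambda doc_id: rerank_scores.get(doc_id, 0.0), reverse=True)
    return reranked[:rerank_top_n]


# Retriever order by embedding similarity; "e" is the truly relevant document
# but sits at position 5 in the retriever's ranking.
ranked_ids = ["a", "b", "c", "d", "e", "f"]
rerank_scores = {"a": 0.2, "b": 0.1, "c": 0.3, "d": 0.4, "e": 0.95, "f": 0.05}

narrow = retrieve_then_rerank(ranked_ids, rerank_scores, retrieval_k=3, rerank_top_n=2)
wide = retrieve_then_rerank(ranked_ids, rerank_scores, retrieval_k=6, rerank_top_n=2)
```

With `retrieval_k=3` the document `"e"` never reaches the reranker, so no reranker quality can bring it back; widening retrieval to 6 lets the reranker promote it to the top.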

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. A test passes when the mean score across rubric criteria is at least 4.0 and no individual criterion scores below 3.
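The pass rule can be written as a small check. This sketch assumes each rubric criterion receives one numeric score and reads the rule as "mean of the criterion scores at least 4.0, and every criterion at least 3". The function name `passes` and the example scores are hypothetical, and the scoring scale itself is not stated on this page.

```python
from statistics import mean


def passes(scores, mean_threshold=4.0, floor=3):
    """Return True if the mean score meets the threshold and no criterion is below the floor.

    scores: mapping of criterion name -> numeric score.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


example = {
    "LlamaIndex": 5,
    "AI Platform": 4,
    "Retrievers And Query Engines": 3,
}
verdict = passes(example)
```

Note that the floor matters independently of the mean: scores of 5, 5, and 2 average to 4.0 but still fail because one criterion is below 3.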

Rubric criteria

  • LlamaIndex
  • AI Platform
  • Retrievers And Query Engines

Recommended for

LlamaIndex (+ LlamaCloud) · LlamaIndex customers
