For LlamaIndex · AI Platform

Observability, Settings and Safety

LlamaIndex (+ LlamaCloud) · LlamaIndex

RAG / Data Framework — LlamaIndex

LlamaIndex evals — Observability, Settings & Safety (relift v3 InfraRed)

About LlamaIndex

LlamaIndex is a data framework for building RAG and agent applications over private data. It provides documents and nodes, indexes (such as VectorStoreIndex), retrievers and query engines, and the IngestionPipeline, along with LlamaParse and LlamaCloud for managed document parsing and retrieval.

Employees: ~50

Industry: RAG Framework

Headquarters: San Francisco, CA

Sample tests · showing 3 of 10

# · Input · Expected behavior · Check
01

A service sets Settings.llm globally to a powerful, expensive model. A high-volume summarization path then silently inherits it instead of using a cheaper per-call LLM, blowing the budget.

Understand Settings as global defaults that every unconfigured component inherits; override llm/embed_model locally on the components that need a different (cheaper/specialized) model. Audit which paths inherit the global LLM so cost-sensitive flows do not accidentally use the premium model.

Pass / Fail · AI Platform · high
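The inheritance problem in test 01 can be illustrated with a minimal sketch. It uses only the Python standard library and does not call LlamaIndex: `GlobalDefaults`, `Component`, and `audit_inherited` are hypothetical stand-ins for a global settings object, a configurable pipeline component, and an audit step, and the model names are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GlobalDefaults:
    """Hypothetical stand-in for a global settings object (not the LlamaIndex Settings class)."""
    llm: str


@dataclass
class Component:
    """Hypothetical pipeline component with an optional local LLM override."""
    name: str
    llm: Optional[str] = None  # None means "inherit the global default"

    def resolved_llm(self, defaults: GlobalDefaults) -> str:
        # A local override wins; otherwise fall back to the global default.
        return self.llm if self.llm is not None else defaults.llm


def audit_inherited(components: List[Component]) -> List[str]:
    """Return the names of components that fall back to the global LLM."""
    return [c.name for c in components if c.llm is None]


defaults = GlobalDefaults(llm="premium-model")  # placeholder model name
components = [
    Component("qa_query_engine"),
    Component("bulk_summarizer", llm="cheap-model"),  # cost-sensitive path, overridden locally
    Component("ticket_classifier"),
]

inheriting = audit_inherited(components)
for name in inheriting:
    print(f"{name} inherits the global LLM: {defaults.llm}")
```

An audit like this is most useful as a startup or CI check: any high-volume path that appears in `inheriting` is a candidate for an explicit local override.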
02

A RAG pipeline gives wrong answers and the team has no visibility into which nodes were retrieved or what prompt the LLM saw, because no instrumentation/callback handler is attached.

Attach instrumentation (the event/span API) or a CallbackManager / observability integration so retrieval, rerank, and LLM events are traced — capturing retrieved node ids/scores and the synthesized prompt — to debug grounding. Verify traces show the retrieve→synthesize path end to end.

Pass / Fail · AI Platform · medium
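The kind of trace that test 02 asks for can be sketched as follows. This is a toy, standard-library-only pipeline, not the LlamaIndex instrumentation or callback API: `Trace`, `retrieve`, and `synthesize` are hypothetical names, and the keyword-overlap scoring merely stands in for a real retriever. The point is what gets recorded: node ids with scores at retrieval time, and the exact prompt at synthesis time.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Trace:
    """Hypothetical in-memory event recorder."""
    events: List[Dict[str, Any]] = field(default_factory=list)

    def record(self, kind: str, **payload: Any) -> None:
        self.events.append({"kind": kind, **payload})


def retrieve(query: str, corpus: Dict[str, str], trace: Trace, top_k: int = 2) -> List[Tuple[str, int]]:
    """Toy retriever: score each node by the number of words it shares with the query."""
    q = set(query.lower().split())
    scored = [(node_id, len(q & set(text.lower().split()))) for node_id, text in corpus.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))  # highest score first, ties broken by id
    hits = scored[:top_k]
    trace.record("retrieve", query=query, nodes=hits)  # capture node ids and scores
    return hits


def synthesize(query: str, hits: List[Tuple[str, int]], corpus: Dict[str, str], trace: Trace) -> str:
    """Build the prompt and record it; a real pipeline would send it to an LLM."""
    context = "\n".join(corpus[node_id] for node_id, _ in hits)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    trace.record("synthesize", prompt=prompt)  # capture the exact prompt the LLM would see
    return prompt


corpus = {
    "n1": "refunds are processed within five days",
    "n2": "shipping takes two weeks",
    "n3": "refunds require a receipt",
}
trace = Trace()
hits = retrieve("how are refunds processed", corpus, trace)
prompt = synthesize("how are refunds processed", hits, corpus, trace)
kinds = [event["kind"] for event in trace.events]
```

Checking that `kinds` contains both stages in order is a simple way to verify that the retrieve-to-synthesize path is traced end to end.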
03

Support tickets containing SSNs and card numbers are indexed verbatim; the data then appears in retrieved context and in answers shown to other users.

Redact/mask PII before indexing (e.g. a PII postprocessor/transformation in the ingestion pipeline) and/or restrict retrieval via metadata filters, so sensitive fields do not enter the vector store or surface in answers. Verify redacted content cannot be retrieved.

Pass / Fail · AI Platform · critical
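Redaction before indexing, as described in test 03, might look like the sketch below. It is a standard-library illustration rather than a LlamaIndex transformation: `redact` is a hypothetical helper, and the two regular expressions are deliberately simple assumptions (US-style SSNs and 13 to 16 digit card numbers with optional spaces or dashes). Production redaction would need broader patterns or a dedicated PII detector.

```python
import re

# Illustrative patterns only; they will not catch every real-world format.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")  # 13-16 digits, optional space/dash separators


def redact(text: str) -> str:
    """Mask SSNs and card numbers so they never reach the index."""
    text = SSN_RE.sub("[REDACTED_SSN]", text)
    text = CARD_RE.sub("[REDACTED_CARD]", text)
    return text


tickets = [
    "My SSN is 123-45-6789 and card 4111 1111 1111 1111 was declined.",
    "Order 42 arrived late.",
]

# Redact first, then hand only the cleaned text to whatever builds the index.
clean = [redact(t) for t in tickets]

# Verification step: no stored text should still match a sensitive pattern.
leaks = [t for t in clean if SSN_RE.search(t) or CARD_RE.search(t)]
```

Running the same patterns over the stored text afterwards, as `leaks` does, gives a cheap regression check that redacted content cannot be retrieved.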

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
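One reading of this pass rule can be written as a short function. The sketch assumes numeric scores per rubric criterion on a 1 to 5 scale and takes the mean across criteria; both the scale and that interpretation are assumptions, and `passes` is a hypothetical helper, not part of any grading API.

```python
from statistics import mean
from typing import Dict


def passes(scores: Dict[str, float], min_mean: float = 4.0, min_each: float = 3) -> bool:
    """Pass when the mean score is at least min_mean and no criterion is below min_each."""
    values = list(scores.values())
    if not values:
        return False  # nothing graded, so nothing to pass
    return mean(values) >= min_mean and min(values) >= min_each


# Mean is 4.0 and the lowest score is 3, so both conditions hold.
ok = passes({"LlamaIndex": 5, "AI Platform": 4, "Observability, Settings and Safety": 3})

# Mean is still 4.0, but one criterion scores 2, so the floor is violated.
floor_violation = passes({"LlamaIndex": 5, "AI Platform": 5, "Observability, Settings and Safety": 2})
```

The per-criterion floor matters because a high mean alone can hide one badly failed criterion.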

Rubric criteria

  • LlamaIndex
  • AI Platform
  • Observability, Settings and Safety

Recommended for

LlamaIndex (+ LlamaCloud) · LlamaIndex customers
