Observability, Settings, and Safety
LlamaIndex evals — Observability, Settings & Safety (relift v3 InfraRed)
About LlamaIndex
LlamaIndex is a data framework for building RAG and agent applications over private data. Its core pieces are documents and nodes, indexes (such as VectorStoreIndex), retrievers and query engines, and the IngestionPipeline, alongside LlamaParse and LlamaCloud for managed document parsing and retrieval.
Sample tests (showing 3 of 10)
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | A service sets Settings.llm globally to a powerful, expensive model; a high-volume summarization path then silently inherits it instead of using a cheaper per-call LLM, blowing the budget. | Understand Settings as global defaults that every unconfigured component inherits; override llm/embed_model locally on the components that need a different (cheaper or specialized) model. Audit which paths inherit the global LLM so cost-sensitive flows do not accidentally use the premium model. | Pass / Fail · AI Platform · high |
| 02 | A RAG pipeline gives wrong answers and the team has no visibility into which nodes were retrieved or what prompt the LLM saw, because no instrumentation or callback handler is attached. | Attach instrumentation (the event/span API), a CallbackManager, or an observability integration so that retrieval, rerank, and LLM events are traced, capturing retrieved node IDs and scores and the synthesized prompt, to debug grounding. Verify traces show the retrieve→synthesize path end to end. | Pass / Fail · AI Platform · medium |
| 03 | Support tickets containing SSNs and card numbers are indexed verbatim; the data then appears in retrieved context and in answers shown to other users. | Redact or mask PII before indexing (e.g. a PII postprocessor or transformation in the ingestion pipeline) and/or restrict retrieval via metadata filters, so sensitive fields do not enter the vector store or surface in answers. Verify redacted content cannot be retrieved. | Pass / Fail · AI Platform · critical |
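The redaction step that test 03 expects can be sketched with the standard library alone. The following is a minimal illustration, not LlamaIndex's own PII tooling: the function names `redact_pii` and `contains_raw_pii`, the placeholder strings, and both regular expressions are assumptions chosen for the example, and the patterns only cover US-style SSNs and 13 to 16 digit card numbers.

```python
import re

# Assumed, deliberately narrow patterns; production redaction needs broader
# coverage (other ID formats, names, addresses) and ideally checksum validation.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact_pii(text: str) -> str:
    """Replace SSN-like and card-like substrings with fixed placeholders."""
    text = SSN_RE.sub("[REDACTED_SSN]", text)
    return CARD_RE.sub("[REDACTED_CARD]", text)


def contains_raw_pii(text: str) -> bool:
    """Return True if any SSN-like or card-like substring is still present."""
    return bool(SSN_RE.search(text) or CARD_RE.search(text))


ticket = "Customer SSN 123-45-6789, card 4111 1111 1111 1111 was declined."
clean = redact_pii(ticket)
print(clean)
```

In a real pipeline, a function like this would presumably run before the embedding step (for instance inside a custom ingestion transformation), so that only the redacted text reaches the vector store. A check such as `contains_raw_pii` over retrieved node text is one way to approximate the "verify redacted content cannot be retrieved" requirement.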
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
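One reading of the pass rule above can be written as a short check. This sketch assumes each rubric criterion receives a single numeric score (the scale is not stated here), that "mean" is taken across criteria, and that a score exactly equal to 3 is not "below 3". The function name `passes` and the criterion keys are hypothetical.

```python
from statistics import mean


def passes(scores: dict[str, float], min_mean: float = 4.0, floor: float = 3.0) -> bool:
    """Return True if the mean score meets min_mean and no score is below floor."""
    if not scores:
        # No scores means nothing was graded, so treat it as a failure.
        return False
    values = list(scores.values())
    return mean(values) >= min_mean and min(values) >= floor


# Hypothetical criterion names and scores, for illustration only.
example = {"criterion_a": 5, "criterion_b": 4, "criterion_c": 3}
print(passes(example))
```

Under this reading, a run with scores 5, 5, and 2 fails despite a mean of 4.0, because one criterion falls below the floor.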
Rubric criteria
- LlamaIndex
- AI Platform
- Observability, Settings, and Safety