For LlamaIndex · AI Platform

Observability Settings And Safety

LlamaIndex (+ LlamaCloud) · LlamaIndex

RAG / Data Framework — LlamaIndex

Evaluates LlamaIndex's Observability, Settings & Safety across 10 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's RAG / Data Framework eval coverage.

About LlamaIndex

LlamaIndex is a data framework for building RAG and agent applications over private data — documents/nodes, indexes (VectorStoreIndex), retrievers and query engines, the IngestionPipeline, plus LlamaParse and LlamaCloud for managed document parsing and retrieval.

Employees: ~50
Industry: RAG Framework
Headquarters: San Francisco, CA
Website: www.llamaindex.ai

Sample tests · showing 3 of 10

#	Input	Expected behavior	Check
01	A service sets Settings.llm to a powerful, expensive model globally, then a high-volume summarization path silently inherits it instead of using a cheaper per-call LLM, blowing the budget.	Understand Settings as global defaults that every unconfigured component inherits; override llm/embed_model locally on the components that need a different (cheaper/specialized) model. Audit which paths inherit the global LLM so cost-sensitive flows do not accidentally use the premium model.	Pass / Fail · AI Platform · high
02	A RAG pipeline gives wrong answers and the team has no visibility into which nodes were retrieved or what prompt the LLM saw, because no instrumentation/callback handler is attached.	Attach instrumentation (the event/span API) or a CallbackManager / observability integration so retrieval, rerank, and LLM events are traced (capturing retrieved node ids/scores and the synthesized prompt) to debug grounding. Verify traces show the retrieve→synthesize path end to end.	Pass / Fail · AI Platform · medium
03	The team estimates LLM cost by counting characters in answers rather than using the TokenCountingHandler / callback token usage, and their cost dashboard is off by a large factor.	Use LlamaIndex token-counting instrumentation (TokenCountingHandler or callback token-usage events) to attribute embedding, prompt, and completion tokens per query, rather than estimating from text length. Roll these up against a budget.	Pass / Fail · AI Platform · medium
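The budget roll-up in case 03 can be sketched without any framework code. The example below is a minimal illustration, not LlamaIndex's API: it assumes per-query embedding, prompt, and completion token counts have already been collected (for instance from a token-counting handler), and the `QueryUsage` type, the `rollup` helper, and the per-1K-token prices are all hypothetical. It also shows why estimating from answer length under-counts: the prompt, which carries the retrieved context, never appears in the answer text.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; real prices depend on model and provider.
PRICE_PER_1K = {"embedding": 0.0001, "prompt": 0.003, "completion": 0.015}


@dataclass
class QueryUsage:
    """Token counts for one query, as reported by token-counting instrumentation."""
    query_id: str
    embedding_tokens: int
    prompt_tokens: int
    completion_tokens: int


def query_cost(usage: QueryUsage) -> float:
    """Cost of one query in USD, attributing each token type at its own price."""
    return (
        usage.embedding_tokens * PRICE_PER_1K["embedding"]
        + usage.prompt_tokens * PRICE_PER_1K["prompt"]
        + usage.completion_tokens * PRICE_PER_1K["completion"]
    ) / 1000


def rollup(usages: list[QueryUsage], budget_usd: float) -> dict:
    """Sum per-query costs and compare the total against a budget."""
    total = sum(query_cost(u) for u in usages)
    return {
        "total_usd": total,
        "budget_usd": budget_usd,
        "over_budget": total > budget_usd,
    }


if __name__ == "__main__":
    usages = [
        # Large prompt (retrieved context) with a short answer: invisible to answer-length estimates.
        QueryUsage("q1", embedding_tokens=20, prompt_tokens=3500, completion_tokens=120),
        QueryUsage("q2", embedding_tokens=15, prompt_tokens=2800, completion_tokens=300),
    ]
    print(rollup(usages, budget_usd=0.02))
```

Keeping the three token types separate makes it possible to see whether retrieval context (prompt tokens) or generation (completion tokens) dominates spend on a given path.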

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
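One way to read this pass rule is sketched below. It assumes the judge assigns a numeric score to each rubric criterion, that "mean" refers to the average across a test case's criteria, and that both thresholds are inclusive; the function name `case_passes` and the criterion names in the usage lines are hypothetical, and the score scale itself is not stated in the source.

```python
from statistics import mean


def case_passes(criterion_scores: dict[str, float],
                mean_threshold: float = 4.0,
                floor: float = 3.0) -> bool:
    """Return True when the mean score is >= mean_threshold and no score is below floor."""
    if not criterion_scores:
        raise ValueError("at least one criterion score is required")
    scores = list(criterion_scores.values())
    return mean(scores) >= mean_threshold and min(scores) >= floor


if __name__ == "__main__":
    # Hypothetical criterion names, for illustration only.
    print(case_passes({"grounding": 5, "settings_scope": 4, "tracing": 3}))
    print(case_passes({"grounding": 5, "settings_scope": 5, "tracing": 2}))
```

The floor check matters because a high average can hide one badly failed criterion: the second call has the same mean as the first but fails on the floor.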

Rubric criteria

LlamaIndex
AI Platform
Observability Settings And Safety

Recommended for

LlamaIndex (+ LlamaCloud) · LlamaIndex customers

Works with

LlamaIndex

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the Observability Settings And Safety eval for LlamaIndex (+ LlamaCloud) test?

How is the Observability Settings And Safety eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Observability Settings And Safety pack for LlamaIndex (+ LlamaCloud) contains 10 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Observability Settings And Safety pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
