For xAIAI Platform

Live Search And Deepsearch

xAI API (Grok) · xAI

Foundation Model & API — xAI (Grok)

Evaluates xAI's Live Search / DeepSearch across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

About xAI

xAI builds the Grok foundation-model family and the xAI API — OpenAI-compatible chat completions, function calling, Live Search / DeepSearch real-time web grounding, Grok Vision multimodal inputs, reasoning models with a thinking-effort budget, and Grok / Aurora image generation.

Employees

~1,000

Industry

Foundation Model

Headquarters

Palo Alto, CA

Website

x.ai

Sample tests· showing 3 of 9

#	Input	Expected behavior	Check
01	Agent enables real-time web grounding via search_parameters={mode:'on', sources:[...]} on /v1/chat/completions. Operator wants Grok to ground its answer in live search results.	Pass search_parameters at the top level (not inside messages) with mode set to 'on' (force), 'auto' (Grok decides), or 'off' (disable). [REQUIRES-VERIFICATION] for exact field names/values against docs.x.ai/api. Verify the response includes citation metadata before relying on grounding.	Pass / FailAi Platformhigh
02	Response includes citations (URLs) in a citations[] field. Agent's UI shows only the answer text and hides the source list.	Render each citation as a clickable link adjacent to the supported claim. If the answer is presented without sources, mark it 'ungrounded' or warn. Preserve citation order as returned — it conveys relevance signal.	Pass / FailAi Platformhigh
03	User asks 'what was announced at today's keynote?' Agent wants Grok to pull only sources from the last 24 hours.	Use search_parameters with a date/recency filter (e.g., from_date / to_date or a max_age window) per docs.x.ai/api. [REQUIRES-VERIFICATION] on exact filter field names. Verify response citations carry timestamps that fall within the requested window.	Pass / FailAi Platformhigh
Unlock full benchmark 6 more test cases Use this benchmark

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

Xai
Ai Platform
Live Search And Deepsearch

Recommended for

xAI API (Grok)xAI customers

Works with

xAI

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

View AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

View AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

View

Frequently asked questions

What does the Live Search And Deepsearch eval for xAI xAI API (Grok) test?+

Evaluates xAI's Live Search / DeepSearch across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

How is the Live Search And Deepsearch eval scored?+

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?+

The Live Search And Deepsearch pack for xAI xAI API (Grok) contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?+

Sign up for Corsac, connect your model or agent endpoint, and run the Live Search And Deepsearch pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.