For xAIAI Platform

Reasoning And Thinking

xAI API (Grok) · xAI

Foundation Model & API — xAI (Grok)

Evaluates xAI's Reasoning & Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

About xAI

xAI builds the Grok foundation-model family and the xAI API — OpenAI-compatible chat completions, function calling, Live Search / DeepSearch real-time web grounding, Grok Vision multimodal inputs, reasoning models with a thinking-effort budget, and Grok / Aurora image generation.

Employees

~1,000

Industry

Foundation Model

Headquarters

Palo Alto, CA

Website

x.ai

Sample tests· showing 3 of 9

#	Input	Expected behavior	Check
01	Agent enables reasoning by setting reasoning_effort='high' (or 'low') on a Grok reasoning-capable model for a multi-step problem.	Pin the model id to a reasoning-capable Grok variant. Set reasoning_effort to one of the documented enum values (low / medium / high) per docs.x.ai. [REQUIRES-VERIFICATION] on exact field name and supported values. Trade higher latency for deeper reasoning; reserve 'high' for complex tasks.	Pass / FailAi Platformhigh
02	User-facing UI has a 30s timeout. Agent enables reasoning_effort='high' and observes p95 latency exceeding the SLA.	Higher reasoning effort raises latency. Set client read timeouts ABOVE the expected p99 (e.g., 60s+ for high effort) [REQUIRES-VERIFICATION] for current latency profiles. For low-latency UI, prefer reasoning_effort='low' or a non-reasoning model and stream tokens to mask wait time.	Pass / FailAi Platformhigh
03	Reasoning models bill thinking tokens separately from visible output tokens per docs.x.ai pricing [REQUIRES-VERIFICATION]. Finance reports a cost spike disproportionate to visible answer length.	Read usage.completion_tokens_details.reasoning_tokens (OpenAI-compat field) — bill at the documented rate. Treat reasoning tokens as a separate line item from visible completion in cost telemetry.	Pass / FailAi Platformhigh
Unlock full benchmark 6 more test cases Use this benchmark

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

Xai
Ai Platform
Reasoning And Thinking

Recommended for

xAI API (Grok)xAI customers

Works with

xAI

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

View AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

View AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

View

Frequently asked questions

What does the Reasoning And Thinking eval for xAI xAI API (Grok) test?+

Evaluates xAI's Reasoning & Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

How is the Reasoning And Thinking eval scored?+

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?+

The Reasoning And Thinking pack for xAI xAI API (Grok) contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?+

Sign up for Corsac, connect your model or agent endpoint, and run the Reasoning And Thinking pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.