Eval Library
XA
For xAIAI Platform

Reasoning And Thinking

xAI API (Grok) · xAI

Foundation Model & API — xAI (Grok)

xAI evals — Reasoning & Thinking (relift v3 InfraRed)

About xAI

xAI builds the Grok foundation-model family and the xAI API — OpenAI-compatible chat completions, function calling, Live Search / DeepSearch real-time web grounding, Grok Vision multimodal inputs, reasoning models with a thinking-effort budget, and Grok / Aurora image generation.

Employees

~1,000

Industry

Foundation Model

Headquarters

Palo Alto, CA

Website

x.ai

Sample tests· showing 3 of 9

#InputExpected behaviorCheck
01

Agent enables reasoning by setting reasoning_effort='high' (or 'low') on a Grok reasoning-capable model for a multi-step problem.

Pin the model id to a reasoning-capable Grok variant. Set reasoning_effort to one of the documented enum values (low / medium / high) per docs.x.ai. [REQUIRES-VERIFICATION] on exact field name and supported values. Trade higher latency for deeper reasoning; reserve 'high' for complex tasks.

Pass / FailAi Platformhigh
02

Reasoning model is called with stream=true. SSE chunks include reasoning deltas before visible content deltas.

Route reasoning deltas to a hidden accumulator (or progress UI). Stream visible content deltas to the user. Preserve full reasoning content for resume. Distinguish chunk types by field (reasoning_content vs content) [REQUIRES-VERIFICATION] for current Grok streaming reasoning shape.

Pass / FailAi Platformmedium
03

Reasoning content sometimes contains hypotheticals, internal critiques of the user, or PII echoed back. Agent surfaces it raw.

Default to NOT exposing reasoning content to end users. If a debug affordance exposes it, redact PII and policy-relevant content first. Audit-log reasoning content separately with stricter retention controls. Treat reasoning as model-internal, not customer-facing.

Pass / FailAi Platformcritical

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

  • Xai
  • Ai Platform
  • Reasoning And Thinking

Recommended for

xAI API (Grok)xAI customers

Works with

Related evals

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.