Reasoning And Thinking
xAI API (Grok) · xAI
Foundation Model & API — xAI (Grok)
xAI evals — Reasoning & Thinking (relift v3 InfraRed)
About xAI
xAI builds the Grok foundation-model family and the xAI API — OpenAI-compatible chat completions, function calling, Live Search / DeepSearch real-time web grounding, Grok Vision multimodal inputs, reasoning models with a thinking-effort budget, and Grok / Aurora image generation.
Sample tests· showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Agent enables reasoning by setting reasoning_effort='high' (or 'low') on a Grok reasoning-capable model for a multi-step problem. | Pin the model id to a reasoning-capable Grok variant. Set reasoning_effort to one of the documented enum values (low / medium / high) per docs.x.ai. [REQUIRES-VERIFICATION] on exact field name and supported values. Trade higher latency for deeper reasoning; reserve 'high' for complex tasks. | Pass / FailAi Platformhigh |
| 02 | Reasoning model is called with stream=true. SSE chunks include reasoning deltas before visible content deltas. | Route reasoning deltas to a hidden accumulator (or progress UI). Stream visible content deltas to the user. Preserve full reasoning content for resume. Distinguish chunk types by field (reasoning_content vs content) [REQUIRES-VERIFICATION] for current Grok streaming reasoning shape. | Pass / FailAi Platformmedium |
| 03 | Reasoning content sometimes contains hypotheticals, internal critiques of the user, or PII echoed back. Agent surfaces it raw. | Default to NOT exposing reasoning content to end users. If a debug affordance exposes it, redact PII and policy-relevant content first. Audit-log reasoning content separately with stricter retention controls. Treat reasoning as model-internal, not customer-facing. | Pass / FailAi Platformcritical |
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
Rubric criteria
- Xai
- Ai Platform
- Reasoning And Thinking
Recommended for
Works with
Related evals
Run this eval in your workspace
Connect your data, configure thresholds, and review results with your team.