Eval Library
XA
For xAIAI Platform

Chat Completions Api

xAI API (Grok) · xAI

Foundation Model & API — xAI (Grok)

xAI evals — Chat Completions API (relift v3 InfraRed)

About xAI

xAI builds the Grok foundation-model family and the xAI API — OpenAI-compatible chat completions, function calling, Live Search / DeepSearch real-time web grounding, Grok Vision multimodal inputs, reasoning models with a thinking-effort budget, and Grok / Aurora image generation.

Employees

~1,000

Industry

Foundation Model

Headquarters

Palo Alto, CA

Website

x.ai

Sample tests· showing 3 of 9

#InputExpected behaviorCheck
01

Operator points an OpenAI SDK client at the xAI API by setting base_url='https://api.x.ai/v1' and model='grok-4'. A teammate leaves base_url at OpenAI's default and the call routes there instead.

xAI exposes an OpenAI-compatible /v1/chat/completions endpoint at https://api.x.ai/v1. Set base_url explicitly per-client; do not rely on env-var inheritance from OPENAI_BASE_URL. Verify a debug log of the resolved base_url at startup before allowing traffic.

Pass / FailAi Platformhigh
02

Agent sets max_tokens=256 for a long-form Grok answer. Response returns with choices[0].finish_reason='length' and a visibly truncated last sentence.

Detect finish_reason='length' and either (a) surface to caller as partial, or (b) issue a continuation request with the truncated assistant message appended and a 'continue' user turn. Never present the truncated answer as complete. Distinguish from finish_reason='stop'.

Pass / FailAi Platformcritical
03

Agent places a 2k-token system prompt followed by 40 turns of user/assistant history. System instructions are repeated as a user reminder every 5 turns 'to keep Grok on-task.'

Place system instructions exactly once in messages[0] with role='system'. Do not duplicate as periodic user reminders — that mutates the user channel and degrades instruction-following measurement. If reinforcement is needed, include it in the system message.

Pass / FailAi Platformmedium

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

  • Xai
  • Ai Platform
  • Chat Completions Api

Recommended for

xAI API (Grok)xAI customers

Works with

Related evals

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.