
Chat Completions API

xAI API (Grok) · xAI

Foundation Model & API — xAI (Grok)

Evaluates xAI's Chat Completions API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

About xAI

xAI builds the Grok foundation-model family and the xAI API — OpenAI-compatible chat completions, function calling, Live Search / DeepSearch real-time web grounding, Grok Vision multimodal inputs, reasoning models with a thinking-effort budget, and Grok / Aurora image generation.

Employees

~1,000

Industry

Foundation Model

Headquarters

Palo Alto, CA

Website

x.ai

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Operator points an OpenAI SDK client at the xAI API by setting base_url='https://api.x.ai/v1' and model='grok-4'. A teammate leaves base_url at OpenAI's default and the call routes there instead.	xAI exposes an OpenAI-compatible /v1/chat/completions endpoint at https://api.x.ai/v1. Set base_url explicitly per client; do not rely on env-var inheritance from OPENAI_BASE_URL. Log the resolved base_url at startup and verify it before allowing traffic.	Pass / Fail · AI Platform · high
02	Agent posts to /v1/chat/completions with messages=[{role:'system',...}, {role:'user',...}, {role:'assistant',...}, {role:'user',...}] for a multi-turn Grok session.	Honor the OpenAI-compatible roles: system, user, assistant, tool. system may appear once (typically first) for instructions. Maintain user/assistant alternation after system. Do not stuff system content into a user turn; that changes how Grok weighs instructions vs. user input.	Pass / Fail · AI Platform · high
03	Client calls /v1/chat/completions with stream=true. SSE chunks arrive as data: {choices:[{delta:{content:'...'}}]} lines terminated by data: [DONE].	Parse the SSE stream event by event: split on '\n\n', strip the 'data: ' prefix, JSON-decode each chunk, accumulate choices[0].delta.content, and stop on the literal string '[DONE]'. Do not feed '[DONE]' to JSON.parse.	Pass / Fail · AI Platform · high
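The expected behavior in cases 01 and 02 can be approximated by two pre-flight checks run before any request is sent. This is a minimal sketch, not part of the eval pack: `check_base_url` and `check_messages` are hypothetical helper names, and the rule that system must be the first message is a stricter assumption than the "typically first" wording in case 02.

```python
from urllib.parse import urlparse

XAI_BASE_URL = "https://api.x.ai/v1"  # endpoint named in case 01
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}  # roles named in case 02


def check_base_url(resolved_base_url: str) -> str:
    """Return the URL if its host matches the xAI endpoint, else raise.

    Hypothetical startup guard: call it with the base URL the client
    actually resolved, not the one you intended to configure.
    """
    host = urlparse(resolved_base_url).hostname
    expected = urlparse(XAI_BASE_URL).hostname
    if host != expected:
        raise ValueError(f"base_url resolves to {host!r}, expected {expected!r}")
    return resolved_base_url


def check_messages(messages: list[dict]) -> None:
    """Raise ValueError on unknown roles, a misplaced system message,
    or two consecutive user (or assistant) turns."""
    previous = None
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            raise ValueError(f"message {index}: unknown role {role!r}")
        # Assumption: system is allowed only as the very first message.
        if role == "system" and index != 0:
            raise ValueError(f"message {index}: system must come first")
        if role in ("user", "assistant") and role == previous:
            raise ValueError(f"message {index}: consecutive {role} turns")
        previous = role
```

Checking the hostname rather than the full string tolerates a trailing slash or a different path, at the cost of not catching a wrong API version in the path.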
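The parsing steps in case 03 can be sketched with the standard library alone. The helper names `iter_sse_payloads` and `collect_content` are illustrative, and the sketch assumes events are separated by '\n\n' exactly as the case describes; a production client would also need to handle '\r\n' line endings and network errors.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the text after 'data: ' for each complete event in the stream."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # An event ends at a blank line; a partial event stays buffered
        # until the rest of it arrives in a later chunk.
        while "\n\n" in buffer:
            event, buffer = buffer.split("\n\n", 1)
            for line in event.split("\n"):
                if line.startswith("data: "):
                    yield line[len("data: "):]


def collect_content(chunks: Iterable[str]) -> str:
    """Accumulate choices[0].delta.content until the [DONE] sentinel."""
    parts = []
    for payload in iter_sse_payloads(chunks):
        if payload == "[DONE]":  # literal sentinel, never JSON-decoded
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue  # skip chunks that carry no choices
        delta = choices[0].get("delta") or {}
        parts.append(delta.get("content") or "")
    return "".join(parts)


# Usage: the second event is deliberately split across two network chunks.
stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\nda',
    'ta: {"choices":[{"delta":{"content":"lo"}}]}\n\n',
    "data: [DONE]\n\n",
]
print(collect_content(stream))  # prints "Hello"
```

Buffering until a full '\n\n' boundary is seen matters because network chunk boundaries do not have to line up with event boundaries.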

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
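One reading of the pass rule above, sketched as a small function. The criterion names are hypothetical, the score scale is not stated on this page, and treating an empty score set as a failure is an assumption.

```python
from statistics import mean


def passes(scores: dict[str, float],
           mean_threshold: float = 4.0,
           floor: float = 3.0) -> bool:
    """Return True when the mean score is at least mean_threshold
    and no single criterion scores below floor."""
    if not scores:
        return False  # assumption: nothing graded means no pass
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Hypothetical criterion names, for illustration only.
print(passes({"correctness": 5, "completeness": 4, "safety": 3}))  # True
print(passes({"correctness": 5, "completeness": 5, "safety": 2}))  # False
```

The second call fails even though its mean is 4.0, because one criterion falls below the floor of 3.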

Rubric criteria

xAI
AI Platform
Chat Completions API

Recommended for

xAI API (Grok) · xAI customers

Works with

xAI

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.


Frequently asked questions

What does the Chat Completions API eval for the xAI API (Grok) test?

Evaluates xAI's Chat Completions API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

How is the Chat Completions API eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Chat Completions API pack for the xAI API (Grok) contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Chat Completions API pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.