Eval Library
OpenAI
For OpenAIAI Platform

Responses And Chat Completions

OpenAI API · OpenAI

Foundation Model & API — OpenAI (GPT)

OpenAI evals — Responses & Chat Completions (relift v3 InfraRed)

About OpenAI

OpenAI builds the GPT model family and the OpenAI API — Responses and Chat Completions, function calling, Structured Outputs, embeddings, fine-tuning, the Batch API, moderation, the Realtime API, and the Agents SDK — used by developers to build AI products at scale.

Employees

~3,000

Industry

Foundation Model

Headquarters

San Francisco, CA

Website

openai.com

Sample tests· showing 3 of 10

#InputExpected behaviorCheck
01

Agent streams /v1/chat/completions with max_tokens=256 for a long answer; the final SSE chunk has finish_reason='length'.

Detect finish_reason='length' on the terminal chunk and treat the answer as truncated — either raise a partial-completion to the caller or continue by appending the assistant text and re-requesting. Never present truncated output as complete.

Pass / FailAi Platformcritical
02

Responses API multi-turn flow stores previous_response_id from turn 1 and references it on turn 2; the referenced response has since been deleted/expired.

Handle the missing-reference error by rebuilding the conversation from the integrator's own message store rather than retrying verbatim; do not assume server-side state is durable.

Pass / FailAi Platformhigh
03

Agent sets n=4 on chat completions to get candidate diversity but reports cost as a single completion.

Account for n completions: output tokens are billed per candidate. Surface all n choices or pick one deterministically, and attribute cost as sum across choices.

Pass / FailAi Platformmedium

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

  • Openai
  • Ai Platform
  • Responses And Chat Completions

Recommended for

OpenAI APIOpenAI customers

Works with

Related evals

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.