For OpenAI · AI Platform · Answer Relevance

Embeddings And Retrieval

OpenAI API · OpenAI

Foundation Model & API — OpenAI (GPT)

Evaluates OpenAI's Embeddings & Retrieval across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

About OpenAI

OpenAI builds the GPT model family and the OpenAI API — Responses and Chat Completions, function calling, Structured Outputs, embeddings, fine-tuning, the Batch API, moderation, the Realtime API, and the Agents SDK — used by developers to build AI products at scale.

Employees

~3,000

Industry

Foundation Model

Headquarters

San Francisco, CA

Website

openai.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check	Category	Severity
01	Team uses text-embedding-3-large but sets dimensions=256 to save vector-store cost, then compares to vectors stored at full dimension.	All vectors in an index must share the same model and dimensions; re-embed the whole corpus when changing dimensions. Mixing dimensions makes cosine similarity meaningless.	Pass / Fail	AI Platform	critical
02	Retrieval ranks by dot product over un-normalized embeddings and gets inconsistent ordering.	OpenAI embeddings are normalized to length 1, so cosine == dot product; if the vector store re-scales or you mix sources, normalize consistently. Verify the distance metric matches the store config.	Pass / Fail	AI Platform	medium
03	Agent embeds 50k chunks one HTTP request at a time and hits rate limits.	Batch multiple inputs per /v1/embeddings call (array input) and/or use the Batch API for large backfills; respect token-per-request limits. Map outputs back to inputs by index order, which is preserved.	Pass / Fail	AI Platform	medium
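
The three sample cases share a few mechanics that can be sketched without touching the network. The block below is a minimal illustration, not Corsac's or OpenAI's code: `l2_normalize`, `cosine_similarity`, `batched`, and `embed_in_batches` are hypothetical helper names, and `embed_fn` is an assumed callable that takes a list of strings and returns one vector per string in the same order (in practice it might wrap a single array-input call to /v1/embeddings).

```python
import math
from typing import Callable, Iterator, List, Sequence, Tuple

Vector = List[float]


def l2_normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit length, so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity that refuses to compare vectors of different sizes."""
    if len(a) != len(b):
        # Case 01: a 256-dimension query cannot be scored against
        # full-dimension stored vectors; the corpus must be re-embedded.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    # Case 02: normalize both sides so the result does not depend on
    # whether the store or another source re-scaled the vectors.
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


def batched(items: Sequence[str], size: int) -> Iterator[Tuple[int, Sequence[str]]]:
    """Yield (start_index, chunk) pairs covering items in order."""
    if size <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], Sequence[Sequence[float]]],
    batch_size: int = 100,  # assumed value; tune to the token-per-request limit
) -> List[Vector]:
    """Case 03: send many inputs per request and map outputs back by index."""
    results: List[Vector] = [[] for _ in texts]
    for start, chunk in batched(texts, batch_size):
        vectors = embed_fn(chunk)
        if len(vectors) != len(chunk):
            raise ValueError("embed_fn returned a different number of vectors")
        for offset, vec in enumerate(vectors):
            results[start + offset] = list(vec)
    return results
```

The dimension check raises rather than padding or truncating, since the expected behavior in case 01 is to re-embed the corpus, not to coerce vectors into a shared length.
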

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
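
One possible reading of the pass rule above, as a rough sketch: a case passes when the mean of its criterion scores is at least 4.0 and no single criterion scores below 3. The function name `case_passes`, the criterion names, and the 1 to 5 score scale are assumptions for illustration; the page does not state them.

```python
from statistics import fmean
from typing import Mapping


def case_passes(
    scores: Mapping[str, float],
    mean_threshold: float = 4.0,  # "mean >= 4.0" from the grading rule
    floor: float = 3.0,           # "no criterion below 3"
) -> bool:
    """Return True when the scores satisfy both the mean and the floor."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return fmean(values) >= mean_threshold and min(values) >= floor


# Hypothetical criterion names and scores, assuming a 1-5 scale.
example = {"correctness": 5, "completeness": 4, "safety": 3}
print(case_passes(example))
```

Under this reading a single low criterion fails the case even when the mean clears the threshold, which is why the floor is checked separately.
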

Rubric criteria

OpenAI
AI Platform
Embeddings And Retrieval

Recommended for

OpenAI API · OpenAI customers

Works with

OpenAI

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.


Frequently asked questions

What does the Embeddings And Retrieval eval for the OpenAI API test?

Evaluates OpenAI's Embeddings & Retrieval across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

How is the Embeddings And Retrieval eval scored?

The LLM judge grades each case against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Embeddings And Retrieval pack for the OpenAI API contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Embeddings And Retrieval pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
