Reasoning Model Deepseek Reasoner

DeepSeek API · DeepSeek

Foundation Model & API — DeepSeek

Evaluates DeepSeek's Reasoning Model (deepseek-reasoner) across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

About DeepSeek

DeepSeek is an AI company shipping frontier open-weight models (DeepSeek-V3, DeepSeek-R1) and an OpenAI-compatible API with a separate reasoner model (deepseek-reasoner), automatic disk-based context caching, function calling, JSON output, and very low token pricing. The models are released under an MIT license alongside the hosted API.

Employees

~200

Industry

Foundation Model

Headquarters

Hangzhou, China

Website

www.deepseek.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check	Category	Severity
01	A UI renders deepseek-reasoner output by concatenating reasoning_content and content into one block shown to the end user.	Render content as the user-facing answer and treat reasoning_content (the chain-of-thought) as separate; surface it only behind an optional 'show reasoning' affordance. Do not merge the two fields into the displayed answer.	Pass / Fail	AI Platform	high
02	For multi-turn deepseek-reasoner chat, the client appends the full prior assistant message, including reasoning_content, into messages[] for the next request.	When constructing the next request, include only the prior assistant content (the final answer) and strip reasoning_content from the message before resending. Per docs, reasoning_content must not be passed back into the messages history.	Pass / Fail	AI Platform	critical
03	deepseek-reasoner is streamed. Chunks first carry delta.reasoning_content, then later carry delta.content. The client appends every delta into one buffer.	Accumulate delta.reasoning_content into a reasoning buffer and delta.content into the answer buffer separately, switching as the stream transitions. Keep the two streams distinct so the user-visible answer excludes the chain-of-thought (CoT).	Pass / Fail	AI Platform	high
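The expected behaviors in the three sample cases can be sketched with a few small helpers. This is a minimal illustration, not DeepSeek's or Corsac's reference code: the function names (strip_reasoning, split_stream, render) are hypothetical, and messages and stream deltas are modeled as plain dicts carrying the reasoning_content and content fields named in the table, whereas an SDK client would typically expose them as object attributes.

```python
from typing import Any, Dict, Iterable, List, Tuple


def strip_reasoning(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Case 02: return a copy of the history without reasoning_content.

    The input list is left untouched so the caller can still show the
    stored reasoning in its own UI.
    """
    return [
        {key: value for key, value in message.items() if key != "reasoning_content"}
        for message in messages
    ]


def split_stream(deltas: Iterable[Dict[str, Any]]) -> Tuple[str, str]:
    """Case 03: accumulate streamed deltas into separate buffers.

    Returns (reasoning, answer). Either field may be missing or None
    on any given chunk, so both are checked independently.
    """
    reasoning_parts: List[str] = []
    answer_parts: List[str] = []
    for delta in deltas:
        if delta.get("reasoning_content"):
            reasoning_parts.append(delta["reasoning_content"])
        if delta.get("content"):
            answer_parts.append(delta["content"])
    return "".join(reasoning_parts), "".join(answer_parts)


def render(answer: str, reasoning: str, show_reasoning: bool = False) -> str:
    """Case 01: show only the answer unless the user opts in to reasoning."""
    if show_reasoning and reasoning:
        return f"[reasoning]\n{reasoning}\n\n[answer]\n{answer}"
    return answer
```

In use, a client would call split_stream while consuming the stream, store both buffers on the assistant message, pass the history through strip_reasoning before the next request, and call render with show_reasoning left off by default.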

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
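As a rough sketch of that threshold rule, the check below passes a test case only when the mean of its criterion scores is at least 4.0 and no single criterion falls below 3. Several details are assumptions, since the page does not state them: that the mean is taken across the criteria of one test case, that scores are numeric (a 1 to 5 scale seems likely), and the criterion names in the usage lines, which are invented for illustration.

```python
from statistics import mean
from typing import Mapping


def case_passes(
    scores: Mapping[str, float],
    mean_threshold: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Return True when the judge's criterion scores clear both bars.

    scores maps a criterion name to the judge's score for one test case.
    An empty mapping is treated as a failure rather than a vacuous pass.
    """
    if not scores:
        return False
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Hypothetical criterion names, for illustration only.
borderline = case_passes({"separates_fields": 5, "strips_history": 3})
low_criterion = case_passes({"separates_fields": 5, "strips_history": 5, "streaming": 2})
```

Here borderline clears both bars (mean 4.0, minimum 3), while low_criterion fails on the floor even though its mean is also 4.0.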

Rubric criteria

DeepSeek
AI Platform
Reasoning Model Deepseek Reasoner

Recommended for

DeepSeek API, DeepSeek customers

Works with

DeepSeek

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the Reasoning Model Deepseek Reasoner eval for the DeepSeek API test?

How is the Reasoning Model Deepseek Reasoner eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Reasoning Model Deepseek Reasoner pack for the DeepSeek API contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Reasoning Model Deepseek Reasoner pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
