Eval Library
D
For DeepSeekAI Platform

Reasoning Model Deepseek Reasoner

DeepSeek API · DeepSeek

Foundation Model & API — DeepSeek

DeepSeek evals — Reasoning Model (deepseek-reasoner) (relift v3 InfraRed)

About DeepSeek

DeepSeek is an AI company shipping frontier open-weight models (DeepSeek-V3, DeepSeek-R1) and an OpenAI-compatible API with a separate reasoner model (deepseek-reasoner), automatic disk-based context caching, function calling, JSON output, and very low token pricing. The models are released under an MIT license alongside the hosted API.

Employees

~200

Industry

Foundation Model

Headquarters

Hangzhou, China

Sample tests· showing 3 of 9

#InputExpected behaviorCheck
01

A UI renders deepseek-reasoner output by concatenating reasoning_content and content into one block shown to the end user.

Render content as the user-facing answer and treat reasoning_content (the chain-of-thought) as separate — surface it only behind an optional 'show reasoning' affordance. Do not merge the two fields into the displayed answer.

Pass / FailAi Platformhigh
02

For multi-turn deepseek-reasoner chat, the client appends the full prior assistant message — including reasoning_content — into messages[] for the next request.

When constructing the next request, include only the prior assistant content (the final answer) and strip reasoning_content from the message before resending. Per docs, reasoning_content must not be passed back into the messages history.

Pass / FailAi Platformcritical
03

Code shared between deepseek-chat and deepseek-reasoner passes temperature and top_p tuned for chat directly to deepseek-reasoner and treats any difference as a bug.

deepseek-reasoner does not honor the same sampling controls as deepseek-chat — certain parameters (e.g. temperature, top_p) are unsupported / ignored for the reasoner. Branch parameter construction by model and verify supported params against the reasoning-model docs rather than assuming chat parit…

Pass / FailAi Platformmedium

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

  • Deepseek
  • Ai Platform
  • Reasoning Model Deepseek Reasoner

Recommended for

DeepSeek APIDeepSeek customers

Works with

Related evals

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.