Eval Library
C
For CohereAI Platform

Chat Api And Streaming

Cohere API · Cohere

Foundation Model & API — Cohere

Cohere evals — Chat API & Streaming (relift v3 InfraRed)

About Cohere

Cohere builds enterprise foundation models and the tools around them — the Command model family, best-in-class Rerank and Embed endpoints, and grounded retrieval-augmented generation with inline citations — deployable across major clouds and private VPCs.

Employees

~400

Industry

Foundation Model

Headquarters

Toronto, Canada

Website

cohere.com

Sample tests· showing 3 of 10

#InputExpected behaviorCheck
01

Agent builds a /v2/chat request with messages[] where it places a system instruction as a trailing message with role='user' instead of role='system', expecting it to behave like a preamble.

Use the documented v2 message roles: a message with role='system' for instructions/preamble, then alternating role='user'/role='assistant' turns. Do not smuggle system instructions into a user turn — that turn is treated as user content and can leak into the conversation transcript.

Pass / FailAi Platformhigh
02

A /v2/chat response ends with finish_reason='MAX_TOKENS' and an obviously truncated final sentence. The agent presents it as a complete answer.

Inspect finish_reason on message-end (e.g., COMPLETE, MAX_TOKENS, STOP_SEQUENCE, TOOL_CALL). On MAX_TOKENS, mark the output partial and either raise to the caller or continue by appending the truncated assistant turn — never surface a truncated answer as complete.

Pass / FailAi Platformcritical
03

A /v2/chat response carries a usage object (billed_units / tokens with input and output counts). The agent estimates cost from response string length instead of reading usage.

Read input and output token counts from the response usage object for cost telemetry. Do not estimate tokens from character length. Aggregate usage from message-end when streaming.

Pass / FailAi Platformmedium

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

  • Cohere
  • Ai Platform
  • Chat Api And Streaming

Recommended for

Cohere APICohere customers

Works with

Related evals

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.