
Batch Transcription Universal 2

AssemblyAI (Universal-2 + LeMUR) · AssemblyAI

Speech AI Platform — AssemblyAI

Evaluates AssemblyAI's Batch Transcription (Universal-2) across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Speech AI Platform eval coverage.

About AssemblyAI

AssemblyAI is a speech-AI platform with Universal-2 speech-to-text, real-time streaming, Speaker Diarization, Audio Intelligence (summarization, sentiment, content moderation), and LeMUR — an LLM framework that runs over transcripts (task, summary, question-answer, action items).

Employees

~150

Industry

Speech AI

Headquarters

San Francisco, CA

Website

www.assemblyai.com

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check	Category	Severity
01	Agent submits POST /v2/transcript with audio_url pointing to a presigned S3 URL that expires in 60 seconds. Universal-2 processing starts 3 minutes later.	Either (a) upload the audio bytes via POST /v2/upload first and use the returned upload_url as audio_url (single-use; AssemblyAI fetches synchronously before returning), or (b) ensure audio_url remains fetchable for the full queue+processing window. Do not assume Universal-2 dereferences audio_url …	Pass / Fail	AI Platform	high
02	Agent submits a noisy call-center recording and sets speech_model='nano' to save cost. WER on the transcript is markedly worse than expected.	speech_model='universal' is the Universal-2 default with the best accuracy envelope; 'nano' trades accuracy for cost/latency. For call-center or domain-critical workloads default to 'universal'. Document the model choice per workload and reconcile WER expectations against the chosen tier [REQUIRES-…	Pass / Fail	AI Platform	medium
03	Agent ingests a mixed multilingual corpus and sets language_detection=true on every job to avoid having to pre-classify language.	Use language_detection=true when language is unknown; use language_code (e.g., 'es', 'fr') when known to skip detection cost/latency and avoid mislabeling. Read response.language_code to confirm what was used. Do not pass both with expectation that language_code wins silently.	Pass / Fail	AI Platform	medium
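The three sample cases share one request pattern that can be sketched in Python. This is a minimal sketch, not AssemblyAI's reference client: the base URL, the authorization header, and the upload_url response field are assumptions based on the endpoints named in the table, and the helper names (build_transcript_payload, upload_audio, submit_transcript) are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://api.assemblyai.com"  # assumed base URL


def build_transcript_payload(audio_url, language_code=None, speech_model="universal"):
    """Build a JSON body for POST /v2/transcript (hypothetical helper)."""
    payload = {"audio_url": audio_url, "speech_model": speech_model}
    if language_code is not None:
        # Language is known: pin it and leave detection off (case 03).
        payload["language_code"] = language_code
    else:
        # Language is unknown: request detection instead of guessing.
        payload["language_detection"] = True
    return payload


def upload_audio(path, api_key):
    """Send raw audio bytes to POST /v2/upload and return the upload_url (case 01)."""
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        f"{API_BASE}/v2/upload",
        data=data,
        headers={"authorization": api_key, "content-type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["upload_url"]  # field name assumed from the table


def submit_transcript(payload, api_key):
    """Submit the job to POST /v2/transcript and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{API_BASE}/v2/transcript",
        data=json.dumps(payload).encode("utf-8"),
        headers={"authorization": api_key, "content-type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Uploading first avoids depending on a short-lived presigned URL (case 01). The payload builder defaults to 'universal' (case 02) and sets exactly one of language_code or language_detection, never both (case 03).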

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
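One reading of this rule, sketched in Python: a case passes when the mean of its per-criterion scores is at least 4.0 and no single criterion scores below 3. The scores mapping, the criterion names, and the function name case_passes are hypothetical, and a 1-5 scoring scale is assumed.

```python
from statistics import mean


def case_passes(scores, mean_threshold=4.0, floor=3):
    """Return True when the mean score meets the threshold and no criterion is below the floor.

    `scores` maps a criterion name to the judge's numeric score (assumed 1-5 scale).
    """
    if not scores:
        # No criteria scored: treat as a failure rather than a vacuous pass.
        return False
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# A strong mean cannot compensate for one criterion below the floor.
print(case_passes({"c1": 5, "c2": 4, "c3": 3}))  # mean 4.0, minimum 3
print(case_passes({"c1": 5, "c2": 5, "c3": 2}))  # mean 4.0, minimum 2
```

The floor check matters because the mean alone would let a single badly failed criterion hide behind high scores elsewhere.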

Rubric criteria

AssemblyAI
AI Platform
Batch Transcription Universal 2

Recommended for

AssemblyAI (Universal-2 + LeMUR) · AssemblyAI customers

Works with

AssemblyAI

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.


Frequently asked questions

What does the Batch Transcription Universal 2 eval for AssemblyAI (Universal-2 + LeMUR) test?

How is the Batch Transcription Universal 2 eval scored?

The LLM judge grades each response against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Batch Transcription Universal 2 pack for AssemblyAI (Universal-2 + LeMUR) contains 9 test cases. Three sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Batch Transcription Universal 2 pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
