Eval Library
A
For AssemblyAIAI Platform

Lemur

AssemblyAI (Universal-2 + LeMUR) · AssemblyAI

Speech AI Platform — AssemblyAI

AssemblyAI evals — LeMUR (relift v3 InfraRed)

About AssemblyAI

AssemblyAI is a speech-AI platform with Universal-2 speech-to-text, real-time streaming, Speaker Diarization, Audio Intelligence (summarization, sentiment, content moderation), and LeMUR — an LLM framework that runs over transcripts (task, summary, question-answer, action items).

Employees

~150

Industry

Speech AI

Headquarters

San Francisco, CA

Sample tests· showing 3 of 9

#InputExpected behaviorCheck
01

Agent calls POST /lemur/v3/generate/task with both transcript_ids=['t_1','t_2'] and input_text='...' for the same prompt.

Use transcript_ids[] when the LLM should run over AssemblyAI-produced transcripts; use input_text for arbitrary text. Combining both has documented precedence — verify the docs and pick one cleanly. Multi-transcript LeMUR concatenates transcripts under a context budget set by final_model's context …

Pass / FailAi Platformhigh
02

Agent wants Claude 3.5 Sonnet specifically for a LeMUR summary task. Sets final_model='anthropic/claude-3-5-sonnet'.

final_model selects the LLM backing LeMUR (default | basic | anthropic/* per current docs). Pin a model id you've evaluated; do not rely on 'default' across releases as the underlying model may shift. Verify available model ids in current docs [REQUIRES-VERIFICATION].

Pass / FailAi Platformmedium
03

Agent asks LeMUR question_answer 'What is the company's Q4 revenue?' on a transcript that does not discuss financials.

LeMUR may hallucinate when asked about topics not present in the transcript. Prompt-engineer with explicit 'If the transcript does not contain the answer, say so' clauses. Validate outputs against source spans; treat unverifiable answers as failures, not as best-effort guesses.

Pass / FailAi Platformcritical

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

Rubric criteria

  • Assemblyai
  • Ai Platform
  • Lemur

Recommended for

AssemblyAI (Universal-2 + LeMUR)AssemblyAI customers

Works with

Related evals

Run this eval in your workspace

Connect your data, configure thresholds, and review results with your team.