AssemblyAI evals — LeMUR (relift v3 InfraRed)
About AssemblyAI
AssemblyAI is a speech-AI platform that offers Universal-2 speech-to-text, real-time streaming, Speaker Diarization, Audio Intelligence (summarization, sentiment, content moderation), and LeMUR, an LLM framework that runs over transcripts (task, summary, question-answer, action items).
Sample tests · showing 3 of 9
| # | Input | Expected behavior | Check |
|---|---|---|---|
| 01 | Agent calls POST /lemur/v3/generate/task with both transcript_ids=['t_1','t_2'] and input_text='...' for the same prompt. | Use transcript_ids[] when the LLM should run over AssemblyAI-produced transcripts; use input_text for arbitrary text. Combining both has documented precedence; verify the docs and pick one cleanly. Multi-transcript LeMUR concatenates transcripts under a context budget set by final_model's context … | Pass / Fail · AI Platform · high |
| 02 | Agent wants Claude 3.5 Sonnet specifically for a LeMUR summary task. Sets final_model='anthropic/claude-3-5-sonnet'. | final_model selects the LLM backing LeMUR (default, basic, or anthropic/* per current docs). Pin a model id you've evaluated; do not rely on 'default' across releases, as the underlying model may shift. Verify available model ids in current docs [REQUIRES-VERIFICATION]. | Pass / Fail · AI Platform · medium |
| 03 | Agent asks LeMUR question_answer 'What is the company's Q4 revenue?' on a transcript that does not discuss financials. | LeMUR may hallucinate when asked about topics not present in the transcript. Prompt-engineer with explicit 'If the transcript does not contain the answer, say so' clauses. Validate outputs against source spans; treat unverifiable answers as failures, not as best-effort guesses. | Pass / Fail · AI Platform · critical |
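
The expected behaviors in the three sample tests can be sketched as a small request builder plus a grounding check. This is a minimal illustration, not AssemblyAI's SDK: `build_task_payload` and `quote_is_in_source` are hypothetical helpers, the `prompt` field name is an assumption to verify against the current API reference, and the substring check is a deliberately crude stand-in for validating answers against source spans. No network call is made.

```python
from typing import Optional, Sequence

# Clause quoted from the expected behavior of test 03.
NO_ANSWER_CLAUSE = "If the transcript does not contain the answer, say so."


def build_task_payload(
    prompt: str,
    final_model: str,
    transcript_ids: Optional[Sequence[str]] = None,
    input_text: Optional[str] = None,
) -> dict:
    """Build a JSON body for POST /lemur/v3/generate/task (hypothetical helper)."""
    # Test 01: pick exactly one input source instead of relying on precedence.
    if bool(transcript_ids) == bool(input_text):
        raise ValueError("pass exactly one of transcript_ids or input_text")
    # Test 02: require an explicit, evaluated model id rather than 'default'.
    if not final_model or final_model == "default":
        raise ValueError("pin an explicit final_model id")
    payload = {
        # Test 03: append the explicit no-answer clause to every prompt.
        "prompt": f"{prompt.strip()} {NO_ANSWER_CLAUSE}",  # field name assumed
        "final_model": final_model,
    }
    if transcript_ids:
        payload["transcript_ids"] = list(transcript_ids)
    else:
        payload["input_text"] = input_text
    return payload


def quote_is_in_source(quote: str, source_text: str) -> bool:
    """Crude span check: does the supporting quote appear in the source text?

    Comparison ignores case and collapses runs of whitespace.
    """
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())

    return bool(quote.strip()) and normalize(quote) in normalize(source_text)
```

A caller would treat any answer whose supporting quote fails `quote_is_in_source` as a failure rather than a best-effort guess, in line with test 03.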
How this eval is graded
Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
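
One possible reading of the pass rule can be sketched as follows. It assumes each rubric criterion receives a single numeric score, that the mean is taken across criteria, and that "below 3" is a strict floor; the scoring scale is not stated in this page, and the function name and criterion names are illustrative only.

```python
from statistics import mean
from typing import Mapping


def eval_passes(
    scores: Mapping[str, float],
    min_mean: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Return True if the mean score is >= min_mean and no score is below floor."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= min_mean and min(values) >= floor


# Hypothetical criterion names, for illustration only.
print(eval_passes({"grounding": 5, "model_pinning": 4, "input_choice": 3}))
print(eval_passes({"grounding": 5, "model_pinning": 5, "input_choice": 2}))
```

Under this reading, a single criterion scoring below 3 fails the eval even when the mean reaches 4.0.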
Rubric criteria
- AssemblyAI
- AI Platform
- LeMUR