For LiveKit · AI Platform

LiveKit Agents Framework

LiveKit (Cloud + Agents) · LiveKit

Real-time Voice & Video Infra — LiveKit

Evaluates the LiveKit Agents framework across 9 scenario-based test cases. An LLM judge grades each case against an expected-behavior rubric. Part of Corsac's Real-time Voice & Video Infra eval coverage.

About LiveKit

LiveKit is open-source real-time voice/video infrastructure used to build voice agents and live experiences. It comprises a WebRTC SFU, telephony (SIP), recording/egress, and the LiveKit Agents framework for STT→LLM→TTS pipelines, and is available as LiveKit Cloud or self-hosted.

Employees

~50

Industry

Voice AI Infrastructure

Headquarters

New York, NY

Website

livekit.io

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Operator runs `python agent.py start` to register a worker against LiveKit Cloud. Worker prints 'registered' but no agent ever joins a room.	Worker registration only advertises availability; agents join rooms via (a) automatic dispatch matching the worker's room-name pattern, or (b) explicit AgentDispatchService.CreateDispatch from server code. Verify the dispatch path is wired; do not assume 'registered' implies 'joined'.	Pass / Fail · AI Platform · high
02	Operator builds a VoicePipelineAgent with Deepgram STT, OpenAI LLM, Cartesia TTS. End-to-end latency to first audible token is reported as 1.8s.	Latency budget composes: STT partial-to-final + LLM TTFB + TTS first-byte + SFU RTT. Each plugin reports its own metrics; verify which stage dominates before changing components. Aggregate p50/p95 numbers are [REQUIRES-VERIFICATION] and provider-coupled; do not claim a fixed total in customer-faci…	Pass / Fail · AI Platform · high
03	User says 'Hold on…' with a 1.2s pause mid-sentence. The agent interrupts and responds, treating the pause as end-of-turn.	Tune Silero VAD min_silence_duration (or equivalent) to a value larger than typical inter-word pauses for the target population (e.g., 800–1200 ms for thoughtful speakers). Combine with a small grace window after partial-transcript stability. Do not rely solely on VAD with default thresholds.	Pass / Fail · AI Platform · high
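
The latency budget in case 02 can be made concrete with a little arithmetic. The Python sketch below sums per-stage latencies and reports which stage dominates. The stage names mirror the expected-behavior text; the millisecond values are made-up placeholders chosen to add up to the 1.8s in the scenario, not measurements, and `StageLatency`, `total_latency_ms`, and `dominant_stage` are hypothetical helpers, not part of the LiveKit Agents API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """Latency contributed by one stage of the voice pipeline."""
    name: str
    ms: float


def total_latency_ms(stages):
    """Sum the per-stage latencies into one end-to-end budget."""
    return sum(s.ms for s in stages)


def dominant_stage(stages):
    """Return the stage that contributes the most latency."""
    return max(stages, key=lambda s: s.ms)


# Placeholder values for illustration only; replace with the metrics
# each plugin actually reports in your deployment.
example_stages = [
    StageLatency("stt_partial_to_final", 300.0),
    StageLatency("llm_ttfb", 900.0),
    StageLatency("tts_first_byte", 400.0),
    StageLatency("sfu_rtt", 200.0),
]

total = total_latency_ms(example_stages)
worst = dominant_stage(example_stages)
print(f"total={total:.0f} ms, dominant={worst.name} ({worst.ms / total:.0%})")
```

With real per-stage metrics in place of the placeholders, the dominant stage is the one to investigate first, which is the point of the expected behavior: measure before swapping components.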

How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
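
One way to read the pass rule is sketched below in Python, under the assumption that the judge returns one numeric score per rubric criterion and that a case passes when the mean of those scores is at least 4.0 and no single score is below 3. The scoring scale is not stated here, and the function name and criterion names are hypothetical.

```python
from statistics import mean


def case_passes(scores, mean_threshold=4.0, floor=3):
    """Apply the pass rule to a mapping of criterion name -> judge score.

    Assumed reading: mean of all criterion scores >= mean_threshold,
    and every individual score >= floor.
    """
    if not scores:
        raise ValueError("at least one criterion score is required")
    values = list(scores.values())
    return mean(values) >= mean_threshold and min(values) >= floor


# Hypothetical criterion names, for illustration only.
print(case_passes({"accuracy": 5, "grounding": 4, "safety": 3}))  # mean 4.0, min 3
print(case_passes({"accuracy": 5, "grounding": 5, "safety": 2}))  # mean 4.0, min 2
print(case_passes({"accuracy": 4, "grounding": 4, "safety": 3}))  # mean below 4.0
```

The second example shows why the floor matters: a high average cannot compensate for one criterion that scores below 3.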

Rubric criteria

LiveKit
AI Platform
LiveKit Agents Framework

Recommended for

LiveKit (Cloud + Agents) · LiveKit customers

Works with

LiveKit

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

Frequently asked questions

What does the LiveKit Agents Framework eval for LiveKit (Cloud + Agents) test?

How is the LiveKit Agents Framework eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The LiveKit Agents Framework pack for LiveKit (Cloud + Agents) contains 9 test cases. 3 sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the LiveKit Agents Framework pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
