
Telephony SIP

LiveKit (Cloud + Agents) · LiveKit

Real-time Voice & Video Infra — LiveKit

Evaluates LiveKit's Telephony (SIP) across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Real-time Voice & Video Infra eval coverage.

About LiveKit

LiveKit is open-source real-time voice/video infrastructure used to build voice agents and live experiences — a WebRTC SFU, telephony (SIP), recording/egress, and the LiveKit Agents framework for STT→LLM→TTS pipelines, available as LiveKit Cloud and self-hosted.

Employees

~50

Industry

Voice AI Infrastructure

Headquarters

New York, NY

Website

livekit.io

Sample tests · showing 3 of 9

#	Input	Expected behavior	Check
01	Two dispatch rules match an inbound number — Rule A (direct to room 'support') and Rule B (individual rooms per call).	Per LiveKit SIP docs, dispatch rules are evaluated in order; the first match wins. Inspect the configured ordering and make precedence explicit. Do not rely on creation-time order across deployments; specify it in the IaC.	Pass / Fail · AI Platform · high
02	Server initiates an outbound call: CreateSIPParticipant({trunk_id, sip_call_to:'+15551234567', room_name:'call-xyz', participant_identity:'callee'}).	Verify the trunk is configured for outbound (CreateSIPOutboundTrunk done); the trunk auth (username, password, address) matches the carrier; from-number is an authorized caller ID with the carrier. Surface the carrier's SIP response (486 busy, 480 unavailable, 603 declined) back to the operator — d…	Pass / Fail · AI Platform · critical
03	User on a PSTN call presses '1'. Agent expects to read it from the audio transport but sees nothing.	LiveKit SIP supports RFC 2833 (RTP telephone-event) DTMF — DTMF events arrive as participant events, not as in-band audio. Subscribe to the DTMF event API in Agents; do not run an in-band Goertzel detector. SIP INFO and in-band DTMF are carrier-coupled and out of LiveKit's contract.	Pass / Fail · AI Platform · high
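
To illustrate why the digit in test 03 never shows up in the decoded audio, here is a minimal sketch of how an RFC 2833 telephone-event payload (the format is now specified in RFC 4733) encodes a key press as four bytes of metadata rather than as a tone. This is not LiveKit's API: `TelephoneEvent` and `parse_telephone_event` are hypothetical names, and an agent would normally receive the already-decoded event from the platform instead of parsing RTP itself.

```python
import struct
from dataclasses import dataclass

# Event codes 0-15 map to the DTMF keypad (RFC 4733, which obsoletes RFC 2833).
DTMF_SYMBOLS = "0123456789*#ABCD"


@dataclass(frozen=True)
class TelephoneEvent:
    """Hypothetical container for one decoded telephone-event payload."""
    digit: str      # keypad symbol, e.g. "1" or "#"
    end: bool       # True when the E bit marks the end of the key press
    volume: int     # tone power in -dBm0 (0-63)
    duration: int   # duration in RTP timestamp units


def parse_telephone_event(payload: bytes) -> TelephoneEvent:
    """Decode the 4-byte telephone-event payload carried in an RTP packet."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload must be at least 4 bytes")
    # Layout: event (8 bits) | E (1) R (1) volume (6) | duration (16), network byte order.
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    if event >= len(DTMF_SYMBOLS):
        raise ValueError(f"event code {event} is not a DTMF digit")
    return TelephoneEvent(
        digit=DTMF_SYMBOLS[event],
        end=bool(flags & 0x80),
        volume=flags & 0x3F,
        duration=duration,
    )
```

Because the key press travels as a separate RTP payload type, an in-band tone detector running on the audio track has nothing to detect, which matches the behavior the test case expects the agent to explain.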

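For test 02, surfacing the carrier's response could look something like the sketch below, which turns a SIP status code into an operator-readable message. The reason phrases come from RFC 3261; the function name `describe_sip_failure` is hypothetical, and how the status code is obtained from a failed outbound call is an assumption that depends on the SDK in use.

```python
# Reason phrases for the three codes named in the test case (RFC 3261).
SIP_FAILURE_REASONS = {
    480: "Temporarily Unavailable",
    486: "Busy Here",
    603: "Decline",
}

# Fallback descriptions by response class (RFC 3261, section 21).
SIP_FAILURE_CLASSES = {
    4: "request failure",
    5: "server failure",
    6: "global failure",
}


def describe_sip_failure(status_code: int) -> str:
    """Return a short operator-facing description of a failed SIP response."""
    reason = SIP_FAILURE_REASONS.get(status_code)
    if reason is not None:
        return f"SIP {status_code} {reason}"
    failure_class = SIP_FAILURE_CLASSES.get(status_code // 100)
    if failure_class is None:
        raise ValueError(f"{status_code} is not a SIP failure response")
    return f"SIP {status_code} ({failure_class})"
```

Keeping the numeric code in the message lets the operator distinguish a busy callee from a declined call instead of seeing a generic "call failed".
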
How this eval is graded

Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.
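
One way to read the pass rule above is sketched below. It assumes the judge returns a numeric score per rubric criterion and that a test case passes when the mean of those scores is at least 4.0 and no single score falls below 3; the function name, the dictionary shape, and the criterion names in the usage example are hypothetical and not taken from Corsac's API.

```python
from statistics import mean


def case_passes(
    criterion_scores: dict[str, float],
    mean_threshold: float = 4.0,
    floor: float = 3.0,
) -> bool:
    """Apply the stated rule: mean >= 4.0 and no criterion below 3."""
    if not criterion_scores:
        raise ValueError("at least one criterion score is required")
    scores = list(criterion_scores.values())
    return mean(scores) >= mean_threshold and min(scores) >= floor
```

For example, `case_passes({"accuracy": 5, "safety": 5, "clarity": 2})` fails even though the mean is 4.0, because one criterion is below the floor.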

Rubric criteria

LiveKit
AI Platform
Telephony SIP

Recommended for

LiveKit (Cloud + Agents) · LiveKit customers

Works with

LiveKit

Related evals

AI Platform

Claude API

Evaluates Anthropic's Batch API across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Extended Thinking across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.

AI Platform

Claude API

Evaluates Anthropic's Files API & Citations across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Foundation Model & API eval coverage.


Frequently asked questions

What does the Telephony SIP eval for LiveKit (Cloud + Agents) test?

Evaluates LiveKit's Telephony (SIP) across 9 scenario-based test cases, each graded against an expected-behavior rubric by an LLM judge, from Corsac's Real-time Voice & Video Infra eval coverage.

How is the Telephony SIP eval scored?

The judge rubric: Grade against expected.ideal_behavior and expected.rubric. Per-criterion pass requires mean >= 4.0 and no criterion below 3.

How many test cases does this eval pack include?

The Telephony SIP pack for LiveKit (Cloud + Agents) contains 9 test cases. Three sample cases are shown free on this page; the full set runs in a Corsac workspace.

How do I run this eval?

Sign up for Corsac, connect your model or agent endpoint, and run the Telephony SIP pack as-is or after customizing thresholds. Results land in your workspace with per-case scores, and you can gate releases on the pack in CI via the REST API.
