Pick your stack on the left, customize prompts in the middle, talk on the right. Live latency, real per-call cost, transcript.
Consistency / Similarity / Enhancement apply to lightning-v2 only. We ship lightning-v3.1 in the dropdown — these knobs are future-proofing for operators who hand-edit the saved config.
Per-stage time-to-first-token / first-audio-byte, measured during the call. Empty until the first turn lands.
Two independent pipelines control conversation flow. Tuning one doesn't affect the other.
After the user stops speaking, the agent waits, watches a turn-detector ML model, then commits to reply.
wait = MIN(Take turn, Response rate floor + Linear delay) + turn-detector hesitation
where Response rate floor =
Rapid → 200ms · Balanced → 500ms · Patient → 1000ms
Custom → whatever you type in Endpointing
There's also a ~250ms VAD silence floor underneath everything (Silero default, hardcoded). It stacks BEFORE this formula starts counting.
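The formula above can be sketched in a few lines of Python. This is an illustration only: the function and constant names are hypothetical, the preset floors and the ~250ms VAD figure are taken from the text above, and the turn-detector hesitation is passed in as a plain number because the model's actual behaviour is not described here.

```python
# Floors for the Response rate presets, in milliseconds (values from the text above).
RESPONSE_RATE_FLOOR_MS = {"rapid": 200, "balanced": 500, "patient": 1000}

# Approximate VAD silence floor; it elapses before the formula starts counting.
VAD_SILENCE_FLOOR_MS = 250


def reply_wait_ms(response_rate, take_turn_ms, linear_delay_ms=0,
                  hesitation_ms=0, endpointing_ms=None):
    """wait = MIN(Take turn, floor + Linear delay) + turn-detector hesitation."""
    if response_rate == "custom":
        if endpointing_ms is None:
            raise ValueError("Custom response rate needs an Endpointing value")
        floor = endpointing_ms
    else:
        floor = RESPONSE_RATE_FLOOR_MS[response_rate]
    return min(take_turn_ms, floor + linear_delay_ms) + hesitation_ms


def total_silence_ms(wait_ms):
    """Silence from end of speech to commit, with the VAD floor stacked in front."""
    return VAD_SILENCE_FLOOR_MS + wait_ms
```

With the Default column (Balanced, Take turn 2000, Linear delay 0) and no hesitation, `reply_wait_ms("balanced", 2000)` gives 500, and about 750ms of total silence once the VAD floor is included.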
While the agent is speaking, the user's voice may cut it off. Cutoff requires ALL of:
AEC warmup has elapsed
user speech ≥ Min interrupt
user words ≥ Min words
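The three cutoff conditions combine with a plain AND. A minimal sketch, assuming hypothetical names (`InterruptSettings`, `should_cut_off`) and a whitespace word count as a stand-in for however the real pipeline counts transcript words:

```python
from dataclasses import dataclass


@dataclass
class InterruptSettings:
    # Defaults mirror the Default column of the settings table.
    aec_warmup_s: float = 0.5
    min_interrupt_ms: int = 300
    min_words: int = 2


def should_cut_off(settings, call_elapsed_s, user_speech_ms, transcript):
    """Return True only when ALL three interruption conditions hold."""
    warmup_done = call_elapsed_s >= settings.aec_warmup_s
    long_enough = user_speech_ms >= settings.min_interrupt_ms
    # Assumption: words are counted by splitting on whitespace.
    enough_words = len(transcript.split()) >= settings.min_words
    return warmup_done and long_enough and enough_words
```

With the defaults, a 400ms "please stop" ten seconds into the call cuts the agent off, while a lone "stop" does not unless Min words is lowered to 1.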
| Setting | What it does | Default | Snappy | Phone | Patient |
|---|---|---|---|---|---|
| Response speed (Pipeline A) | |||||
| Response rate | Minimum silence wait after you stop speaking before the agent replies. Rapid=200ms, Balanced=500ms, Patient=1000ms, Custom=reveals an Endpointing input where you type the exact ms. | Balanced | Rapid | Balanced | Patient |
| Endpointing (ms) | Only visible when Response rate = Custom. Sets the exact silence floor in milliseconds. | — | — | — | — |
| Linear delay (ms) | Padding added ON TOP of the floor. Lets you give extra breathing room for thinkers without changing the preset. | 0 | 0 | 0 | 200 |
| Take turn (ms) | Ceiling. If turn-detector ML stays uncertain this long, force-commit the turn anyway. | 2000 | 1000 | 1500 | 3000 |
| Turn-detector threshold | Confidence (0–1) the semantic EOU model needs before committing a turn within min_delay. Below threshold the session keeps waiting up to Take turn. Empty = use the model's per-language default. | auto | auto | auto | 0.35 |
| User interruption — how callers cut into the agent (Pipeline B) | |||||
| AEC warmup (sec) | After call starts, NO interrupts allowed for this long while echo-cancellation converges. Lower = snappier early interrupts, higher = cleaner audio. | 0.5 | 0.3 | 1.0 | 1.5 |
| Min interrupt (ms) | Your speech must last at least this long to count as an interruption. Lower = easier interrupt, more false positives. | 300 | 200 | 400 | 600 |
| Min words | Your transcript must contain at least this many words. 1 = even "stop" interrupts. | 2 | 1 | 2 | 3 |
| Resume false interrupt | If you triggered an interrupt but stopped quickly, resume the agent's cut-off reply automatically. | ON | ON | ON | ON |
| Resume after (sec) | How long to wait before resuming the cut-off reply. | 1.5 | 1.5 | 1.5 | 2.0 |
| Other | |||||
| Preemptive LLM | Start the LLM call before turn-detector fully commits. Saves ~250 ms per turn. | ON | ON | ON | ON |
| IVR detection | Detect "Press 1 for sales" auto-attendants. Mainly useful for outbound calls to call centres. | OFF | OFF | OFF | OFF |
| Noise cancellation | Krisp neural NC on inbound audio. Auto picks BVC for web, BVCTelephony for phone. Adds ~5-10% CPU per active call on the worker. | Off | Auto | Auto | Auto |
| Background denoising | Light inbound noise filter (NC). Redundant — and auto-disabled — when Krisp NC above is on. Useful only when NC is Off but the caller environment is noisy. | OFF | OFF | OFF | OFF |
| First message | How the agent's first words are produced. Assistant first = speak the fixed greeting text. Wait for caller = stay silent until caller speaks. LLM-generated = agent improvises a greeting from your instructions. | Assistant first | Assistant first | Assistant first | Assistant first |
| Welcome delay (ms) | Sleep this long before the agent's first greeting. Useful when phone audio path needs a moment to settle. | 0 | 0 | 500 | 0 |
| Hangup on silence | Auto-disconnect if the user is silent for the threshold. Useful for unmanned outbound. | OFF | OFF | OFF / 30s | OFF |
| Online check msg / after | Probe the user with a short message after N seconds of silence ("Are you still there?"). Fires before hangup. | "Are you still there?" / 9s | — | "Are you still there?" / 10s | "Still there?" / 20s |
| Max duration (sec) | Hard ceiling on call length. Defensive against runaway calls. | 600 | 600 | 900 | 900 |
| STT — Deepgram Nova advanced | |||||
| DG endpointing (ms) | Silence Deepgram waits before emitting its own end-of-utterance hint. The plugin's built-in 25ms is far too aggressive — Deepgram themselves recommend 300–500ms. Lower = snappier but more mid-sentence cutoffs on slow speakers. | 300 | 200 | 300 | 500 |
| DG filler words | Keep "um"/"uh"/"hmm" in the transcript. When OFF those tokens get stripped and the turn-detector sees a "done" transcript while the user is still hesitating — the #1 cause of premature EOU. | ON | ON | ON | ON |
| DG smart format | Add punctuation, numeric formatting, etc. to transcripts. Improves prompt readability and unlocks future punctuation-based endpointing modes. | ON | ON | ON | ON |
| STT — Deepgram Flux advanced (server-side EOU) | |||||
| EOT threshold | Confidence (0.5–0.9) Flux's own end-of-turn model needs before declaring EOT. Plugin default 0.7. Lower = snappier but more mid-sentence cutoffs. | 0.7 | 0.5 | 0.7 | 0.8 |
| EOT timeout (ms) | Maximum ms Flux waits for a confident EOT before forcing one. Plugin default 3000. Lowering trims worst-case latency at the cost of cutting off slow speakers. | 3000 | 2000 | 3000 | 4500 |
| Eager EOT | Early-fire confidence (0.3–0.9) that lets Flux emit a "likely end of turn" signal ahead of the final EOT, so preemptive LLM generation can kick in sooner. Must be ≤ EOT threshold. Blank = disabled. | off | 0.3 | off | off |
| Audio environment | |||||
| Ambient sound | Background loop that makes the agent sound "in the room". 7 built-in clips (office, crowded room, city, forest, etc.). Local files — no API cost. | Off | Off | Office | Off |
| Ambient volume | 0.0 (silent) to 1.0 (same level as agent). Phone calls usually want 0.15–0.25 because compressed audio masks consonants. | 0.30 | 0.30 | 0.20 | 0.30 |
| TTS — ElevenLabs advanced | |||||
| Stability | 0.0 = highly emotional/variable delivery, 1.0 = flat/monotone. Lower values feel more "alive" but more unpredictable across renders. | 0.50 | 0.40 | 0.55 | 0.55 |
| Similarity boost | How strictly the synthesized speech tracks the chosen voice's clone. Higher = closer to the source recording; lower = more freedom to drift. | 0.75 | 0.75 | 0.80 | 0.75 |
| Style exaggeration | 0 = neutral (fastest synthesis), 1 = exaggerated emotion. Higher values add noticeable latency and can introduce artefacts. | 0 | 0 | 0 | 0 |
| Speaker boost | Extra processing pass that nudges similarity toward the source voice. Mild latency cost. On by default. | on | off | on | on |
| Speed | Speech rate multiplier. Plugin range 0.8–1.2 (narrower than the ElevenLabs web UI's 0.7–1.2). 1.0 is the model's natural pace. | 1.00 | 1.05 | 1.00 | 0.95 |
| Auto mode | Disables chunk-schedule buffering for lower latency. Plugin enables this by default — unchecking re-introduces buffering for higher-quality concatenation between phrases. Recommended ON. | on | on | on | on |
| Streaming latency | 0 (default) – 4 (max optimization). Deprecated upstream but still functional. Leave blank to skip and use the plugin default. | — | 3 | — | — |
| TTS — Sarvam advanced | |||||
| Pitch | Voice pitch shift, -0.75 (deeper) to +0.75 (higher). 0 = natural. v2 models only — silently dropped for v3 / v3-beta. | 0 | 0 | 0 | 0 |
| Pace | Speech rate multiplier, 0.3 (very slow) to 3.0 (very fast). 1.0 is the model's natural pace. | 1.00 | 1.10 | 1.00 | 0.90 |
| Loudness | Output volume multiplier, 0.5 to 2.0. v2 models only. Use sparingly — over-boosting clips on phone calls. | 1.00 | 1.00 | 1.10 | 1.00 |
| Preprocessing | Text preprocessing (numbers, dates, abbreviations spelled out). bulbul:v2 only. Helps prevent "Rs. 2024" being read as "rupees two thousand twenty-four" in unintended places. | off | off | on | on |
| Sample rate | Audio sample rate (Hz). Allowed: 8000 / 16000 / 22050 / 24000 / 32000 / 44100 / 48000. 22050 is the sweet spot for voice; phone trunks downsample to 8000 anyway. | 22050 | 22050 | 22050 | 22050 |
| STT — Smallest.ai advanced | |||||
| Sample rate | Audio sample rate (Hz) the WebSocket stream uses. Allowed: 8000 / 16000 / 22050 / 24000 / 44100 / 48000. 16000 is the plugin default and the right value for voice; 8000 only when feeding straight from a narrowband phone trunk. | 16000 | 16000 | 16000 | 16000 |
| Encoding | Encoding of the input audio stream. linear16 is the plugin default and the most-compatible choice; mulaw/alaw for phone trunks; opus/ogg_opus if the source is already Opus. | linear16 | linear16 | mulaw | linear16 |
| Word timestamps | Per-word start/end timestamps + confidence in transcripts. Plugin default ON. Turn off only to shave a few bytes per response — saves no latency. | ON | ON | ON | ON |
| Diarization | Per-word speaker IDs (integers during streaming). Plugin default OFF. Turn on only when you need to split a multi-speaker audio stream — adds upstream cost. | OFF | OFF | OFF | OFF |
| EOU timeout (ms) | Server-side end-of-utterance silence threshold. Plugin default 0 = disabled, which is the right value — LiveKit's own turn detector owns EOU. Setting >0 stacks Smallest's server-side EOU on top, adding latency for no upside. | off | off | off | off |
| TTS — Smallest.ai advanced | |||||
| Speed | Speech rate multiplier. Plugin default 1.0. Lightning v3.1 sounds natural in 0.85–1.15; outside that range gets noticeably robotic. | 1.00 | 1.10 | 1.00 | 0.90 |
| Sample rate | Audio sample rate (Hz). Plugin default 24000 is the right value for Lightning v3.1 quality; lower only when bandwidth is tight. | 24000 | 24000 | 24000 | 24000 |
| Output format | Audio container/encoding. pcm is the plugin default and the right pick for piping into LiveKit. mulaw/alaw only when terminating directly into a SIP trunk that wants narrowband. | pcm | pcm | pcm | pcm |
| Consistency / Similarity / Enhancement | lightning-v2 ONLY — the plugin silently drops these on v3.1. Future-proofing for operators who hand-edit the saved config to switch back to v2. Defaults: consistency 0.5, similarity 0, enhancement 1. | — | — | — | — |
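Several rows in the table state numeric ranges and one cross-field rule (Eager EOT must be ≤ EOT threshold). A hand-edited config could be sanity-checked along these lines. The key names below are hypothetical and are not the actual keys used in the saved config; only the ranges come from the table.

```python
# Ranges quoted in the settings table. The key names are illustrative assumptions.
RANGES = {
    "flux_eot_threshold": (0.5, 0.9),
    "flux_eager_eot": (0.3, 0.9),
    "elevenlabs_speed": (0.8, 1.2),
    "sarvam_pitch": (-0.75, 0.75),
    "sarvam_pace": (0.3, 3.0),
    "sarvam_loudness": (0.5, 2.0),
    "ambient_volume": (0.0, 1.0),
}


def validate_ranges(cfg):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for key, (low, high) in RANGES.items():
        value = cfg.get(key)  # missing or None means "not set", which is allowed
        if value is not None and not (low <= value <= high):
            problems.append(f"{key}={value} outside {low}..{high}")
    eager = cfg.get("flux_eager_eot")
    eot = cfg.get("flux_eot_threshold")
    if eager is not None and eot is not None and eager > eot:
        problems.append("flux_eager_eot must be <= flux_eot_threshold")
    return problems
```

The Snappy column (EOT threshold 0.5, Eager EOT 0.3) passes, while raising Eager EOT to 0.6 against the same threshold is flagged.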
Expected EOU drop: 1.5s → 0.6s
Risk: more false positives. Coughs, breaths, background TV → may cut off the agent. Bump back up if it happens.
Trade-off: slower replies. Use only if users are elderly / non-native / pause a lot.
If Krisp licence isn't installed, BVC falls back to a basic filter — still removes most steady-state noise (fans, traffic) but not transient sounds (door slam, cough).
Uses local files shipped with livekit-agents — no API or per-minute cost.
Watch the live Latency Breakdown card during a call. Numbers come from real metrics.
Smallest.ai's Pulse STT is the cheapest in the dropdown (~₹0.48/min vs Deepgram ~₹0.46/min for nova-3-en) and Lightning v3.1 TTS supports Hindi + code-mixing.
First call may add ~150ms while their CDN warms up — subsequent calls are sub-200ms.
The STT is committing end-of-turn during the brief pause before the filler word lands.
Works best with Hindi / multilingual callers who naturally use "matlab" / "yaani" / "haan" as fillers.
Every knob on this page round-trips through configs/default.json, so what you saved is what you see. Phone calls (Plivo / Twilio) don't carry per-call overrides, so they always use the saved default. For web calls, the UI can also send overrides per-call without saving.
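The saved-default versus per-call-override behaviour can be pictured as a shallow merge. This is a sketch under assumptions: the function name, the `channel` values and the flat key layout are hypothetical, and the real config may well be nested.

```python
def effective_config(saved, channel, overrides=None):
    """Pick the settings a call runs with.

    saved     -- dict loaded from the saved default config
    channel   -- "web" or "phone" (hypothetical labels)
    overrides -- optional per-call dict sent by the web UI
    """
    if channel == "web" and overrides:
        # Web calls: overrides win for this call only; the saved dict is untouched.
        return {**saved, **overrides}
    # Phone calls carry no overrides, so they always get the saved default.
    return dict(saved)
```

For example, with `saved = {"take_turn_ms": 2000, "min_words": 2}`, a web call sending `{"min_words": 1}` runs with `min_words` 1, while a phone call with the same arguments still runs with 2.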
Placeholders like {agent_name} are picked up from your greeting / prefix / base prompt. Whatever you enter here replaces them before the call is sent, so the same template works for many calls.
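Placeholder substitution of this kind can be sketched with a small regex. The helper names are hypothetical, and leaving unknown placeholders untouched is an assumption about the desired behaviour rather than something the page states.

```python
import re

# Matches {name} where name is letters, digits or underscores.
_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def find_placeholders(*templates):
    """List placeholder names in first-seen order, without duplicates."""
    seen = []
    for template in templates:
        for name in _PLACEHOLDER.findall(template):
            if name not in seen:
                seen.append(name)
    return seen


def fill(template, values):
    """Replace known placeholders; leave unknown ones as written."""
    return _PLACEHOLDER.sub(
        lambda m: str(values.get(m.group(1), m.group(0))), template
    )
```

Here `find_placeholders` would drive which input fields are shown, and `fill` would be applied to the greeting, prefix and base prompt just before the call is sent.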
Describe what the voice agent should do — its role, the goal of the call, key flow notes. The LLM expands this into a polished prompt; voice-call rules, language and persona are added automatically as the prefix.