Superleap Voice Platform

How the agent decides when to speak & whether you can interrupt

Two independent pipelines control conversation flow. Tuning one doesn't affect the other.

Pipeline A — Response speed

After the user stops speaking, the agent waits, watches a turn-detector ML model, then commits to reply.


wait = MIN(
  Take turn,
  Response rate floor + Linear delay
) + turn-detector hesitation

where Response rate floor =
  Rapid → 200ms · Balanced → 500ms · Patient → 1000ms
  Custom → whatever you type in Endpointing

There's also a ~250ms VAD silence floor underneath everything (Silero default, hardcoded). It stacks BEFORE this formula starts counting.
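
The formula above can be sketched in Python. This is a minimal illustration, not the platform's implementation: the function and constant names are hypothetical, the turn-detector hesitation is passed in as a plain number, and the ~250ms VAD floor is added separately because it elapses before the formula starts counting.

```python
# Hypothetical names; the millisecond values are the presets listed above.
RESPONSE_RATE_FLOOR_MS = {"rapid": 200, "balanced": 500, "patient": 1000}
VAD_SILENCE_FLOOR_MS = 250  # approximate Silero floor, per the text


def pipeline_a_wait_ms(response_rate, linear_delay_ms, take_turn_ms,
                       hesitation_ms=0, endpointing_ms=None):
    """Return the Pipeline A wait in ms, following the formula as written."""
    if response_rate == "custom":
        if endpointing_ms is None:
            raise ValueError("Custom response rate needs an Endpointing value")
        floor = endpointing_ms
    else:
        floor = RESPONSE_RATE_FLOOR_MS[response_rate]
    return min(take_turn_ms, floor + linear_delay_ms) + hesitation_ms


def total_silence_ms(**kwargs):
    """Add the VAD floor that elapses before the formula starts counting."""
    return VAD_SILENCE_FLOOR_MS + pipeline_a_wait_ms(**kwargs)
```

With the Balanced preset, no Linear delay and a 2000ms Take turn, pipeline_a_wait_ms("balanced", 0, 2000) gives 500, and the VAD floor brings the total silence to roughly 750ms.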

Pipeline B — User interrupting the agent

While the agent is speaking, the user's voice may cut it off. Cutoff requires ALL of:


  AEC warmup has elapsed
  user speech ≥ Min interrupt
  user words ≥ Min words
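
The three conditions combine with a logical AND. A minimal sketch follows, assuming the word count is taken by splitting the transcript on whitespace; the class and function names are hypothetical, and the default values are the ones in the Default column of the settings reference.

```python
from dataclasses import dataclass


@dataclass
class InterruptSettings:
    # Defaults copied from the "Default" column of the settings reference.
    aec_warmup_s: float = 0.5
    min_interrupt_ms: int = 300
    min_words: int = 2


def may_interrupt(settings, call_elapsed_s, speech_ms, transcript):
    """True only when all three Pipeline B conditions hold."""
    word_count = len(transcript.split())  # assumption: whitespace word count
    return (
        call_elapsed_s >= settings.aec_warmup_s
        and speech_ms >= settings.min_interrupt_ms
        and word_count >= settings.min_words
    )
```

With the defaults, a 400ms "stop" two seconds into the call does not interrupt because it is a single word; setting min_words=1 lets it through.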

Settings reference

Setting	What it does	Default	Snappy	Phone	Patient
Response speed (Pipeline A)
Response rate	Minimum silence wait after you stop speaking before the agent replies. Rapid=200ms, Balanced=500ms, Patient=1000ms, Custom=reveals an Endpointing input where you type the exact ms.	Balanced	Rapid	Balanced	Patient
Endpointing (ms)	Only visible when Response rate = Custom. Sets the exact silence floor in milliseconds.	—	—	—	—
Linear delay (ms)	Padding added ON TOP of the floor. Lets you give extra breathing room for thinkers without changing the preset.	0	0	0	200
Take turn (ms)	Ceiling. If turn-detector ML stays uncertain this long, force-commit the turn anyway.	2000	1000	1500	3000
Turn-detector threshold	Confidence (0–1) the semantic EOU model needs before committing a turn within `min_delay`. Below threshold the session keeps waiting up to Take turn. Empty = use the model's per-language default.	auto	auto	auto	0.35
User interruption — how callers cut into the agent (Pipeline B)
AEC warmup (sec)	After call starts, NO interrupts allowed for this long while echo-cancellation converges. Lower = snappier early interrupts, higher = cleaner audio.	0.5	0.3	1.0	1.5
Min interrupt (ms)	Your speech must last at least this long to count as an interruption. Lower = easier interrupt, more false positives.	300	200	400	600
Min words	Your transcript must contain at least this many words. `1` = even "stop" interrupts.	2	1	2	3
Resume false intrr.	If you triggered an interrupt but stopped quickly, resume the agent's cut-off reply automatically.	ON	ON	ON	ON
Resume after (sec)	How long to wait before resuming the cut-off reply.	1.5	1.5	1.5	2.0
Other
Preemptive LLM	Start the LLM call before turn-detector fully commits. Saves ~250 ms per turn.	ON	ON	ON	ON
IVR detection	Detect "Press 1 for sales" auto-attendants. Mainly useful for outbound calls to call centres.	OFF	OFF	OFF	OFF
Noise cancellation	Krisp neural NC on inbound audio. Auto picks BVC for web, BVCTelephony for phone. Adds ~5-10% CPU per active call on the worker.	Off	Auto	Auto	Auto
Background denoising	Light inbound noise filter (NC). Redundant — and auto-disabled — when Krisp NC above is on. Useful only when NC is Off but the caller environment is noisy.	OFF	OFF	OFF	OFF
First message	How the agent's first words are produced. Assistant first = speak the fixed greeting text. Wait for caller = stay silent until caller speaks. LLM-generated = agent improvises a greeting from your instructions.	Assistant first	Assistant first	Assistant first	Assistant first
Welcome delay (ms)	Sleep this long before the agent's first greeting. Useful when phone audio path needs a moment to settle.	0	0	500	0
Hangup on silence	Auto-disconnect if the user is silent for the threshold. Useful for unmanned outbound.	OFF	OFF	OFF / 30s	OFF
Online check msg / after	Probe the user with a short message after N seconds of silence ("Are you still there?"). Fires before hangup.	"Are you still there?" / 9s	—	"Are you still there?" / 10s	"Still there?" / 20s
Max duration (sec)	Hard ceiling on call length. Defensive against runaway calls.	600	600	900	900
STT — Deepgram Nova advanced
DG endpointing (ms)	Silence Deepgram waits before emitting its own end-of-utterance hint. The plugin's built-in 25ms is far too aggressive — Deepgram themselves recommend 300–500ms. Lower = snappier but more mid-sentence cutoffs on slow speakers.	300	200	300	500
DG filler words	Keep "um"/"uh"/"hmm" in the transcript. When OFF those tokens get stripped and the turn-detector sees a "done" transcript while the user is still hesitating — the #1 cause of premature EOU.	ON	ON	ON	ON
DG smart format	Add punctuation, numeric formatting, etc. to transcripts. Improves prompt readability and unlocks future punctuation-based endpointing modes.	ON	ON	ON	ON
STT — Deepgram Flux advanced (server-side EOU + language)
Language strategy	Detect-then-lock (recommended for multilingual): starts with the broad "Initial hints" set, then narrows to the first turn's detected language for monolingual-grade accuracy. Re-locks after ≥2 consecutive turns in a different language. Static = never change. Auto-detect = no hints, risks misidentification (Hindi → Russian per Deepgram's own warning). flux-general-multi only.	detect	detect	static	static
Initial hints	Languages Flux should expect at call start. With Detect-then-lock: starting set for turn 1, narrowed afterward. With Static: used unchanged for every turn. India default: English + Hindi (monolingual-grade for both, falls through to auto-detect for outliers).	en, hi	en, hi	en	en
EOT threshold	Confidence (0.5–0.9) Flux's own end-of-turn model needs before declaring EOT. Plugin default 0.7; our default 0.6 (paired with conservative VAD turn-detection). Lower = snappier but more mid-sentence cutoffs.	0.6	0.5	0.7	0.8
EOT timeout (ms)	Maximum ms Flux waits for a confident EOT before forcing one. Plugin default 3000. Lowering trims worst-case latency at the cost of cutting off slow speakers.	3000	2000	3000	4500
Eager EOT	Early-fire confidence (0.3–0.9) that lets Flux emit a "likely end of turn" signal ahead of the final EOT. Wired into LiveKit's preemptive_generation: when set, the LLM call starts ~150ms earlier on every turn (auto-cancelled if the user keeps talking via TurnResumed). Must be ≤ EOT threshold. Blank = disabled.	0.4	0.3	off	off
Audio environment
Ambient sound	Background loop that makes the agent sound "in the room". 7 built-in clips (office, crowded room, city, forest, etc.). Local files — no API cost.	Off	Off	Office	Off
Ambient volume	0.0 (silent) to 1.0 (same level as agent). Phone calls usually want 0.15–0.25 because compressed audio masks consonants.	0.30	0.30	0.20	0.30
TTS — ElevenLabs advanced
Stability	0.0 = highly emotional/variable delivery, 1.0 = flat/monotone. Lower values feel more "alive" but more unpredictable across renders.	0.50	0.40	0.55	0.55
Similarity boost	How strictly the synthesized speech tracks the chosen voice's clone. Higher = closer to the source recording; lower = more freedom to drift.	0.75	0.75	0.80	0.75
Style exaggeration	0 = neutral (fastest synthesis), 1 = exaggerated emotion. Higher values add noticeable latency and can introduce artefacts.	0	0	0	0
Speaker boost	Extra processing pass that nudges similarity toward the source voice. Mild latency cost. On by default.	on	off	on	on
Speed	Speech rate multiplier. Plugin range 0.8–1.2 (narrower than the ElevenLabs web UI's 0.7–1.2). 1.0 is the model's natural pace.	1.00	1.05	1.00	0.95
Auto mode	Disables chunk-schedule buffering for lower latency. Plugin enables this by default — unchecking re-introduces buffering for higher-quality concatenation between phrases. Recommended ON.	on	on	on	on
Streaming latency	0 (default) – 4 (max optimization). Deprecated upstream but still functional. Leave blank to skip and use the plugin default.	—	3	—	—
TTS — Sarvam advanced
Pitch	Voice pitch shift, -0.75 (deeper) to +0.75 (higher). 0 = natural. v2 models only — silently dropped for v3 / v3-beta.	0	0	0	0
Pace	Speech rate multiplier, 0.3 (very slow) to 3.0 (very fast). 1.0 is the model's natural pace.	1.00	1.10	1.00	0.90
Loudness	Output volume multiplier, 0.5 to 2.0. v2 models only. Use sparingly — over-boosting clips on phone calls.	1.00	1.00	1.10	1.00
Preprocessing	Text preprocessing (numbers, dates, abbreviations spelled out). bulbul:v2 only. Helps prevent "Rs. 2024" being read as "rupees two thousand twenty-four" in unintended places.	off	off	on	on
Sample rate	Audio sample rate (Hz). Allowed: 8000 / 16000 / 22050 / 24000 / 32000 / 44100 / 48000. 22050 is the sweet spot for voice; phone trunks downsample to 8000 anyway.	22050	22050	22050	22050
STT — Smallest.ai advanced
Sample rate	Audio sample rate (Hz) the WebSocket stream uses. Allowed: 8000 / 16000 / 22050 / 24000 / 44100 / 48000. 16000 is the plugin default and the right value for voice; 8000 only when feeding straight from a narrowband phone trunk.	16000	16000	16000	16000
Encoding	Encoding of the input audio stream. linear16 is the plugin default and the most-compatible choice; mulaw/alaw for phone trunks; opus/ogg_opus if the source is already Opus.	linear16	linear16	mulaw	linear16
Word timestamps	Per-word start/end timestamps + confidence in transcripts. Plugin default ON. Turn off only to shave a few bytes per response — saves no latency.	ON	ON	ON	ON
Diarization	Per-word speaker IDs (integers during streaming). Plugin default OFF. Turn on only when you need to split a multi-speaker audio stream — adds upstream cost.	OFF	OFF	OFF	OFF
EOU timeout (ms)	Server-side end-of-utterance silence threshold. Plugin default 0 = disabled, which is the right value — LiveKit's own turn detector owns EOU. Setting >0 stacks Smallest's server-side EOU on top, adding latency for no upside.	off	off	off	off
TTS — Smallest.ai advanced
Speed	Speech rate multiplier. Plugin default 1.0. Lightning v3.1 sounds natural in 0.85–1.15; outside that range gets noticeably robotic.	1.00	1.10	1.00	0.90
Sample rate	Audio sample rate (Hz). Plugin default 24000 is the right value for Lightning v3.1 quality; lower only when bandwidth is tight.	24000	24000	24000	24000
Output format	Audio container/encoding. pcm is the plugin default and the right pick for piping into LiveKit. mulaw/alaw only when terminating directly into a SIP trunk that wants narrowband.	pcm	pcm	pcm	pcm
Consistency / Similarity / Enhancement	lightning-v2 ONLY — the plugin silently drops these on v3.1. Future-proofing for operators who hand-edit the saved config to switch back to v2. Defaults: consistency 0.5, similarity 0, enhancement 1.	—	—	—	—
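
The Deepgram Flux rows in the table state three constraints: EOT threshold lies in 0.5–0.9, Eager EOT lies in 0.3–0.9, and Eager EOT must be ≤ EOT threshold. A minimal sketch of a check for them follows; the function name is hypothetical, and a blank Eager EOT field is modelled as None.

```python
def validate_flux_eot(eot_threshold, eager_eot=None):
    """Return a list of problems; an empty list means the pair is valid."""
    problems = []
    if not 0.5 <= eot_threshold <= 0.9:
        problems.append("EOT threshold must be within 0.5-0.9")
    if eager_eot is not None:  # None models a blank (disabled) field
        if not 0.3 <= eager_eot <= 0.9:
            problems.append("Eager EOT must be within 0.3-0.9")
        if eager_eot > eot_threshold:
            problems.append("Eager EOT must be <= EOT threshold")
    return problems
```

The Default column (EOT threshold 0.6, Eager EOT 0.4) and the Snappy column (0.5 and 0.3) both pass this check.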

Scenarios — what to tune for what

"Faster response, please"

Set Response rate → Rapid (drops floor 500→200ms)
Set Linear delay → 0 (removes additive padding)
Set Take turn → 1000ms (force commit sooner when uncertain)
Keep Preemptive LLM ON (saves another ~250ms)
Lower STT model latency — Deepgram Flux beats Sarvam Saaras by ~200ms

Expected EOU drop: 1.5s → 0.6s

"Easier to interrupt the agent"

Lower AEC warmup → 0.3s (interrupts work right from call start)
Lower Min interrupt → 200ms (shorter user speech triggers)
Lower Min words → 1 (even "stop" works)
Keep Resume false intrr. ON (recovers from accidental cuts)

Risk: more false positives. Coughs, breaths, or a background TV may cut off the agent. Raise the values again if that happens.

"Stop the agent from cutting off slow users"

Set Response rate → Patient (1s floor)
Set Take turn → 3000ms (allow long pauses)
Add Linear delay → 300ms (extra padding)
Raise Min words → 3 (don't react to single "um" interjections)

Trade-off: slower replies. Use only if users are elderly / non-native / pause a lot.

"Phone calls dropping noisy interrupts"

Set AEC warmup → 1.0s (phone AEC needs more convergence)
Set Min interrupt → 400ms (filter line noise)
Set Min words → 2 (filter cough/breath)
Try a different noise-cancellation setting on the SIP trunk

"Outbound calls hitting IVR menus"

Turn IVR detection ON
Raise Min words → 3 (avoid reacting to "Press 1 for…")
Bump Max duration → 900s (IVR navigation can take time)

"Caller's environment is noisy"

Set Noise cancellation → Auto in Conversation behavior → Advanced.
For phone calls (Plivo / Twilio), Auto picks BVCTelephony — tuned for 8 kHz narrowband.
For web calls, Auto picks BVC — tuned for 16 kHz wideband.
Both add 5–10% CPU per call. Cap concurrent calls accordingly.

If Krisp licence isn't installed, BVC falls back to a basic filter — still removes most steady-state noise (fans, traffic) but not transient sounds (door slam, cough).
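
To put a number on "cap concurrent calls accordingly", here is a rough sketch. It assumes the 5–10% figure is a share of the worker's total CPU and that you pick a CPU budget to reserve for noise cancellation; both the budget and the function name are hypothetical. It defaults to the 10% worst case.

```python
import math


def max_calls_with_nc(cpu_budget_pct, per_call_pct=10.0):
    """Concurrent calls that fit in the CPU share reserved for Krisp NC."""
    if per_call_pct <= 0:
        raise ValueError("per_call_pct must be positive")
    return math.floor(cpu_budget_pct / per_call_pct)
```

Reserving 60% of the worker for noise cancellation gives 6 calls at the 10% worst case and 12 calls at 5%. Measure on your own worker before relying on either figure.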

"TV / café in background — agent misses my words"

Krisp BVC removes most non-speech noise but struggles with speech-on-speech (TV dialogue, café conversations, open-plan offices). The VAD wakes up on the ambient speech and the agent either pauses or fires off a response to a garbled transcript.

Click Noisy room in the One-click presets at the top of this modal — one tap sets all 8 knobs below.
Manually: Noise cancellation = Auto, Min interrupt = 1000 ms, Min words = 4, Resume after = 2.5 s, Take turn = 3000 ms, Turn-detector threshold = 0.40.
If using Deepgram Flux: bump EOT threshold to 0.85 and EOT timeout to 5000 ms.
For the deepest fix, set the worker env var VAD_ACTIVATION_THRESHOLD=0.7 and restart the agent — Silero's VAD will then ignore quieter speech-like background entirely. Default is 0.55.
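
The manual values above can be written down as a small overlay on an existing config. This is only a sketch: the key names are hypothetical (the real schema is not shown here), and it assumes the two Flux knobs are applied only when Flux is the active STT.

```python
# Values from the manual steps above; the key names are hypothetical.
NOISY_ROOM_PRESET = {
    "noise_cancellation": "auto",
    "min_interrupt_ms": 1000,
    "min_words": 4,
    "resume_after_s": 2.5,
    "take_turn_ms": 3000,
    "turn_detector_threshold": 0.40,
}
# Only relevant when the STT is Deepgram Flux.
NOISY_ROOM_FLUX = {"eot_threshold": 0.85, "eot_timeout_ms": 5000}


def apply_noisy_room(config, using_flux=False):
    """Return a copy of config with the noisy-room knobs applied."""
    updated = {**config, **NOISY_ROOM_PRESET}
    if using_flux:
        updated.update(NOISY_ROOM_FLUX)
    return updated
```

The function returns a new dict and leaves the input untouched, so the previous values stay available if the preset turns out to be too strict.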

Hard truth: no software solution fully handles a TV at normal volume sitting next to the mic. The real fix is to lower the TV, use a directional mic, or call from a quieter room. Software can only paper over so much.

"Make the agent feel like a real person"

Under Conversation behavior → Advanced → Audio environment, pick an ambient (Office, Call centre, City…).
Keep Volume around 0.20–0.30. Higher drowns out the agent's voice, especially on phone calls.
For outbound to IVR, leave Ambient off — the machine on the other end doesn't need to be fooled.
For web demos to investors / customers, the Office or Crowded room clip is usually the most natural-sounding.

Uses local files shipped with livekit-agents — no API or per-minute cost.

"Latency feels high — where's the time going?"

Watch the live Latency Breakdown card during a call. Numbers come from real metrics.

STT > 500ms → switch from Sarvam Saaras to Deepgram Nova-3 (multilingual) or Flux (English).
EOU > 1s → tune Pipeline A above. Common cause: Response rate set to Balanced/Patient with no Endpointing override.
LLM > 1.5s TTFT → switch from gpt-4o to gpt-4o-mini, or Llama-3.3-70B on Groq (sub-200ms TTFT).
TTS > 400ms TTFB → use Sarvam Bulbul (real-time) instead of OpenAI TTS (batch).
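
The four thresholds above amount to a simple triage table. A minimal sketch, assuming you read the four numbers (in milliseconds) off the Latency Breakdown card yourself; the function name and argument names are hypothetical.

```python
def latency_hints(stt_ms, eou_ms, llm_ttft_ms, tts_ttfb_ms):
    """Map Latency Breakdown numbers to the suggestions listed above."""
    hints = []
    if stt_ms > 500:
        hints.append("STT: try Deepgram Nova-3 or Flux instead of Sarvam Saaras")
    if eou_ms > 1000:
        hints.append("EOU: tune Pipeline A (Response rate, Endpointing)")
    if llm_ttft_ms > 1500:
        hints.append("LLM: try gpt-4o-mini or Llama-3.3-70B on Groq")
    if tts_ttfb_ms > 400:
        hints.append("TTS: try Sarvam Bulbul instead of OpenAI TTS")
    return hints
```

A call with 650ms STT and 1200ms EOU but healthy LLM and TTS numbers returns the first two hints only.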

"Cheaper / faster Indian voices"

Smallest.ai's Pulse STT is among the cheapest in the dropdown (~₹0.48/min vs Deepgram ~₹0.46/min for nova-3-en), and Lightning v3.1 TTS supports Hindi + code-mixing.

Switch STT to Smallest Pulse for low cost. Pulse Realtime trades ~₹0.30/min more for ~50ms less TTFT.
Switch TTS to Smallest Lightning v3.1. Voices fetched live from their API — pick one in the voice modal.
For pure Hindi calls, set language=hi in Smallest TTS advanced. For Hinglish, leave language=en (the model handles code-mixing).

First call may add ~150ms while their CDN warms up — subsequent calls are sub-200ms.

"STT misses words when the language isn't English"

Deepgram Flux on flux-general-multi with no language hint sometimes misidentifies Hindi (or other Indic) as Russian/Portuguese, causing the transcript to be garbage. The fix is the Detect-then-lock pattern.

Open STT → Deepgram (advanced).
Set Language strategy → Detect then lock (the default for the saved config).
Turn on the Initial hints for languages you expect — for India deployments, en + hi.
On every live call, the badge next to the timer shows the wrapper's current state: yellow eye = detected (still observing), teal lock = locked to a single language for the rest of the call.
If the caller switches languages mid-call, the wrapper re-locks after ≥2 consecutive turns in the new language (one stray turn doesn't flip it — avoids spurious WS reconnects on noise).

Reconnect cost: ~200ms WS handshake when the lock fires (paid in the between-turn silence — the caller hears nothing). After that, every subsequent turn benefits from monolingual-grade accuracy for the locked language. Static mode skips this entirely; Auto-detect mode never locks.
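
The lock and re-lock rule can be modelled as a tiny state machine. The sketch below is a toy model of the behaviour described above, not the wrapper's actual code: the class name and method are hypothetical, and it ignores the WebSocket reconnect entirely.

```python
class DetectThenLock:
    """Toy model of the lock rule; not the platform's wrapper."""

    RELOCK_AFTER = 2  # consecutive turns in a different language

    def __init__(self, initial_hints):
        self.hints = list(initial_hints)  # broad set used for turn 1
        self.locked = None                # language locked after turn 1
        self._candidate = None
        self._streak = 0

    def on_turn(self, detected):
        """Feed one turn's detected language; return the active lock."""
        if self.locked is None:
            self.locked = detected  # first turn: narrow to it
        elif detected == self.locked:
            # Back on the locked language: forget any stray turn.
            self._candidate, self._streak = None, 0
        else:
            if detected == self._candidate:
                self._streak += 1
            else:
                self._candidate, self._streak = detected, 1
            if self._streak >= self.RELOCK_AFTER:
                self.locked = detected
                self._candidate, self._streak = None, 0
        return self.locked
```

Feeding the turns hi, en, hi, en, en keeps the lock on hi through the single stray en turn and flips to en only on the fifth turn, after two consecutive en turns.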

"Agent cuts me off when I say 'umm'"

The STT is committing end-of-turn during the brief pause before the filler word lands.

Open STT → Deepgram (advanced) in the Models column.
If using Nova: turn Filler words ON and bump DG endpointing → 400ms (the plugin's built-in 25ms is far too aggressive).
If using Flux — raise EOT threshold → 0.8 and EOT timeout → 4000ms so Flux waits longer before declaring end-of-turn. Leave Eager EOT blank.
Raise Turn-detector threshold → 0.25–0.35 so the semantic model is more cautious during borderline pauses.
Optionally raise Linear delay → 200ms for extra breathing room.

Works best with Hindi / multilingual callers who naturally use "matlab" / "yaani" / "haan" as fillers.

How saving works

Every knob on this page round-trips through configs/default.json:

Tweak any knob in the UI.
Click Save as default in the header.
The config is persisted; the next call (web or phone) loads it.
Refresh the page → the UI reads back from default.json, so what you saved is what you see.

Phone calls (Plivo / Twilio) don't carry per-call overrides, so they always use the saved default. For web calls, the UI can also send overrides per-call without saving.
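
The round trip and the web-only override rule can be sketched as follows. This assumes a flat JSON object and hypothetical function names; the actual layout of configs/default.json is not shown here.

```python
import json
from pathlib import Path


def load_call_config(path, channel, overrides=None):
    """Load the saved default; apply per-call overrides for web calls only."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    if channel == "web" and overrides:
        config = {**config, **overrides}  # not written back to disk
    return config


def save_as_default(path, config):
    """Persist the config so the next call and a page refresh both see it."""
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")
```

In this sketch a phone call ignores any overrides passed in, while a web call merges them on top of the saved default without touching the file.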
