Pick your stack on the left, customize prompts in the middle, talk on the right. Live latency, real per-call cost, transcript.
Per-stage time-to-first-token / first-audio-byte, measured during the call. Empty until the first turn lands.
Two independent pipelines control conversation flow. Tuning one doesn't affect the other.
After the user stops speaking, the agent waits, watches a turn-detector ML model, then commits to reply.
wait = MIN(Take turn, Response rate floor + Linear delay) + turn-detector hesitation
where Response rate floor = Rapid → 200ms · Balanced → 500ms · Patient → 1000ms · Custom → whatever you type in Endpointing
There's also a ~250ms VAD silence floor underneath everything (Silero default, hardcoded). It stacks BEFORE this formula starts counting.
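As a rough illustration, the wait formula can be written as a small Python function. This is a sketch, not the app's actual code: the function and parameter names are made up for the example, and `hesitation_ms` stands in for whatever extra delay the turn-detector adds at runtime, which this page does not quantify.

```python
# Preset floors as listed above (milliseconds).
RESPONSE_RATE_FLOOR_MS = {"rapid": 200, "balanced": 500, "patient": 1000}

# Approximate VAD silence floor that elapses before the formula starts counting.
VAD_SILENCE_FLOOR_MS = 250


def reply_wait_ms(response_rate, take_turn_ms, linear_delay_ms=0,
                  endpointing_ms=None, hesitation_ms=0):
    """wait = MIN(Take turn, floor + Linear delay) + turn-detector hesitation.

    All names here are hypothetical; hesitation_ms is an assumed input.
    """
    if response_rate == "custom":
        if endpointing_ms is None:
            raise ValueError("Custom response rate needs an Endpointing value")
        floor = endpointing_ms
    else:
        floor = RESPONSE_RATE_FLOOR_MS[response_rate]
    return min(take_turn_ms, floor + linear_delay_ms) + hesitation_ms


def total_silence_ms(*args, **kwargs):
    """Same wait, with the ~250ms VAD floor stacked in front of it."""
    return VAD_SILENCE_FLOOR_MS + reply_wait_ms(*args, **kwargs)
```

With the Default preset and no hesitation, `reply_wait_ms("balanced", take_turn_ms=2000)` gives 500, and the VAD floor brings the total silence to roughly 750ms. With Custom, an Endpointing value above Take turn is capped at Take turn.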
While the agent is speaking, the user's voice may cut it off. Cutoff requires ALL of:
AEC warmup has elapsed
user speech ≥ Min interrupt
user words ≥ Min words
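The three conditions above combine with a plain AND. A minimal sketch of that gate, assuming the transcript is split on whitespace to count words (the real word counting may differ) and using hypothetical names throughout:

```python
from dataclasses import dataclass


@dataclass
class InterruptSettings:
    # Field defaults mirror the Default column of the settings table.
    aec_warmup_s: float = 0.5
    min_interrupt_ms: int = 300
    min_words: int = 2


def should_cut_off(settings, seconds_since_call_start, user_speech_ms, transcript):
    """Return True only when ALL three interruption conditions hold."""
    warmed_up = seconds_since_call_start >= settings.aec_warmup_s
    long_enough = user_speech_ms >= settings.min_interrupt_ms
    # Assumption: words are counted by whitespace splitting.
    enough_words = len(transcript.split()) >= settings.min_words
    return warmed_up and long_enough and enough_words
```

With the defaults, a lone "stop" does not interrupt because Min words is 2; lowering Min words to 1 lets it through once the speech is long enough.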
| Setting | What it does | Default | Snappy | Phone | Patient |
|---|---|---|---|---|---|
| Response speed (Pipeline A) | |||||
| Response rate | Minimum silence wait after you stop speaking before the agent replies. Rapid=200ms, Balanced=500ms, Patient=1000ms, Custom=reveals an Endpointing input where you type the exact ms. | Balanced | Rapid | Balanced | Patient |
| Endpointing (ms) | Only visible when Response rate = Custom. Sets the exact silence floor in milliseconds. | — | — | — | — |
| Linear delay (ms) | Padding added ON TOP of the floor. Lets you give extra breathing room for thinkers without changing the preset. | 0 | 0 | 0 | 200 |
| Take turn (ms) | Ceiling. If turn-detector ML stays uncertain this long, force-commit the turn anyway. | 2000 | 1000 | 1500 | 3000 |
| Turn-detector threshold | Confidence (0–1) the semantic EOU model needs before committing a turn within min_delay. Below threshold the session keeps waiting up to Take turn. Empty = use the model's per-language default. | auto | auto | auto | 0.35 |
| User interruption — how callers cut into the agent (Pipeline B) | |||||
| AEC warmup (sec) | After call starts, NO interrupts allowed for this long while echo-cancellation converges. Lower = snappier early interrupts, higher = cleaner audio. | 0.5 | 0.3 | 1.0 | 1.5 |
| Min interrupt (ms) | Your speech must last at least this long to count as an interruption. Lower = easier interrupt, more false positives. | 300 | 200 | 400 | 600 |
| Min words | Your transcript must contain at least this many words. 1 = even "stop" interrupts. | 2 | 1 | 2 | 3 |
| Resume false interrupt | If you triggered an interrupt but stopped quickly, resume the agent's cut-off reply automatically. | ON | ON | ON | ON |
| Resume after (sec) | How long to wait before resuming the cut-off reply. | 1.5 | 1.5 | 1.5 | 2.0 |
| Other | |||||
| Preemptive LLM | Start the LLM call before turn-detector fully commits. Saves ~250 ms per turn. | ON | ON | ON | ON |
| IVR detection | Detect "Press 1 for sales" auto-attendants. Mainly useful for outbound calls to call centres. | OFF | OFF | OFF | OFF |
| Welcome delay (ms) | Sleep this long before the agent's first greeting. Useful when phone audio path needs a moment to settle. | 0 | 0 | 500 | 0 |
| Hangup on silence | Auto-disconnect if the user is silent for the threshold. Useful for unmanned outbound. | OFF | OFF | OFF / 30s | OFF |
| Online check msg / after | Probe the user with a short message after N seconds of silence ("Are you still there?"). Fires before hangup. | "Are you still there?" / 9s | — | "Are you still there?" / 10s | "Still there?" / 20s |
| Max duration (sec) | Hard ceiling on call length. Defensive against runaway calls. | 600 | 600 | 900 | 900 |
| STT — Deepgram Nova advanced | |||||
| DG endpointing (ms) | Silence Deepgram waits before emitting its own end-of-utterance hint. The plugin's built-in 25ms is far too aggressive — Deepgram themselves recommend 300–500ms. Lower = snappier but more mid-sentence cutoffs on slow speakers. | 300 | 200 | 300 | 500 |
| DG filler words | Keep "um"/"uh"/"hmm" in the transcript. When OFF those tokens get stripped and the turn-detector sees a "done" transcript while the user is still hesitating — the #1 cause of premature EOU. | ON | ON | ON | ON |
| DG smart format | Add punctuation, numeric formatting, etc. to transcripts. Improves prompt readability and unlocks future punctuation-based endpointing modes. | ON | ON | ON | ON |
| STT — Deepgram Flux advanced (server-side EOU) | |||||
| EOT threshold | Confidence (0.5–0.9) Flux's own end-of-turn model needs before declaring EOT. Plugin default 0.7. Lower = snappier but more mid-sentence cutoffs. | 0.7 | 0.5 | 0.7 | 0.8 |
| EOT timeout (ms) | Maximum ms Flux waits for a confident EOT before forcing one. Plugin default 3000. Lowering trims worst-case latency at the cost of cutting off slow speakers. | 3000 | 2000 | 3000 | 4500 |
| Eager EOT | Early-fire confidence (0.3–0.9) that lets Flux emit a "likely end of turn" signal ahead of the final EOT, so preemptive LLM generation can kick in sooner. Must be ≤ EOT threshold. Blank = disabled. | off | 0.3 | off | off |
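The Flux rows at the end of the table carry a few constraints (EOT threshold in 0.5–0.9, Eager EOT in 0.3–0.9 and no higher than EOT threshold, blank meaning disabled). A sketch of how those could be checked before a call, with a hypothetical function name and the extra assumption that the timeout must be positive:

```python
from typing import List, Optional


def check_flux_settings(eot_threshold: float = 0.7,
                        eot_timeout_ms: int = 3000,
                        eager_eot: Optional[float] = None) -> List[str]:
    """Return a list of problems; an empty list means the values look valid.

    eager_eot=None models a blank field (Eager EOT disabled).
    """
    problems = []
    if not 0.5 <= eot_threshold <= 0.9:
        problems.append("EOT threshold must be between 0.5 and 0.9")
    if eot_timeout_ms <= 0:  # assumption: the page states no lower bound
        problems.append("EOT timeout must be a positive number of ms")
    if eager_eot is not None:
        if not 0.3 <= eager_eot <= 0.9:
            problems.append("Eager EOT must be between 0.3 and 0.9")
        if eager_eot > eot_threshold:
            problems.append("Eager EOT must be <= EOT threshold")
    return problems
```

The Snappy column (0.5, 2000, 0.3) passes this check, while an Eager EOT of 0.8 against the default threshold of 0.7 is rejected.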
Expected EOU drop: 1.5s → 0.6s
Risk: more false positives. Coughs, breaths, background TV → may cut off the agent. Bump back up if it happens.
Trade-off: slower replies. Use only if users are elderly / non-native / pause a lot.
Watch the live Latency Breakdown card during a call. Numbers come from real metrics.
The STT is committing end-of-turn during the brief pause before the filler word lands.
Works best with Hindi / multilingual callers who naturally use "matlab" / "yaani" / "haan" as fillers.
Every knob on this page round-trips through configs/default.json, so what you saved is what you see. Phone calls (Plivo / Twilio) don't carry per-call overrides, so they always use the saved default. For web calls, the UI can also send overrides per-call without saving.
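One way to picture the override rule is sketched below. It assumes the saved settings are a flat dictionary and that overrides are merged key by key; the actual layout of configs/default.json is not described here, and the function name, the channel labels, and the `take_turn_ms` key are hypothetical.

```python
def effective_settings(saved, channel, overrides=None):
    """Resolve the settings a call runs with.

    Phone calls always use the saved default; web calls may layer
    per-call overrides on top without changing what is saved.
    """
    if channel == "web" and overrides:
        # Shallow merge (assumption): override keys win, the rest come from saved.
        return {**saved, **overrides}
    return dict(saved)
```

For example, with `saved = {"take_turn_ms": 2000}`, a web call sending `{"take_turn_ms": 1000}` runs with 1000, while a phone call still runs with 2000, and `saved` itself is left unchanged in both cases.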
Placeholders like {agent_name} were found in your greeting / prefix / base prompt. Whatever you enter here replaces them before the call is sent, so the same template works for many calls.
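A minimal sketch of this kind of substitution, assuming placeholders are single words in curly braces and that any placeholder without a supplied value is left untouched (the real behaviour for missing values is not stated here); both function names are hypothetical:

```python
import re

# Matches {name} where name is letters, digits or underscores.
_PLACEHOLDER = re.compile(r"\{(\w+)\}")


def find_placeholders(template):
    """List distinct placeholder names in order of first appearance."""
    seen = []
    for name in _PLACEHOLDER.findall(template):
        if name not in seen:
            seen.append(name)
    return seen


def fill_placeholders(template, values):
    """Replace each {name} with values[name]; unknown names stay as-is."""
    return _PLACEHOLDER.sub(
        lambda m: str(values.get(m.group(1), m.group(0))), template
    )
```

For instance, `fill_placeholders("Hi, this is {agent_name}.", {"agent_name": "Asha"})` returns `"Hi, this is Asha."`, so one greeting template can be reused across calls with different values.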
Describe what the voice agent should do — its role, the goal of the call, key flow notes. The LLM expands this into a polished prompt; voice-call rules, language and persona are added automatically as the prefix.