Speech
How transcription and speech synthesis work behind the scenes
Transcription
We use a combination of Deepgram and Whisper for our transcription pipeline. At the moment, we don’t support other speech-to-text (STT) providers.
Our pipeline includes:
- Noise cancellation: Filters background interference
- Voice isolation: Focuses on the speaker, even in noisy environments
- Turn-taking detection: Helps the agent know when to speak vs. listen
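To make the turn-taking step more concrete, here is a minimal sketch of one common approach, silence-based endpointing. It is an illustration only and not a description of how our pipeline works internally: the function name `detect_end_of_turn`, the per-frame energy input, and both thresholds are assumptions chosen for the example.

```python
from typing import Iterable, Optional


def detect_end_of_turn(
    frame_energies: Iterable[float],
    silence_threshold: float = 0.01,  # assumed value, for illustration only
    min_silence_frames: int = 25,     # assumed value, for illustration only
) -> Optional[int]:
    """Return the index of the frame where the speaker's turn ends, or None.

    The turn is treated as finished once speech has been heard and then
    `min_silence_frames` consecutive frames fall below `silence_threshold`.
    """
    heard_speech = False
    silent_run = 0
    for index, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            # Speech frame: remember that the user spoke and reset the silence counter.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Silent frame after speech: count toward the end-of-turn window.
            silent_run += 1
            if silent_run >= min_silence_frames:
                return index
    return None


# Hypothetical input: leading silence, speech, a short pause, more speech, long silence.
energies = [0.0] * 5 + [0.5] * 10 + [0.0] * 3 + [0.4] * 4 + [0.0] * 30
end_frame = detect_end_of_turn(energies)
```

In this sketch the three-frame pause in the middle does not end the turn, because it is shorter than `min_silence_frames`. Only the long silence at the end does.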
We continuously improve the pipeline, and updates are rolled out automatically.
If you’re experiencing low transcription quality, let us know. We can explore fine-tuning options or support for bringing your own model.
Speech synthesis
We offer six high-quality built-in voices, each professionally cloned and tuned for:
- Natural prosody and intonation
- Accurate pronunciation
- Low-latency streaming
Each voice is replicated across multiple model providers for reliability. If one provider goes down, we can switch to another without interruption. We recommend starting with a built-in voice for most deployments.
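The failover idea described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our actual implementation: the names `synthesize_with_failover`, `AllProvidersFailed`, `flaky_provider`, and `healthy_provider` are hypothetical, and each provider is modeled as a plain function that turns text into audio bytes.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a provider is any callable that turns text into audio bytes.
SynthesizeFn = Callable[[str], bytes]


class AllProvidersFailed(Exception):
    """Raised when every provider in the list has failed."""


def synthesize_with_failover(
    text: str,
    providers: Sequence[Tuple[str, SynthesizeFn]],
) -> Tuple[str, bytes]:
    """Try each provider in order; return (provider_name, audio) from the first success."""
    errors: List[str] = []
    for name, synthesize in providers:
        try:
            return name, synthesize(text)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in providers for the example; they do not call any real service.
def flaky_provider(text: str) -> bytes:
    raise ConnectionError("provider unavailable")


def healthy_provider(text: str) -> bytes:
    return f"audio<{text}>".encode("utf-8")


used_provider, audio = synthesize_with_failover(
    "hello",
    [("primary", flaky_provider), ("backup", healthy_provider)],
)
```

Here the first provider raises an error, so the request falls through to the second one. Because the same voice is assumed to be available from both, the caller receives audio either way.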
If you’d like to use a custom voice, we support ElevenLabs and Cartesia. Get in touch to bring your own.