
Jul 29, 2025

Anyone building voice AI agents knows how hard it is to stay up-to-date with the latest text-to-speech voice models.
We spend time testing and experimenting with all of the available paid and open-source text-to-speech voice AI models. Our goal with this post was to consolidate our own notes and experience testing different models into a single resource for developers evaluating multiple models.
If you see any missing models, or details or nuance we haven’t captured, please let us know and we’ll update the post.
For those who may be new to voice AI, we’ll start with a quick primer on the current state of text-to-speech voice AI models. (If you’re just looking for the model comparison, skip the next section).
Text-to-speech voice models are improving rapidly
A year ago the only reliable way to add a fluid, natural-sounding voice to an LLM-powered AI agent in production was to call a model provider’s API and accept the cost, latency and vendor lock-in associated with choosing a cloud service.
Today, things look quite different. While the quality of speech that proprietary models can generate has improved tremendously, open-source models like Coqui XTTS v2.0.3, Canopy Labs’ Orpheus and Hexgrad’s Kokoro 82M have developed in lockstep: in blind tests, most listeners can’t reliably tell them apart from the incumbents.
Broadly, today's models fall into two distinct categories that serve fundamentally different purposes:
Real-time models like Cartesia Sonic, ElevenLabs Flash, and Hexgrad Kokoro prioritize streaming audio generation, producing speech as text arrives rather than waiting for complete sentences. These models excel in conversational AI, where low latency makes the difference between natural dialogue and awkward pauses. They are often architected for immediate response but may sacrifice some prosodic quality for speed.
Non-real-time models like Dia 1.6B and Coqui XTTS take the opposite approach: processing entire text passages to optimize for naturalness, emotion, and overall speech quality. They're ideal for content creation, audiobook narration, or any application where the extra processing time translates to noticeably better output quality.
This architectural difference explains why you'll see such variation in the latency figures across our comparison table — it's not just about optimization, but the models’ fundamental design and intended purpose.
Making sense of voice AI latency metrics
When evaluating model speed, you'll often encounter "TTFB" (Time To First Byte): This measures how long it takes from sending your text request to receiving the first chunk of audio data back (essentially, how quickly you hear the voice start speaking). This metric is crucial for real-time applications because it directly impacts the responsiveness users experience.
For context, human conversation typically has response delays under 200ms, so TTFB figures above this threshold start to feel unnatural in conversational AI. However, TTFB only tells part of the story: total processing time for longer passages and the consistency of streaming quality matter just as much for overall user experience. The nuances of conversational latency are a separate, deep (and fascinating) topic. More on that soon!
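To make TTFB concrete, here is a minimal sketch of how you might measure it against a streaming TTS endpoint. The URL, headers, and payload below are placeholders rather than any particular provider's API; substitute the real streaming endpoint you're evaluating.

```python
import time
import requests

# Placeholder endpoint and payload -- swap in your provider's real streaming TTS API.
TTS_STREAM_URL = "https://api.example-tts.com/v1/stream"
payload = {"text": "Hello! How can I help you today?", "voice": "default"}
headers = {"Authorization": "Bearer YOUR_API_KEY"}

start = time.perf_counter()
with requests.post(TTS_STREAM_URL, json=payload, headers=headers, stream=True) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=1024):
        if chunk:
            # Time from sending the request to receiving the first audio bytes = TTFB.
            print(f"TTFB: {(time.perf_counter() - start) * 1000:.0f} ms")
            break
```

Note that a measurement like this includes network round-trip time, so run it from the region (and under the network conditions) your agent will actually serve.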
Why great voice models don’t = great voice AI products
In short, latency and cost (the twin hurdles that kept real-time speech out of most production roadmaps) have been dramatically reduced in the last 12 months.
But, cheap, fast, high-quality voices alone don’t automatically translate into great real-time conversational products. A production-grade agent still needs to capture microphone audio, gate and normalise it, transcribe it in real time, pass clean text to an LLM or custom backend, stream the response to the chosen TTS, then return the synthesized audio with no audible gaps — all while handling disconnects, turn-taking, silence detection, regional scaling and usage accounting.
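To make that concrete, here's a heavily simplified sketch of that orchestration loop. Every component name in it (capture_microphone, detect_speech, transcribe_stream, llm_stream, tts_stream, play_audio) is a hypothetical placeholder rather than a real library; the point is how many stages have to be wired together and kept streaming without gaps.

```python
import asyncio

# All of these are hypothetical placeholders for real STT / LLM / TTS clients,
# not an actual package -- shown only to illustrate the shape of the pipeline.
from my_voice_stack import (
    capture_microphone,   # async generator of raw audio frames
    detect_speech,        # VAD: gates silence and detects end of turn
    transcribe_stream,    # streaming speech-to-text
    llm_stream,           # streams the LLM response, sentence by sentence
    tts_stream,           # streams synthesized audio chunks from the chosen TTS
    play_audio,           # plays audio chunks back with no audible gaps
)

async def conversation_loop():
    async for utterance in detect_speech(capture_microphone()):
        # 1. Transcribe the user's turn as the audio arrives.
        user_text = await transcribe_stream(utterance)

        # 2. Stream the LLM response into TTS sentence by sentence, so synthesis
        #    starts before the full reply has been generated.
        async for sentence in llm_stream(user_text):
            async for audio_chunk in tts_stream(sentence):
                await play_audio(audio_chunk)

        # Not shown: barge-in and turn-taking, disconnect handling, regional
        # scaling, and usage accounting -- each adds real complexity in production.

asyncio.run(conversation_loop())
```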
That complexity is exactly where many developers start to get a headache when looking at building production-ready voice AI, even if they have a great application in mind.
With open models now matching the leaders on speech quality and emotional range (and often beating them on speed), the main competitive frontier is infrastructure: who can deliver those voices, at scale, with the lowest latency and the least friction?
We’re building Layercode to eliminate this complexity from the equation: it handles all of the plumbing required to power production-ready, low-latency voice agents (read more about how here).
Layercode is currently in beta, and we’re working to integrate as many model providers as we can. If you are working on one of the voice models we’ve tested for this post — or one we haven’t — we’d love to explore an integration.
Comparing today’s leading text-to-speech voice models
Beyond the real-time vs. non-real-time distinction, there's significant nuance to consider when evaluating text-to-speech voice models for your specific use case.
Customer service bots and phone-based agents benefit from the ultra-low latency of real-time models like Cartesia Sonic or ElevenLabs Flash, where every millisecond of delay affects conversation flow. Content creation workflows—podcast generation, audiobook narration, or video voiceovers—can leverage the superior quality of non-real-time models like Dia 1.6B and Eleven Multilingual v2, where processing time matters less than the final output.
In our experience, new models' marketing claims don't always match our direct experience testing and using the models in production scenarios.
The comparison table below shows how these models stack up across key technical dimensions, followed by our hands-on experience with each model. Note how the real-time models cluster around the 40-200ms TTFB range, while non-real-time models prioritize quality over speed.
To keep this resource focused on the most practical concern for developers, we've ranked the subsequent list of models by overall voice quality, as experienced by end users.
ElevenLabs Flash v2.5
ElevenLabs Flash v2.5 is a popular, ultra-low-latency multilingual text-to-speech voice model and ElevenLabs' fastest — ideal for real-time voice agents. Flash v2.5 boasts sub-100ms TTFB in 30+ languages while preserving high voice quality, and it sets the bar for high-fidelity 5-second voice cloning. (A minimal usage sketch follows the table below.)
✅ Pros | ❌ Cons |
---|---|
Very fast start-up (~75 ms) | 2nd highest cost (of evaluated models) |
Best-in-class multilingual voice cloning quality | Closed ecosystem |
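For a sense of what calling Flash v2.5 looks like in practice, here's a minimal sketch against ElevenLabs' streaming text-to-speech REST endpoint. The voice ID and API key are placeholders, and request parameters can change between API versions, so treat this as illustrative and check the current ElevenLabs docs.

```python
import requests

VOICE_ID = "YOUR_VOICE_ID"  # placeholder: any voice from your ElevenLabs account
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream"

resp = requests.post(
    url,
    headers={"xi-api-key": "YOUR_API_KEY"},
    json={
        "text": "Thanks for calling! How can I help you today?",
        "model_id": "eleven_flash_v2_5",  # the Flash v2.5 model
    },
    stream=True,
)
resp.raise_for_status()

# Write the streamed audio (MP3 by default) to disk as it arrives.
with open("reply.mp3", "wb") as f:
    for chunk in resp.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
```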
OpenAI GPT-4o mini TTS
OpenAI's GPT-4o mini TTS is a text-to-speech model built on GPT-4o mini that supports 32 languages and a wide range of customizable expressions via prompting. Average TTFB hovers at just under a quarter-second.
✅ Pros | ❌ Cons |
---|---|
Tight integration with OpenAI toolchain | No voice cloning |
Prompt-level style control |
Deepgram Aura-2
Deepgram's Aura-2 is a TTS model targeting enterprise voice agents, boasting <200ms TTFB latency and simple, per-character pricing. However, it only offers two languages and doesn't support voice cloning.
✅ Pros | ❌ Cons |
---|---|
Affordable pricing for high call volumes | English and Spanish only |
Fast start-up (<200ms) | No voice cloning |
Cartesia Sonic 2.0
Cartesia's Sonic 2.0 is one of the fastest engines on the market: Turbo mode can achieve ~40ms TTFB. Sonic 2.0 also supports 15 languages out of the box and offers instant voice cloning.
✅ Pros | ❌ Cons |
---|---|
Ultra-low latency & 15 languages | Closed-source |
Instant, good quality voice cloning |
Rime Mist v2
Rime's Mist v2 TTS model delivers impressively fast on-prem performance (sub-100ms TTFB) and fast cloud-hosted performance (~175ms TTFB). The model is available in English and Spanish and offers enterprise-grade voice cloning. It's optimized for real-time business use cases and built for scale, with no concurrency limits.
✅ Pros | ❌ Cons |
---|---|
Consistently low latency & no concurrency limits | Only two languages |
Professional cloning tier for brand voices | Good but not excellent voice quality |
Rime Arcana
Arcana is Rime's newest spoken language model, emphasizing more realistic, expressive speech than Mist v2, but at slower speeds (~250ms TTFB). Emotion tags like `<laugh>` and `<sigh>` offer substantial control over voice expression (see the short example after the table below).
✅ Pros | ❌ Cons |
---|---|
Expressive, natural sounding speech | Slower than Mist or Sonic |
Highly customizable via emotion tags | Closed-source |
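To give a feel for the tag-based control, here's the kind of annotated input text this approach uses, built from the tags mentioned above. The exact syntax and full tag set are defined in Rime's documentation, so treat this as illustrative.

```text
Oh wow, I did not expect that <laugh> honestly, that's the best news I've heard all week.
Okay <sigh> let's run through the numbers one more time.
```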
Orpheus (Canopy Labs)
Canopy Labs' Orpheus TTS model is an MIT-licensed multilingual model that offers tag-based emotional control and zero-shot cloning at ~200ms TTFB latency.
✅ Pros | ❌ Cons |
---|---|
Free, open weights with permissive licence | Comparatively difficult to set up |
Solid language spread for an OSS model |
Dia 1.6B
Dia is a popular open-source TTS voice model by Nari Labs that does a very impressive job of producing high-quality, podcast-style audio featuring multiple speakers. Dia also offers voice cloning capabilities.
✅ Pros | ❌ Cons |
---|---|
Some of the most lifelike voices and pacing | No real-time support |
Fully open for research or batch synthesis | English only out of the box |
Sesame CSM-1B
Sesame CSM-1B is an Apache-2 model optimized for interactive voice agents. This open-source model is notably distinct from the model used in their live (and viral) sesame.com demo. In our testing, CSM-1B produces noticeably lower-quality and far less emotionally rich speech than that demo.
✅ Pros | ❌ Cons |
---|---|
Free & self-hostable | Not as impressive as Sesame's viral demo |
Basic voice cloning |
Coqui XTTS v2.0.3
Coqui XTTS v2.0.3 is an open-source TTS model that supports 17 languages and claims <200ms TTFB (on suitable hardware). The model also supports 3-second zero-shot voice cloning. (A minimal self-hosting sketch follows the table below.)
✅ Pros | ❌ Cons |
---|---|
Broadest language support of open-source models | Commercial use requires paid licence |
Voice cloning with only a few seconds of audio | Needs a strong GPU to achieve <200ms |
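Because XTTS is self-hosted, here's a minimal sketch of local synthesis with zero-shot cloning using Coqui's Python package. The model name and arguments follow Coqui's published examples, but they can differ between package versions, so verify against the version you install.

```python
# pip install TTS  (Coqui's Python package; XTTS weights download on first use)
from TTS.api import TTS

# Load the multilingual XTTS v2 model.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Zero-shot voice cloning from a few seconds of reference audio.
tts.tts_to_file(
    text="Hello! This is a quick test of zero-shot voice cloning.",
    speaker_wav="reference_clip.wav",  # short clip of the target voice
    language="en",
    file_path="output.wav",
)
```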
Hexgrad Kokoro 82M
Kokoro is an open-weight TTS model with 82M parameters. Kokoro is a very cost-effective model (under $1 per million characters of text input, or under $0.06 per hour of audio output).
✅ Pros | ❌ Cons |
---|---|
Ultra-light and extremely fast | No built-in cloning capability |
Very affordable hosted pricing | Smaller training set can = artifacts (rare) |
Chatterbox (Resemble AI)
Resemble AI's Chatterbox is an open-source, English TTS model that can achieve <200ms TTFB latency (on suitable hardware). Chatterbox can clone a new voice from a 5-second sample and supports simple emotion prompts.
✅ Pros | ❌ Cons |
---|---|
Open weights with permissive licence | English only for now |
A higher quality open-source option | Very young project |
Unmute (Kyutai)
Unmute combines low-latency TTS (≈ 220ms), STT and turn-taking in one MIT-licensed stack, with 10-second cloning and English/French support. The voice quality is high and testing with various accents yielded strong results.
✅ Pros | ❌ Cons |
---|---|
Complete open pipeline | Limited language coverage so far |
High-quality English voices across a range of accents | Also new: docs and tooling still evolving |
Fluxions AI
Fluxions is an open-source TTS model that targets ~200ms first-audio (with reports of 160ms when running on high-end GPUs). Fluxions includes a basic voice-cloning feature, though the clone quality substantially trails leaders like ElevenLabs.
✅ Pros | ❌ Cons |
---|---|
Free, permissive MIT licence with easy self-hosting | Voice cloning quality lags |
Good quality voice | Limited language support |
What did we miss?
If we missed a model, or you notice a detail that has changed, nuance we didn’t capture, or something else we should include in this guide, we’d love to hear from you.