
Jul 29, 2025

Anyone building voice AI agents knows how hard it is to stay up-to-date with the latest text-to-speech voice models.
We spend time testing and experimenting with all of the available paid and open-source text-to-speech voice AI models. Our goal with this post was to consolidate our own notes and experience testing different models into a single resource for developers evaluating multiple models.
If you see any missing models, or details or nuance we haven’t captured, please let us know and we’ll update the post.
For those who may be new to voice AI, we’ll start with a quick primer on the current state of text-to-speech voice AI models. (If you’re just looking for the model comparison, skip the next section).
Text-to-speech voice models are improving rapidly
A year ago the only reliable way to add a fluid, natural-sounding voice to an LLM-powered AI agent in production was to call a model provider’s API and accept the cost, latency and vendor lock-in associated with choosing a cloud service.
Today, things look quite different. While the quality of speech that proprietary models can generate has improved tremendously, open-source models like Coqui XTTS v2.0.3, Canopy Labs’ Orpheus and Hexgrad’s Kokoro 82M have developed in lockstep: in blind tests, most listeners can’t reliably tell them apart from the incumbents.
Broadly, today's models fall into two distinct categories that serve fundamentally different purposes:
Real-time models like Cartesia Sonic, ElevenLabs Flash, and Hexgrad Kokoro prioritize streaming audio generation, producing speech as text arrives rather than waiting for complete sentences. These models excel in conversational AI, where low latency makes the difference between natural dialogue and awkward pauses. They are often architected for immediate response but may sacrifice some prosodic quality for speed.
Non-real-time models like Dia 1.6B and Coqui XTTS take the opposite approach: processing entire text passages to optimize for naturalness, emotion, and overall speech quality. They're ideal for content creation, audiobook narration, or any application where the extra processing time translates to noticeably better output quality.
This architectural difference explains why you'll see such variation in the latency figures across our comparison table — it's not just about optimization, but the models’ fundamental design and intended purpose.
Making sense of voice AI latency metrics
When evaluating model speed, you'll often encounter "TTFB" (Time To First Byte): This measures how long it takes from sending your text request to receiving the first chunk of audio data back (essentially, how quickly you hear the voice start speaking). This metric is crucial for real-time applications because it directly impacts the responsiveness users experience.
For context, human conversation typically has response delays under 200ms, so TTFB figures above this threshold start to feel unnatural in conversational AI. However, TTFB only tells part of the story: total processing time for longer passages and the consistency of streaming quality matter just as much for overall user experience. The nuances of conversational latency are a separate, deep (and fascinating) topic. More on that soon!
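To make TTFB concrete, here is a minimal sketch of how you might measure it against a streaming TTS endpoint. The URL, headers, and payload below are placeholders rather than any particular provider's API; substitute the real streaming endpoint you're evaluating.

```python
import time
import requests

# Placeholder endpoint and payload -- swap in your provider's real streaming TTS API.
TTS_STREAM_URL = "https://api.example-tts.com/v1/stream"
payload = {"text": "Hello! How can I help you today?", "voice": "default"}
headers = {"Authorization": "Bearer YOUR_API_KEY"}

start = time.perf_counter()
with requests.post(TTS_STREAM_URL, json=payload, headers=headers, stream=True) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=1024):
        if chunk:
            # Time from sending the request to receiving the first audio bytes = TTFB.
            print(f"TTFB: {(time.perf_counter() - start) * 1000:.0f} ms")
            break
```

Note that a measurement like this includes network round-trip time, so run it from the region (and under the network conditions) your agent will actually serve.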
Why great voice models don’t = great voice AI products
In short, latency and cost (the twin hurdles that kept real-time speech out of most production roadmaps) have been dramatically reduced in the last 12 months.
But, cheap, fast, high-quality voices alone don’t automatically translate into great real-time conversational products. A production-grade agent still needs to capture microphone audio, gate and normalise it, transcribe it in real time, pass clean text to an LLM or custom backend, stream the response to the chosen TTS, then return the synthesized audio with no audible gaps — all while handling disconnects, turn-taking, silence detection, regional scaling and usage accounting.
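To make that concrete, here's a heavily simplified sketch of that orchestration loop. Every component name in it (capture_microphone, detect_speech, transcribe_stream, llm_stream, tts_stream, play_audio) is a hypothetical placeholder rather than a real library; the point is how many stages have to be wired together and kept streaming without gaps.

```python
import asyncio

# All of these are hypothetical placeholders for real STT / LLM / TTS clients,
# not an actual package -- shown only to illustrate the shape of the pipeline.
from my_voice_stack import (
    capture_microphone,   # async generator of raw audio frames
    detect_speech,        # VAD: gates silence and detects end of turn
    transcribe_stream,    # streaming speech-to-text
    llm_stream,           # streams the LLM response, sentence by sentence
    tts_stream,           # streams synthesized audio chunks from the chosen TTS
    play_audio,           # plays audio chunks back with no audible gaps
)

async def conversation_loop():
    async for utterance in detect_speech(capture_microphone()):
        # 1. Transcribe the user's turn as the audio arrives.
        user_text = await transcribe_stream(utterance)

        # 2. Stream the LLM response into TTS sentence by sentence, so synthesis
        #    starts before the full reply has been generated.
        async for sentence in llm_stream(user_text):
            async for audio_chunk in tts_stream(sentence):
                await play_audio(audio_chunk)

        # Not shown: barge-in and turn-taking, disconnect handling, regional
        # scaling, and usage accounting -- each adds real complexity in production.

asyncio.run(conversation_loop())
```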
That complexity is exactly where many developers start to get a headache when looking at building production-ready voice AI, even if they have a great application in mind.
With open models now matching the leaders on speech quality and emotional range (and often beating them on speed), the main competitive frontier is infrastructure: who can deliver those voices, at scale, with the lowest latency and the least friction?
We’re building Layercode to eliminate this complexity from the equation: it handles all of the plumbing required to power production-ready, low-latency voice agents (read more about how here).
Layercode is currently in beta, and we’re working to integrate as many model providers as we can. If you are working on one of the voice models we’ve tested for this post — or one we haven’t — we’d love to explore an integration.
Comparing today’s leading text-to-speech voice models
Beyond the real-time vs. non-real-time distinction, there's significant nuance to consider when evaluating text-to-speech voice models for your specific use case.
Customer service bots and phone-based agents benefit from the ultra-low latency of real-time models like Cartesia Sonic or ElevenLabs Flash, where every millisecond of delay affects conversation flow. Content creation workflows—podcast generation, audiobook narration, or video voiceovers—can leverage the superior quality of non-real-time models like Dia 1.6B and Eleven Multilingual v2, where processing time matters less than the final output.
In our experience, new models' marketing claims don't always match our direct experience testing and using the models in production scenarios.
The comparison table below shows how these models stack up across key technical dimensions, followed by our hands-on experience with each model. Note how the real-time models cluster around the 40-200ms TTFB range, while non-real-time models prioritize quality over speed.
To keep this resource focused on the most practical concern for developers, we've ranked the subsequent list of models by overall voice quality, as experienced by end users.
ElevenLabs Flash v2.5
ElevenLabs Flash v2.5 is a popular, ultra-low-latency multilingual text-to-speech voice model and ElevenLabs' fastest — ideal for real-time voice agents. Flash v2.5 boasts sub-100ms TTFB in 30+ languages while preserving high voice quality, and it sets the bar for high-fidelity 5-second voice cloning. (A minimal usage sketch follows the table below.)
✅ Pros | ❌ Cons |
---|---|
Very fast start-up (~75 ms) | 2nd highest cost (of evaluated models) |
Best-in-class multilingual voice cloning quality | Closed ecosystem |
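For a sense of what calling Flash v2.5 looks like in practice, here's a minimal sketch against ElevenLabs' streaming text-to-speech REST endpoint. The voice ID and API key are placeholders, and request parameters can change between API versions, so treat this as illustrative and check the current ElevenLabs docs.

```python
import requests

VOICE_ID = "YOUR_VOICE_ID"  # placeholder: any voice from your ElevenLabs account
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream"

resp = requests.post(
    url,
    headers={"xi-api-key": "YOUR_API_KEY"},
    json={
        "text": "Thanks for calling! How can I help you today?",
        "model_id": "eleven_flash_v2_5",  # the Flash v2.5 model
    },
    stream=True,
)
resp.raise_for_status()

# Write the streamed audio (MP3 by default) to disk as it arrives.
with open("reply.mp3", "wb") as f:
    for chunk in resp.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
```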
OpenAI GPT-4o mini TTS
OpenAI's GPT-4o mini TTS is a text-to-speech model built on GPT-4o mini that supports 32 languages and a wide range of customizable expressions via prompting. Average TTFB hovers at just under a quarter-second.
✅ Pros | ❌ Cons |
---|---|
Tight integration with OpenAI toolchain | No voice cloning |
Prompt-level style control |
Deepgram Aura-2
Deepgram's Aura-2 is a TTS model targeting enterprise voice agents, boasting <200ms TTFB latency and simple, per-character pricing. However, it only offers two languages and doesn't support voice cloning.
✅ Pros | ❌ Cons |
---|---|
Affordable pricing for high call volumes | English and Spanish only |
Fast start-up (<200ms) | No voice cloning |
Cartesia Sonic 2.0
Cartesia's Sonic 2.0 is one of the fastest engines on the market: Turbo mode can achieve ~40ms TTFB. Sonic 2.0 also supports 15 languages out of the box and offers instant voice cloning.
✅ Pros | ❌ Cons |
---|---|
Ultra-low latency & 15 languages | Closed-source |
Instant, good quality voice cloning |
Rime Mist v2
Rime's Mist v2 TTS model delivers impressively fast on-prem performance (sub-100ms TTFB) and fast cloud-hosted performance (~175ms TTFB). The model is available in English and Spanish and offers enterprise-grade voice cloning. It's optimized for real-time business use cases and built for scale, with no concurrency limits.
✅ Pros | ❌ Cons |
---|---|
Consistently low latency & no concurrency limits | Only two languages |
Professional cloning tier for brand voices | Good but not excellent voice quality |
Rime Arcana
Arcana is Rime's newest spoken language model, emphasizing more realistic, expressive speech than Mist v2, but at slower speeds (~250ms TTFB). Emotion tags like `<laugh>` and `<sigh>` offer substantial control over voice expression (see the short example after the table below).
✅ Pros | ❌ Cons |
---|---|
Expressive, natural sounding speech | Slower than Mist or Sonic |
Highly customizable via emotion tags | Closed-source |
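To give a feel for the tag-based control, here's the kind of annotated input text this approach uses, built from the tags mentioned above. The exact syntax and full tag set are defined in Rime's documentation, so treat this as illustrative.

```text
Oh wow, I did not expect that <laugh> honestly, that's the best news I've heard all week.
Okay <sigh> let's run through the numbers one more time.
```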
Orpheus (Canopy Labs)
Canopy Labs' Orpheus TTS model is an MIT-licensed multilingual model that offers tag-based emotional control and zero-shot cloning at ~200ms TTFB latency.
✅ Pros | ❌ Cons |
---|---|
Free, open weights with permissive licence | Comparatively difficult to set up |
Solid language spread for an OSS model |
Dia 1.6B
Dia is a popular open-source TTS voice model by Nari Labs that does a very impressive job of producing high-quality, podcast-style audio featuring multiple speakers. Dia also offers voice cloning capabilities.
✅ Pros | ❌ Cons |
---|---|
Some of the most lifelike voices and pacing | No real-time support |
Fully open for research or batch synthesis | English only out of the box |
Sesame CSM-1B
Sesame CSM-1B is an Apache-2 model optimized for interactive voice agents. This open-source model is notably distinct from the model used in their live (and viral) sesame.com demo. In our testing, CSM-1B produces noticeably lower-quality and far less emotionally rich speech than that demo.
✅ Pros | ❌ Cons |
---|---|
Free & self-hostable | Not as impressive as Sesame's viral demo |
Basic voice cloning |
Coqui XTTS v2.0.3
Coqui XTTS v2.0.3 is an open-source TTS model that supports 17 languages and claims <200ms TTFB (on suitable hardware). The model also supports 3-second zero-shot voice cloning. (A minimal self-hosting sketch follows the table below.)
✅ Pros | ❌ Cons |
---|---|
Broadest language support of open-source models | Commercial use requires paid licence |
Voice cloning with only a few seconds of audio | Needs a strong GPU to achieve <200ms |
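Because XTTS is self-hosted, here's a minimal sketch of local synthesis with zero-shot cloning using Coqui's Python package. The model name and arguments follow Coqui's published examples, but they can differ between package versions, so verify against the version you install.

```python
# pip install TTS  (Coqui's Python package; XTTS weights download on first use)
from TTS.api import TTS

# Load the multilingual XTTS v2 model.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Zero-shot voice cloning from a few seconds of reference audio.
tts.tts_to_file(
    text="Hello! This is a quick test of zero-shot voice cloning.",
    speaker_wav="reference_clip.wav",  # short clip of the target voice
    language="en",
    file_path="output.wav",
)
```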
Hexgrad Kokoro 82M
Kokoro is an open-weight TTS model with 82M parameters. Kokoro is a very cost-effective model (under $1 per million characters of text input, or under $0.06 per hour of audio output).
✅ Pros | ❌ Cons |
---|---|
Ultra-light and extremely fast | No built-in cloning capability |
Very affordable hosted pricing | Smaller training set can = artifacts (rare) |
Chatterbox (Resemble AI)
Resemble AI's Chatterbox is an open-source, English TTS model that can achieve <200ms TTFB latency (on suitable hardware). Chatterbox can clone a new voice from a 5-second sample and supports simple emotion prompts.
✅ Pros | ❌ Cons |
---|---|
Open weights with permissive licence | English only for now |
A higher quality open-source option | Very young project |
Unmute (Kyutai)
Unmute combines low-latency TTS (≈ 220ms), STT and turn-taking in one MIT-licensed stack, with 10-second cloning and English/French support. The voice quality is high and testing with various accents yielded strong results.
✅ Pros | ❌ Cons |
---|---|
Complete open pipeline | Limited language coverage so far |
High-quality English voices across a range of accents | Also new: docs and tooling still evolving |
Fluxions AI
Fluxions is an open-source TTS model that targets ~200ms first-audio (with reports of 160ms when running on high-end GPUs). Fluxions includes a basic voice-cloning feature, though the clone quality substantially trails leaders like ElevenLabs.
✅ Pros | ❌ Cons |
---|---|
Free, permissive MIT licence with easy self-hosting | Voice cloning quality lags |
Good quality voice | Limited language support |
What did we miss?
If we missed a model, or you notice a detail that has changed, nuance we didn’t capture, or something else we should include in this guide, we’d love to hear from you.