Jul 29, 2025

Voice AI

Text-to-Speech Voice Model Guide 2025

Aidan Hornsby

Anyone building voice AI agents knows how hard it is to stay up-to-date with the latest text-to-speech voice models.

We spend time testing and experimenting with all of the available paid and open-source text-to-speech voice AI models. Our goal with this post was to consolidate our own notes and experience testing different models into a single resource for developers evaluating multiple models.

If you see any missing models, or details or nuance we haven’t captured, please let us know and we’ll update the post.

For those who may be new to voice AI, we’ll start with a quick primer on the current state of text-to-speech voice AI models. (If you’re just looking for the model comparison, skip the next section).

Text-to-speech voice models are improving rapidly

A year ago the only reliable way to add a fluid, natural-sounding voice to an LLM-powered AI agent in production was to call a model provider’s API and accept the cost, latency and vendor lock-in associated with choosing a cloud service.

Today, things look quite different. While the quality of speech proprietary models can generate has improved tremendously, open-source models like Coqui XTTS v2.0.3, Canopy Labs' Orpheus and Hexgrad's Kokoro 82M have developed in lockstep: in blind tests, most listeners can't reliably separate them from the incumbents.

Broadly, today's models fall into two distinct categories that serve fundamentally different purposes: 

  1. Real-time models like Cartesia Sonic, ElevenLabs Flash, and Hexgrad Kokoro prioritize streaming audio generation, producing speech as text arrives rather than waiting for complete sentences. They excel in conversational AI, where low latency makes the difference between natural dialogue and awkward pauses. These models are typically architected for immediate response but may sacrifice some prosodic quality for speed.

  2. Non-real-time models like Dia 1.6B and Coqui XTTS take the opposite approach: processing entire text passages to optimize for naturalness, emotion, and overall speech quality. They're ideal for content creation, audiobook narration, or any application where the extra processing time translates to noticeably better output quality.

This architectural difference explains why you'll see such variation in the latency figures across our comparison table — it's not just about optimization, but the models’ fundamental design and intended purpose.
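To make the distinction concrete, here's a minimal sketch of how the two kinds of models are typically consumed. The synthesis function below is a toy stand-in, not any particular vendor's SDK:

```python
import time
from typing import Iterable, Iterator

def synthesize(text: str) -> bytes:
    """Toy stand-in for a TTS engine: cost scales with input length."""
    time.sleep(0.01 * len(text))
    return b"\x00" * (320 * len(text))  # fake 16-bit PCM audio

def batch_tts(text: str) -> bytes:
    """Non-real-time: process the whole passage, return all audio at once."""
    return synthesize(text)

def streaming_tts(text_chunks: Iterable[str]) -> Iterator[bytes]:
    """Real-time: emit audio chunk-by-chunk as text arrives (e.g. from an LLM)."""
    for chunk in text_chunks:
        yield synthesize(chunk)

# A streaming model can start playback after the first short chunk;
# a batch model makes the listener wait for the entire passage.
llm_output = ["Hello", " there!", " How can", " I help?"]
first_audio = next(streaming_tts(llm_output))
```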

Making sense of voice AI latency metrics

When evaluating model speed, you'll often encounter "TTFB" (Time To First Byte): This measures how long it takes from sending your text request to receiving the first chunk of audio data back (essentially, how quickly you hear the voice start speaking). This metric is crucial for real-time applications because it directly impacts the responsiveness users experience.

For context, human conversation typically has response delays under 200ms, so TTFB figures above this threshold start to feel unnatural in conversational AI. However, TTFB only tells part of the story: total processing time for longer passages and the consistency of streaming quality matter just as much for overall user experience. The nuances of conversational latency are a separate, deep (and fascinating) topic. More on that soon!
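If you want to measure TTFB yourself, the pattern is simple: start a timer when the request goes out and stop it when the first audio byte comes back. Here's a minimal sketch using httpx against a hypothetical streaming TTS endpoint (substitute your provider's real URL, auth headers and payload):

```python
import time
import httpx

# Hypothetical endpoint and payload -- swap in your provider's actual API.
URL = "https://api.example-tts.com/v1/stream"
PAYLOAD = {"text": "Hello! How can I help you today?", "voice": "default"}

def measure_ttfb() -> float:
    """Return milliseconds from request start to the first audio chunk."""
    start = time.perf_counter()
    with httpx.stream("POST", URL, json=PAYLOAD, timeout=30.0) as response:
        response.raise_for_status()
        for chunk in response.iter_bytes():
            if chunk:
                return (time.perf_counter() - start) * 1000
    raise RuntimeError("Stream ended before any audio arrived")

print(f"TTFB: {measure_ttfb():.0f} ms")
```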

Why great voice models don’t = great voice AI products

In short, latency and cost (the twin hurdles that kept real-time speech out of most production roadmaps) have fallen dramatically in the last 12 months.

But cheap, fast, high-quality voices alone don't automatically translate into great real-time conversational products. A production-grade agent still needs to capture microphone audio, gate and normalize it, transcribe it in real time, pass clean text to an LLM or custom backend, stream the response to the chosen TTS, then return the synthesized audio with no audible gaps — all while handling disconnects, turn-taking, silence detection, regional scaling and usage accounting.
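As a rough sketch, the skeleton of that loop looks something like this. Every component below is a hypothetical placeholder for a real STT, LLM or TTS service; the hard production work is making each stage streaming, interruptible and fault-tolerant:

```python
# All objects below (stt, llm, tts, speaker) are hypothetical placeholders.

async def voice_agent_turn(mic_audio, stt, llm, tts, speaker):
    """One conversational turn: mic -> STT -> LLM -> TTS -> speaker."""
    # Gate/normalize and transcribe incoming audio in real time.
    async for utterance in stt.transcribe(mic_audio):
        # Wait for end-of-turn (silence detection / semantic completion).
        if not utterance.is_final:
            continue
        # Stream LLM tokens straight into TTS rather than waiting for full text.
        token_stream = llm.stream(utterance.text)
        # Play synthesized chunks back with no audible gaps.
        async for audio_chunk in tts.stream(token_stream):
            await speaker.play(audio_chunk)
```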

That complexity is exactly where many developers start to get a headache when looking at building production-ready voice AI, even if they have a great application in mind.

With open models now matching the leaders on speech quality and emotional range (and often beating them on speed), the main competitive frontier is infrastructure: who can deliver those voices, at scale, with the lowest latency and the least friction?

We’re building Layercode to eliminate this complexity from the equation: Handling all of the plumbing required to power production-ready low-latency voice agents (read more about how here). 

Layercode is currently in beta, and we’re working to integrate as many model providers as we can. If you are working on one of the voice models we’ve tested for this post — or one we haven’t — we’d love to explore an integration.

Comparing today’s leading text-to-speech voice models

Beyond the real-time vs. non-real-time distinction, there's significant nuance to consider when evaluating text-to-speech voice models for your specific use case.

Customer service bots and phone-based agents benefit from the ultra-low latency of real-time models like Cartesia Sonic or ElevenLabs Flash, where every millisecond of delay affects conversation flow. Content creation workflows—podcast generation, audiobook narration, or video voiceovers—can leverage the superior quality of non-real-time models like Dia 1.6B and Eleven Multilingual v2, where processing time matters less than the final output.

In our experience, new models' marketing claims don't always match our direct experience testing and using the models in production scenarios. 

The comparison table below shows how these models stack up across key technical dimensions, followed by our hands-on experience with each model. Note how the real-time models cluster around the 40-200ms TTFB range, while non-real-time models prioritize quality over speed.

To keep this resource focused on the most practical concern for developers, we've ranked the subsequent list of models by overall voice quality, as experienced by end users.

  1. ElevenLabs Flash v2.5

ElevenLabs Flash v2.5 is a popular, ultra-low-latency multilingual text-to-speech model and ElevenLabs' fastest, making it ideal for real-time voice agents. Flash v2.5 boasts sub-100ms TTFB in 30+ languages while preserving high voice quality. The model also sets the bar for high-fidelity 5-second voice cloning.

✅ Pros

- Very fast start-up (~75 ms)
- Best-in-class multilingual voice cloning quality

❌ Cons

- 2nd highest cost (of evaluated models)
- Closed ecosystem
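For reference, a streaming request to Flash v2.5 looks roughly like this via the ElevenLabs REST API. This is a sketch based on their public docs; check the current documentation for exact fields, and substitute a real voice ID and API key:

```python
import httpx

VOICE_ID = "YOUR_VOICE_ID"  # placeholder
API_KEY = "YOUR_API_KEY"    # placeholder

url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream"
headers = {"xi-api-key": API_KEY}
payload = {"text": "Hi! How can I help?", "model_id": "eleven_flash_v2_5"}

# Write the streamed audio to disk as it arrives.
with httpx.stream("POST", url, headers=headers, json=payload, timeout=30.0) as r:
    r.raise_for_status()
    with open("reply.mp3", "wb") as f:
        for chunk in r.iter_bytes():
            f.write(chunk)
```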

  2. OpenAI GPT-4o mini TTS

OpenAI's GPT-4o mini TTS is a text-to-speech model built on GPT-4o mini that supports 32 languages and a wide range of customizable expression via prompting. Average TTFB hovers just under a quarter-second.

✅ Pros

- Tight integration with OpenAI toolchain
- Prompt-level style control

❌ Cons

- No voice cloning
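A quick sketch using the official OpenAI Python SDK; the voice name and instructions here are illustrative, so check OpenAI's docs for the current options:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The `instructions` field steers delivery (tone, pacing) via prompting.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",  # illustrative voice choice
    input="Thanks for calling! How can I help?",
    instructions="Speak warmly, at a relaxed pace.",
) as response:
    response.stream_to_file("reply.mp3")
```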


  3. Deepgram Aura-2

Deepgram's Aura-2 is a TTS model targeting enterprise voice agents, boasting <200ms TTFB and simple per-character pricing. However, it only offers two languages and doesn't support voice cloning.

✅ Pros

- Affordable pricing for high call volumes
- Fast start-up (<200ms)

❌ Cons

- English and Spanish only
- No voice cloning

  4. Cartesia Sonic 2.0

Cartesia's Sonic 2.0 is one of the fastest engines on the market; Turbo mode can achieve ~40ms TTFB. Sonic 2.0 also ships 15 realistic voices out of the box and supports instant voice cloning.

✅ Pros

- Ultra-low latency & 15 languages
- Instant, good quality voice cloning

❌ Cons

- Closed-source


  5. Rime Mist v2

Rime's Mist v2 TTS model delivers impressively fast on-prem performance (sub-100ms TTFB) and fast cloud-hosted performance (~175ms TTFB). The model is available in English and Spanish and offers enterprise-grade voice cloning. It's optimized for real-time business use and built for scale, with no concurrency limits.

✅ Pros

- Consistently low latency & no concurrency limits
- Professional cloning tier for brand voices

❌ Cons

- Only two languages
- Good but not excellent voice quality

  6. Rime Arcana

Arcana is Rime's newest spoken language model, emphasizing more realistic, expressive speech than Mist v2, but at slower speeds (~250ms TTFB). Impressive emotion tags like <laugh> and <sigh> offer substantial control over voice expression.

✅ Pros

- Expressive, natural-sounding speech
- Highly customizable via emotion tags

❌ Cons

- Slower than Mist or Sonic
- Closed-source

  7. Orpheus (Canopy Labs)

Canopy Labs' Orpheus is an MIT-licensed multilingual TTS model that offers tag-based emotional control and zero-shot cloning at ~200ms TTFB.

✅ Pros

- Free, open weights with permissive licence
- Solid language spread for an OSS model

❌ Cons

- Comparatively difficult to set up


  8. Dia 1.6B

Dia is a popular open-source TTS voice model by Nari Labs that does a very impressive job of producing high-quality, podcast-style audio featuring multiple speakers. Dia also offers voice cloning capabilities.

✅ Pros

- Some of the most lifelike voices and pacing
- Fully open for research or batch synthesis

❌ Cons

- No real-time support
- English only out of the box
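Multi-speaker generation with Dia takes only a few lines. This sketch is adapted from the Nari Labs README at the time of writing (the exact API may have changed since):

```python
import soundfile as sf
from dia.model import Dia  # per the Nari Labs README; API may have changed

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] tags mark speaker turns; Dia renders both voices in one pass.
text = "[S1] Welcome back to the show. [S2] Thanks, it's great to be here!"
audio = model.generate(text)

sf.write("podcast_clip.wav", audio, 44100)  # Dia outputs 44.1 kHz audio
```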

  9. Sesame CSM-1B

Sesame CSM-1B is an Apache-2.0-licensed model optimized for interactive voice agents. This open-source model is notably distinct from the model behind Sesame's live (and viral) sesame.com demo: in our testing, CSM-1B produces lower-quality and far less emotionally rich speech.

✅ Pros

- Free & self-hostable
- Basic voice cloning

❌ Cons

- Not as impressive as Sesame's viral demo


  10. Coqui XTTS v2.0.3

Coqui XTTS v2.0.3 is an open-source TTS model that supports 17 languages and claims <200ms TTFB (on suitable hardware). The model also supports zero-shot cloning from just 3 seconds of audio.

✅ Pros

- Broadest language support of open-source models
- Voice cloning with only a few seconds of audio

❌ Cons

- Commercial use requires paid licence
- Needs a strong GPU to achieve <200ms
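A minimal cloning sketch with the Coqui TTS Python package (model name per the Coqui docs; supply your own short reference clip):

```python
import torch
from TTS.api import TTS  # the Coqui TTS package

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Zero-shot cloning: a few seconds of reference audio define the voice.
tts.tts_to_file(
    text="Hello! This is a cloned voice speaking.",
    speaker_wav="reference_clip.wav",  # your ~3-second sample
    language="en",
    file_path="cloned_output.wav",
)
```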

  11. Hexgrad Kokoro 82M

Kokoro is an open-weight TTS model with just 82M parameters. It's also very cost-effective to run: under $1 per million characters of text input, or under $0.06 per hour of audio output.

✅ Pros

- Ultra-light and extremely fast
- Very affordable hosted pricing

❌ Cons

- No built-in cloning capability
- Smaller training set can cause rare artifacts
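Getting audio out of Kokoro takes only a few lines with the kokoro Python package. This sketch follows the Kokoro-82M model card; voice IDs like "af_heart" are examples and may change:

```python
import soundfile as sf
from kokoro import KPipeline  # per the Kokoro-82M model card

pipeline = KPipeline(lang_code="a")  # 'a' = American English

text = "Kokoro is a lightweight, open-weight text-to-speech model."
# The pipeline yields (graphemes, phonemes, audio) per generated segment.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"segment_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio
```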

  12. Chatterbox (Resemble AI)

Resemble AI's Chatterbox is an open-source English TTS model that can achieve <200ms TTFB (on suitable hardware). Chatterbox can clone a new voice from a 5-second sample and supports simple emotion prompts.

✅ Pros

- Open weights with permissive licence
- A higher-quality open-source option

❌ Cons

- English only for now
- Very young project

  13. Unmute (Kyutai)

Unmute combines low-latency TTS (≈220ms), STT and turn-taking in one MIT-licensed stack, with 10-second cloning and English/French support. The voice quality is high, and testing with various accents yielded strong results.

✅ Pros

- Complete open pipeline
- High-quality English voices across a range of accents

❌ Cons

- Limited language coverage so far
- Young project: docs and tooling still evolving

  14. Fluxions AI

Fluxions is an open-source TTS model that targets ~200ms time-to-first-audio (with reports of ~160ms on high-end GPUs). Fluxions includes a basic voice-cloning feature, though clone quality substantially trails leaders like ElevenLabs.

✅ Pros

- Free, permissive MIT licence with easy self-hosting
- Good quality voice

❌ Cons

- Voice cloning quality lags
- Limited language support

What did we miss? 

If we missed a model, or you notice a detail that has changed, nuance we didn’t capture, or something else we should include in this guide, we’d love to hear from you.

Join beta launch list

Layercode is the developer platform to easily build production-ready voice AI agents.

Layercode™ is a trademark of Layercode, Inc. All rights reserved
