Aug 21, 2025

Written and spoken language are fundamentally different in a number of ways that impact how you should approach building voice AI agents.
Consider how a human might communicate the same information:
Written: "The meeting is scheduled for 2:00 PM on February 14th, 2025, in Conference Room B."
Spoken: "We're meeting at two this afternoon in, uh, Conference Room B."
The differences run deeper than formatting. Written language is edited, polished, and permanent. Spoken language is messy, immediate, and ephemeral. It’s full of:
Filler words
Sentences that trail off or restart mid-thought
Contextual shortcuts ("that thing we discussed")
Emotional color through tone and pace
When an agent speaks with perfect grammar and formal structure, it feels unsettling or just plain “weird” to users.
Thankfully, improving your agent’s speech can be as simple as tuning your prompts.
This guide includes a range of easy-to-implement tips and example prompts to help you build LLM-powered voice agents that sound more human.
The fundamentals: From text to speech
Tell your LLM to speak, not write
This one is straightforward, but important: begin your voice agent's prompt with a single instruction that shifts the LLM's entire response pattern and reduces the chance of responses optimized for written text (e.g., URLs, email addresses spelled with @ symbols, formatting that makes no sense when spoken). This can be as simple as:
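For example, a single line like this (illustrative wording, not a magic incantation):

"You are a voice assistant. Everything you say will be read aloud by a text-to-speech engine. Respond the way a person speaks, not the way a person writes. Never output URLs, email addresses, markdown, or other screen-only formatting; describe them in plain words instead."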
Format for ears, not eyes
Spoken language is processed differently than written text, so it can be helpful to apply the ‘6th-grade reading level test’ and add clear instructions to your system prompt:
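Something like this works well (adjust to your domain):

"Use short sentences and everyday words, aiming for roughly a 6th-grade reading level. Deliver one idea per sentence. If a response wouldn't sound natural read aloud, rephrase it before answering."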
When tuning your prompts, run every response through the "speakable content" test: read it aloud. If you stumble or it sounds weird, rewrite it.
Prompt your LLM to speak like a human
Avoid robotic speech
Perfect speech sounds unnatural to humans. Directly including specific examples of natural speech patterns in your prompt can help output more natural responses. For example:
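One illustrative snippet (tune the degree of informality to your brand):

"Speak like a person, not a press release. Use contractions ('I'll', 'you're'), start sentences with 'So' or 'Well' where it feels natural, and vary your sentence length. The occasional 'um' or 'let me think' is fine. Small imperfections are welcome; perfect grammar is not the goal."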
Conversation markers that matter
Train your agent to use natural speech elements like:
Acknowledgments:
"Got it"
"I see"
"Mm-hmm"
"Right"
Transitions:
"So" (starting a new thought)
"Actually" (gentle correction)
"Oh, and" (adding information)
"By the way" (side note)
Thinking sounds:
"Let me see..."
"Hmm..."
"Well..."
You can incorporate them into your prompt like this:
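Illustrative wording (pick the markers that fit your agent):

"Use brief acknowledgments like 'Got it' or 'Right' when the user gives you information. Start new thoughts with 'So' and gentle corrections with 'Actually'. If you need a moment, say 'Let me see...' rather than going silent. Use these naturally and sparingly; don't open every turn the same way."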
Adding personality without latency
Every extra word increases response time, so it’s important to keep your system prompt short! Instead of verbose character descriptions, inject personality through word choice and speech patterns:
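Here's an illustrative example for a hypothetical upbeat retail assistant (the name and details are placeholders):

"You're Maya, a cheerful shopping assistant. Favorite words: 'love it', 'great pick', 'easy'. Keep answers under two sentences unless the user asks for more. At most one exclamation point per response. Example of your vibe: 'Great pick! Want me to check sizes?'"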
The key techniques:
Define vocabulary: Give specific words that match the personality
Set response length limits: Enthusiasm doesn't mean rambling
Use tone markers: Exclamation points translate to energetic TTS delivery
Provide example phrases: Not full scripts, just personality anchors
This approach can help give your agent a distinct personality while keeping responses snappy and latency low.
Numbers, dates, and data: the TTS minefield
Nothing breaks immersion faster than hearing "dollar sign nineteen point nine nine" or "open parenthesis five five five close parenthesis."
Set universal rules
To avoid your agent speaking numbers robotically, add clear, universal rules to your system prompt:
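For example (one possible set of rules):

"Say numbers the way a person would: '$19.99' is 'nineteen ninety-nine', '2:00 PM' is 'two p.m.', '02/14/2025' is 'February fourteenth', and phone numbers are read digit by digit with natural pauses. Never speak symbols like '$', '#', or parentheses aloud."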
TTS voice provider-specific considerations
Different TTS voice model providers handle text differently:
ElevenLabs: Enable apply_text_normalization in your API calls for automatic number handling. It's smart about context: "2024" becomes "twenty twenty-four" in dates but "two thousand twenty-four" for quantities. (See the sketch after this list.)
Cartesia: Handles acronym context automatically. "NASA" is pronounced as a word, while "FBI" is spelled out.
Rime: Supports phonetic hints for technical terms. Use <phoneme alphabet="ipa" ph="leɪtənsi">latency</phoneme> for precise pronunciation.
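To make the ElevenLabs option above concrete, here's a minimal Python sketch using their SDK. The voice and model IDs are placeholders, and it's worth confirming the exact parameter names against the current SDK docs:

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

# Ask ElevenLabs to expand currency, dates, and numbers before synthesis
audio = client.text_to_speech.convert(
    voice_id="YOUR_VOICE_ID",
    model_id="eleven_multilingual_v2",
    text="Your total is $19.99, due on 02/14/2025.",
    apply_text_normalization="on",  # "auto" lets ElevenLabs decide per request
)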
It's worth reviewing the documentation for your chosen TTS model(s) to get familiar with the nuances of how each one parses text into speech.
Beyond basics: Advanced TTS normalization
While phone numbers and dates are common stumbling blocks, voice agents encounter many other formatting challenges that can break conversational flow. Depending on what your agent is designed to do, some of these may be relevant:
Mathematical and scientific notation:
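For example:
"E = mc^2" should be spoken as "E equals M C squared"
"25°C" should become "twenty-five degrees Celsius"
"3/4 cup" should become "three quarters of a cup"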
Technical content:
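For example:
"v2.1.3" should become "version two point one point three"
"192.168.1.1" should become "one ninety-two dot one sixty-eight dot one dot one"
"config.yaml" should become "the config YAML file"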
Why model choice matters: Larger models like ElevenLabs' Multilingual v2 and v3 (alpha) handle many of these cases automatically, while faster models require explicit formatting. This reinforces why model selection isn't just about speed—it's about balancing naturalness with latency for your specific use case.
Location-aware formatting: International applications need special attention. "01/02/2023" means January 2nd in the US but February 1st in Europe. Consider adding locale hints to your prompts when serving global users.
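An illustrative locale hint:

"The user is in the UK: read '01/02/2023' as 'the first of February, twenty twenty-three' and give prices in pounds."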
Common pitfalls and production fixes
Handling awkward moments
Interruptions: This one’s important — people like to interrupt! Configure your voice agent to handle interruptions gracefully by adding this to your prompt:
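Something like:

"If the user interrupts you, stop immediately. Don't restart your sentence or repeat what you already said; briefly acknowledge the interruption and respond to the new input."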
Silence: Too much dead air can cause users to assume your agent has broken. Consider adding immediate acknowledgments, like:
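"Let me check on that..."
"One sec..."
"Good question..."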
Note: Depending on the agent library you're using, this may be something you set manually per tool call.
Avoiding "Wikipedia syndrome"
LLMs love to show off their knowledge, but this is rarely ideal. Giving them examples can help guide them to more concise responses:
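An illustrative contrast:

Instead of: "Great question! The Roth IRA, established by the Taxpayer Relief Act of 1997, is an individual retirement account offering tax-free growth, with contribution limits that..."
Better: "A Roth IRA is a retirement account you fund with money you've already paid tax on. Want me to go over the contribution limits?"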
Add to your prompt: "Give brief, conversational answers. Save details for follow-up questions."
Avoid creating an "overapologetic assistant"
Nothing sounds less confident than constant apologies:
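An illustrative contrast, plus a one-line rule you can add to your prompt:

Instead of: "I'm so sorry, I apologize for the confusion, unfortunately I don't have that information, sorry about that."
Better: "Hmm, I don't have that one in front of me. Let me find out."
Prompt rule: "Apologize at most once, then move straight to the fix."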
Remember: In voice AI, speed > intelligence
Users are more likely to forgive a slightly imperfect answer delivered quickly than a perfect answer that arrives after six seconds of awkward silence.
To minimize additional latency, we recommend choosing an LLM that prioritizes speed. In our experience, the best options today are:
Gemini 2.5 Flash - Blazing fast, good enough for most queries
GPT-4o-mini - Excellent balance of speed and capability
Remember: You can always follow up with a more detailed response. Get something conversational out quickly, then elaborate if needed!
Natural beats perfect, every time
Voice AI agents have a better chance of succeeding when they sound human, not when they sound perfect.
Users won't notice if your agent occasionally says "uh" or takes a moment to think. They will notice if it reads out "dollar sign nineteen point nine nine" or launches into a Wikipedia-style monologue. For voice AI, natural beats perfect every single time.
Not everything covered in this post will be relevant to every type of voice agent, but applying even a few of these techniques should help transform robotic agents into more natural conversationalists.
Have we missed any tips or discovered any edge cases in your own voice agent development? We'd love to hear what's worked (or hasn't) in your production deployments. Drop us a line or shoot me a message on X or LinkedIn.
Acknowledgments
This guide consolidates our team's direct experience, recommendations from voice AI builders, and general best practices from across the voice AI industry.
We're particularly grateful to the team at ElevenLabs, Nikhil Ramesh and Rime CEO Lily Clifford for documenting some terrific, actionable best practices around prompting for voice, and to Deepgram, Rime and Cartesia for additional prompting and TTS-specific formatting recommendations.
We’d also like to extend special thanks to all of the developers and companies who've shared their direct production experiences with us.