Aug 21, 2025

Written and spoken language are fundamentally different in a number of ways that impact how you should approach building voice AI agents.
Consider how a human might communicate the same information:
Written: "The meeting is scheduled for 2:00 PM on February 14th, 2025, in Conference Room B."
Spoken: "We're meeting at two this afternoon in, uh, Conference Room B."
The differences run deeper than formatting. Written language is edited, polished, and permanent. Spoken language is messy, immediate, and ephemeral. It’s full of:
Filler words
Sentences that trail off or restart mid-thought
Contextual shortcuts ("that thing we discussed")
Emotional color through tone and pace
When an agent speaks with perfect grammar and formal structure, it feels unsettling or just plain “weird” to users.
Thankfully, improving your agent’s speech can be as simple as tuning your prompts.
This guide includes a range of easy-to-implement tips and example prompts to help you build LLM-powered voice agents that sound more human.
The fundamentals: From text to speech
Tell your LLM to speak, not write
This one is straightforward, but important: begin your voice agent's prompt with a single instruction that shifts the LLM's entire response pattern and reduces the chance of responses optimized for written text (e.g., URLs, email addresses spelled with @ symbols, formatting that makes no sense when spoken). This can be as simple as:
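For example, a single line like this (illustrative wording, not a magic incantation):

"You are a voice assistant. Everything you say will be read aloud by a text-to-speech engine. Respond the way a person speaks, not the way a person writes. Never output URLs, email addresses, markdown, or other screen-only formatting; describe them in plain words instead."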
Format for ears, not eyes
Spoken language is processed differently than written text, so it can be helpful to apply the ‘6th-grade reading level test’ and add clear instructions to your system prompt:
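Something like this works well (adjust to your domain):

"Use short sentences and everyday words, aiming for roughly a 6th-grade reading level. Deliver one idea per sentence. If a response wouldn't sound natural read aloud, rephrase it before answering."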
When tuning your prompts, run every response through the "speakable content" test: read it aloud. If you stumble or it sounds weird, rewrite it.
Prompt your LLM to speak like a human
Avoid robotic speech
Perfect speech sounds unnatural to humans. Directly including specific examples of natural speech patterns in your prompt can help output more natural responses. For example:
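One illustrative snippet (tune the degree of informality to your brand):

"Speak like a person, not a press release. Use contractions ('I'll', 'you're'), start sentences with 'So' or 'Well' where it feels natural, and vary your sentence length. The occasional 'um' or 'let me think' is fine. Small imperfections are welcome; perfect grammar is not the goal."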
Conversation markers that matter
Train your agent to use natural speech elements like:
Acknowledgments:
"Got it"
"I see"
"Mm-hmm"
"Right"
Transitions:
"So" (starting a new thought)
"Actually" (gentle correction)
"Oh, and" (adding information)
"By the way" (side note)
Thinking sounds:
"Let me see..."
"Hmm..."
"Well..."
You can incorporate them into your prompt like this:
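Illustrative wording (pick the markers that fit your agent):

"Use brief acknowledgments like 'Got it' or 'Right' when the user gives you information. Start new thoughts with 'So' and gentle corrections with 'Actually'. If you need a moment, say 'Let me see...' rather than going silent. Use these naturally and sparingly; don't open every turn the same way."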
Adding personality without latency
Every extra word increases response time, so it’s important to keep your system prompt short! Instead of verbose character descriptions, inject personality through word choice and speech patterns:
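Here's an illustrative example for a hypothetical upbeat retail assistant (the name and details are placeholders):

"You're Maya, a cheerful shopping assistant. Favorite words: 'love it', 'great pick', 'easy'. Keep answers under two sentences unless the user asks for more. At most one exclamation point per response. Example of your vibe: 'Great pick! Want me to check sizes?'"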
The key techniques:
Define vocabulary: Give specific words that match the personality
Set response length limits: Enthusiasm doesn't mean rambling
Use tone markers: Exclamation points translate to energetic TTS delivery
Provide example phrases: Not full scripts, just personality anchors
This approach can help give your agent a distinct personality while keeping responses snappy and latency low.
Numbers, dates, and data: the TTS minefield
Nothing breaks immersion faster than hearing "dollar sign nineteen point nine nine" or "open parenthesis five five five close parenthesis."
Set universal rules
To avoid your agent speaking numbers robotically, add clear, universal rules to your system prompt:
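For example (one possible set of rules):

"Say numbers the way a person would: '$19.99' is 'nineteen ninety-nine', '2:00 PM' is 'two p.m.', '02/14/2025' is 'February fourteenth', and phone numbers are read digit by digit with natural pauses. Never speak symbols like '$', '#', or parentheses aloud."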
TTS voice provider-specific considerations
Different TTS voice model providers handle text differently:
ElevenLabs: Enable apply_text_normalization in your API calls for automatic number handling. It's smart about context: "2024" becomes "twenty twenty-four" in dates but "two thousand twenty-four" for quantities. (See the sketch after this list.)
Cartesia: Handles acronym context automatically. "NASA" is pronounced as a word, while "FBI" is spelled out.
Rime: Supports phonetic hints for technical terms. Use <phoneme alphabet="ipa" ph="leɪtənsi">latency</phoneme> for precise pronunciation.
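To make the ElevenLabs option above concrete, here's a minimal Python sketch using their SDK. The voice and model IDs are placeholders, and it's worth confirming the exact parameter names against the current SDK docs:

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

# Ask ElevenLabs to expand currency, dates, and numbers before synthesis
audio = client.text_to_speech.convert(
    voice_id="YOUR_VOICE_ID",
    model_id="eleven_multilingual_v2",
    text="Your total is $19.99, due on 02/14/2025.",
    apply_text_normalization="on",  # "auto" lets ElevenLabs decide per request
)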
It's worth reviewing the documentation for your chosen TTS model(s) to get familiar with the nuances of how each one parses text into speech.
Beyond basics: Advanced TTS normalization
While phone numbers and dates are common stumbling blocks, voice agents encounter many other formatting challenges that can break conversational flow. Depending on what your agent is designed to do, some of these may be relevant:
Mathematical and scientific notation:
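For example:
"E = mc^2" should be spoken as "E equals M C squared"
"25°C" should become "twenty-five degrees Celsius"
"3/4 cup" should become "three quarters of a cup"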
Technical content:
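For example:
"v2.1.3" should become "version two point one point three"
"192.168.1.1" should become "one ninety-two dot one sixty-eight dot one dot one"
"config.yaml" should become "the config YAML file"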
Why model choice matters: Larger models like ElevenLabs' Multilingual v2 and v3 (alpha) handle many of these cases automatically, while faster models require explicit formatting. This reinforces why model selection isn't just about speed—it's about balancing naturalness with latency for your specific use case.
Location-aware formatting: International applications need special attention. "01/02/2023" means January 2nd in the US but February 1st in Europe. Consider adding locale hints to your prompts when serving global users.
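An illustrative locale hint:

"The user is in the UK: read '01/02/2023' as 'the first of February, twenty twenty-three' and give prices in pounds."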
Common pitfalls and production fixes
Handling awkward moments
Interruptions: This one’s important — people like to interrupt! Configure your voice agent to handle interruptions gracefully by adding this to your prompt:
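Something like:

"If the user interrupts you, stop immediately. Don't restart your sentence or repeat what you already said; briefly acknowledge the interruption and respond to the new input."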
Silence: Too much dead air can cause users to assume your agent has broken. Consider adding immediate acknowledgments, like:
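"Let me check on that..."
"One sec..."
"Good question..."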
Note: Depending on the agent library you're using, this may be something you set manually per tool call.
Avoiding "Wikipedia syndrome"
LLMs love to show off their knowledge, but this is rarely ideal. Giving them examples can help guide them to more concise responses:
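An illustrative contrast:

Instead of: "Great question! The Roth IRA, established by the Taxpayer Relief Act of 1997, is an individual retirement account offering tax-free growth, with contribution limits that..."
Better: "A Roth IRA is a retirement account you fund with money you've already paid tax on. Want me to go over the contribution limits?"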
Add to your prompt: "Give brief, conversational answers. Save details for follow-up questions."
Avoid creating an "overapologetic assistant"
Nothing sounds less confident than constant apologies:
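An illustrative contrast, plus a one-line rule you can add to your prompt:

Instead of: "I'm so sorry, I apologize for the confusion, unfortunately I don't have that information, sorry about that."
Better: "Hmm, I don't have that one in front of me. Let me find out."
Prompt rule: "Apologize at most once, then move straight to the fix."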
Remember: In voice AI, speed > intelligence
Users are more likely to forgive a slightly imperfect answer delivered quickly than a perfect answer that arrives after six seconds of awkward silence.
To minimize additional latency, we recommend choosing an LLM that prioritizes speed. In our experience, the best options today are:
Gemini 2.5 Flash - Blazing fast, good enough for most queries
GPT-4o-mini - Excellent balance of speed and capability
Remember: You can always follow up with a more detailed response. Get something conversational out quickly, then elaborate if needed!
Natural beats perfect, every time
Voice AI agents have a better chance of succeeding when they sound human, not when they sound perfect.
Users won't notice if your agent occasionally says "uh" or takes a moment to think. They will notice if it reads out "dollar sign nineteen point nine nine" or launches into a Wikipedia-style monologue. For voice AI, natural beats perfect every single time.
Not everything covered in this post will be relevant to every type of voice agent, but applying even a few of these techniques should help transform robotic agents into more natural conversationalists.
Have we missed any tips or discovered any edge cases in your own voice agent development? We'd love to hear what's worked (or hasn't) in your production deployments. Drop us a line or shoot me a message on X or LinkedIn.
Acknowledgments
This guide consolidates our team's direct experience, recommendations from voice AI builders, and general best practices from across the voice AI industry.
We're particularly grateful to the team at ElevenLabs, Nikhil Ramesh and Rime CEO Lily Clifford for documenting some terrific, actionable best practices around prompting for voice, and to Deepgram, Rime and Cartesia for additional prompting and TTS-specific formatting recommendations.
We’d also like to extend special thanks to all of the developers and companies who've shared their direct production experiences with us.