Aug 21, 2025
Voice AI

How to write prompts for voice AI agents

Aidan Hornsby
Written and spoken language are fundamentally different in a number of ways that impact how you should approach building voice AI agents.

Consider how a human might communicate the same information:

Written: "The meeting is scheduled for 2:00 PM on February 14th, 2025, in Conference Room B."

Spoken: "We're meeting at two this afternoon in, uh, Conference Room B."

The differences run deeper than formatting. Written language is edited, polished, and permanent. Spoken language is messy, immediate, and ephemeral. It’s full of:

  • Filler words

  • Sentences that trail off or restart mid-thought

  • Contextual shortcuts ("that thing we discussed")

  • Emotional color through tone and pace

When an agent speaks with perfect grammar and formal structure, it feels unsettling or just plain “weird” to users.

Thankfully, improving your agent’s speech can be as simple as tuning your prompts.

This guide includes a range of easy-to-implement tips and example prompts to help you build LLM-powered voice agents that sound more human.

The fundamentals: From text to speech

Tell your LLM to speak, not write

This one is straightforward, but important: begin your voice agent’s prompt with a single instruction that shifts the LLM's entire response pattern and reduces the chance of responses containing content optimized for written text (e.g. URLs, email addresses spelled with @ symbols, or formatting that makes no sense when spoken). This can be as simple as:

You are having a spoken conversation. Your responses will be read aloud 
by a text-to-speech system. Speak naturally, as if talking to a friend.
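In practice, this instruction belongs in the system message of every LLM call your agent makes. A minimal sketch (the OpenAI-style message format is an assumption; adapt it to whatever client library you use):

```python
# Hypothetical sketch: pinning the spoken-conversation instruction
# as the system message on every conversation turn.
VOICE_SYSTEM_PROMPT = (
    "You are having a spoken conversation. Your responses will be read aloud "
    "by a text-to-speech system. Speak naturally, as if talking to a friend."
)

def build_messages(user_text: str) -> list[dict]:
    """Prepend the voice system prompt to a user turn."""
    return [
        {"role": "system", "content": VOICE_SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

# The message list is then passed to your LLM client, e.g.:
# client.chat.completions.create(model="gpt-4o-mini", messages=build_messages("Hi"))
```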

Format for ears, not eyes

Spoken language is processed differently than written text, so it can be helpful to apply the ‘6th-grade reading level test’ and add clear instructions to your system prompt:

- Use simple vocabulary and short sentences
- Never use bullet points, numbered lists, or formatted text
- Avoid parentheses, brackets, or quotation marks in speech
- If you must mention a special character, spell it out
- Never include emojis (they can't be spoken) 
- Never use symbols that have no pronunciation (@#$%^&*)

When tuning your prompts, it’s a good idea to run every response through the "speakable content" test: read it aloud. If you stumble or it sounds weird, rewrite it.
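Part of that test can also be automated with a lightweight lint pass over agent output before it reaches the TTS layer. A rough sketch (the character lists are illustrative, not exhaustive):

```python
import re

# Characters that have no natural pronunciation and usually signal
# written-text formatting leaking into speech.
UNSPEAKABLE = re.compile(r"[@#$%^&*()\[\]{}<>_`~|\\/]")
# Bullet or numbered list markers at the start of a line.
LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+\.)\s", re.MULTILINE)

def speakability_issues(text: str) -> list[str]:
    """Return a list of reasons the text may sound wrong when spoken."""
    issues = []
    if UNSPEAKABLE.search(text):
        issues.append("contains unpronounceable symbols")
    if LIST_MARKER.search(text):
        issues.append("contains bullet or numbered list markers")
    if any(ord(ch) > 0x1F000 for ch in text):  # rough emoji check
        issues.append("contains emoji")
    return issues
```

Flagged responses can be rewritten by a follow-up LLM call or rejected before synthesis.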

Prompt your LLM to speak like a human

Avoid robotic speech

Perfect speech sounds unnatural to humans. Directly including specific examples of natural speech patterns in your prompt can help the model produce more natural responses. For example:

Try to use natural speech. Don't use robotic speech. This is what I mean:
  
# Robotic:
"I have found three restaurants matching your criteria. 
  The first option is Luigi's Italian Restaurant located at 123 Main Street."
  
# Natural:
"Okay, so... I found three places that could work. 
First up is, uh, Luigi's - it's an Italian place on Main Street."

Conversation markers that matter

Train your agent to use natural speech elements like:

Acknowledgments:

  • "Got it"

  • "I see"

  • "Mm-hmm"

  • "Right"

Transitions:

  • "So" (starting a new thought)

  • "Actually" (gentle correction)

  • "Oh, and" (adding information)

  • "By the way" (side note)

Thinking sounds:

  • "Let me see..."

  • "Hmm..."

  • "Well..."

You can incorporate them into your prompt like this:

When responding:
- Start responses with acknowledgments like "Got it" or "Okay"
- Use "um" or "uh" occasionally when thinking
- Add transitions like "So" or "Actually" between thoughts
- Include phrases like "Let me check" before processing

Adding personality without latency

Every extra word increases response time, so it’s important to keep your system prompt short! Instead of verbose character descriptions, inject personality through word choice and speech patterns:

# Inefficient personality prompt:
"You are an incredibly enthusiastic, wonderfully helpful, and 
exceptionally knowledgeable pizza ordering assistant who loves 
to share your deep passion for Italian cuisine with every customer."
  
# Efficient personality prompt:
"You're a pizza expert who helps customers quickly find what they want.
- Greet with: 'Hey!' or 'Hi there!'
- Use casual language: 'awesome' instead of 'excellent'
- Show enthusiasm through short exclamations: 'Perfect!' 'Great choice!'
- Keep responses under 2 sentences unless asked for details"

The key techniques:

  • Define vocabulary: Give specific words that match the personality

  • Set response length limits: Enthusiasm doesn't mean rambling

  • Use tone markers: Exclamation points translate to energetic TTS delivery

  • Provide example phrases: Not full scripts, just personality anchors

This approach can help give your agent a distinct personality while keeping responses snappy and latency low.

Numbers, dates, and data: the TTS minefield

Nothing breaks immersion faster than hearing "dollar sign nineteen point nine nine" or "open parenthesis five five five close parenthesis."

Set universal rules

To avoid your agent speaking numbers robotically, add clear, universal rules to your system prompt:

# Phone numbers:
Format: (555) 123-4567
Speak as: "five five five, one two three, four five six seven"

# Money:
Format: $19.99
Speak as: "nineteen dollars and ninety-nine cents"

# Dates:
Format: 02/14/2025
Speak as: "February fourteenth, twenty twenty-five"

# Times:
Format: 3:30 PM
Speak as: "three thirty in the afternoon"

# Email addresses:
Format: john@company.com
Speak as: "john at company dot com"
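These rules can also be enforced deterministically in code before text reaches the TTS engine, as a safety net for when the LLM slips. A minimal sketch for phone numbers (money and dates follow the same pattern, typically with a number-to-words library for larger values):

```python
import re

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def speak_phone(match: re.Match) -> str:
    """Render a (555) 123-4567 style number as grouped spoken digits."""
    groups = re.findall(r"\d+", match.group(0))
    return ", ".join(" ".join(DIGIT_WORDS[d] for d in g) for g in groups)

def normalize_for_tts(text: str) -> str:
    """Rewrite US-style phone numbers into speakable form before TTS."""
    phone = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")
    return phone.sub(speak_phone, text)
```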

TTS voice provider-specific considerations

Different TTS voice model providers handle text differently:

  • ElevenLabs: Enable apply_text_normalization in your API calls for automatic number handling. It's smart about context: "2024" becomes "twenty twenty-four" in dates but "two thousand twenty-four" for quantities.

  • Cartesia: Handles acronym context automatically. "NASA" is pronounced as a word, while "FBI" is spelled out.

  • Rime: Supports phonetic hints for technical terms. Use <phoneme alphabet="ipa" ph="leɪtənsi">latency</phoneme> for precise pronunciation.

It’s worth reviewing the documentation for your chosen TTS model(s) to get familiar with the nuances of how they parse text into speech.
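For example, ElevenLabs exposes normalization as a per-request setting. A hedged sketch of the request body (field names follow their text-to-speech API at the time of writing; verify against the current docs before relying on them):

```python
# Hedged sketch: building an ElevenLabs text-to-speech request body
# with text normalization enabled.
def build_tts_payload(text: str) -> dict:
    return {
        "text": text,
        "model_id": "eleven_multilingual_v2",
        "apply_text_normalization": "on",  # "auto" lets the model decide
    }

payload = build_tts_payload("Your total is $19.99.")
# payload is then POSTed to the /v1/text-to-speech/{voice_id} endpoint
```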

Beyond basics: Advanced TTS normalization

While phone numbers and dates are common stumbling blocks, voice agents encounter many other formatting challenges that can break the conversational flow. Depending on what your agent is designed to do, these could be relevant. Some examples:

Mathematical and scientific notation:

# Fractions:
Format: 2/3
Speak as: "two-thirds"

# Roman numerals (context matters):
Format: Chapter XIV
Speak as: "Chapter fourteen"
Format: Queen Elizabeth II
Speak as: "Queen Elizabeth the second"

# Units and measurements:
Format: 100km
Speak as: "one hundred kilometers"
Format: 5GB
Speak as: "five gigabytes" (not "five G B")

Technical content:

# Keyboard shortcuts:
Format: Ctrl+Z
Speak as: "control Z"

# URLs (keep it natural):
Format: example.com/docs/guide
Speak as: "example dot com slash docs slash guide"

# File paths:
Format: C:\Users\Documents
Speak as: "C drive, Users folder, Documents folder"

Why model choice matters: Larger models like ElevenLabs' Multilingual v2 and v3 (alpha) handle many of these cases automatically, while faster models require explicit formatting. This reinforces why model selection isn't just about speed—it's about balancing naturalness with latency for your specific use case.

Location-aware formatting: International applications need special attention. "01/02/2023" means January 2nd in the US but February 1st in Europe. Consider adding locale hints to your prompts when serving global users.
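When locale information is available at runtime, this ambiguity can also be resolved in code before the date ever reaches the prompt. A simplified sketch (ordinal wording and number-to-words are left to the TTS layer):

```python
# Illustrative sketch: expanding an ambiguous numeric date using a
# day-first flag derived from the user's locale.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def speak_date(date_str: str, day_first: bool = False) -> str:
    """Expand 'MM/DD/YYYY' (or 'DD/MM/YYYY' if day_first) to spoken form."""
    a, b, year = date_str.split("/")
    day, month = (a, b) if day_first else (b, a)
    return f"{MONTHS[int(month) - 1]} {int(day)}, {year}"
```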

Common pitfalls and production fixes

Handling awkward moments

Interruptions: This one’s important — people like to interrupt! Configure your voice agent to handle interruptions gracefully by adding this to your prompt:

If interrupted, acknowledge briefly with 'Oh, sorry, go ahead' and let the user speak

Silence: Too much dead air can cause users to assume your agent has broken. Consider adding immediate acknowledgments, like:

If you need time to process or call a function:

1. First say: 'Let me check that for you...'
2. Then do the processing
3. Return with: 'Okay, I found...'

Note: Depending on the agent library you're using, this may be something you set manually per tool call.
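If your framework supports streamed or multi-part responses, the same pattern can be sketched in code: emit the filler phrase immediately, then run the slow tool call. Function names here are illustrative:

```python
import time
from typing import Iterator

def slow_lookup(query: str) -> str:
    """Stand-in for a slow function or tool call."""
    time.sleep(0.1)  # simulate latency
    return f"Okay, I found results for {query}."

def respond_with_filler(query: str) -> Iterator[str]:
    """Yield an immediate acknowledgment, then the real answer."""
    yield "Let me check that for you..."  # spoken while the tool runs
    yield slow_lookup(query)
```

Each yielded chunk can be sent to the TTS engine as soon as it's available, so the user hears the acknowledgment while the lookup is still in flight.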

Avoiding "Wikipedia syndrome"

LLMs love to show off their knowledge, but long encyclopedic answers rarely work in spoken conversation. Giving them examples can help guide them toward more concise responses:

# Bad: 
"Paris is the capital and most populous city of France, 
with an estimated population of 2,165,423 residents as 
of January 1, 2023, in an area of 105.4 square kilometers..."
  
# Good:
"Paris? It's France's capital... about two million people 
live there. Beautiful city."

Add to your prompt: "Give brief, conversational answers. Save details for follow-up questions."

Avoid creating an "overapologetic assistant"

Nothing sounds less confident than constant apologies:

Limit apologies:
- Maximum one "sorry" per conversation
- Replace "I'm sorry" with action: "Let me fix that"
- Don't apologize for system limitations, offer alternatives

Remember: In voice AI, speed > intelligence

Users are more likely to forgive a slightly imperfect answer delivered quickly than a perfect answer that arrives after six seconds of awkward silence.

To minimize additional latency, we recommend choosing an LLM that prioritizes speed. In our experience, the best options today are:

  1. Gemini 2.5 Flash - Blazing fast, good enough for most queries

  2. GPT-4o-mini - Excellent balance of speed and capability

Remember: You can always follow up with a more detailed response. Get something conversational out quickly, then elaborate if needed!

Natural beats perfect, every time

Voice AI agents have a better chance of succeeding when they sound human, not when they sound perfect.

Users won't notice if your agent occasionally says "uh" or takes a moment to think. They will notice if it reads out "dollar sign nineteen point nine nine" or launches into a Wikipedia-style monologue. For voice AI, natural beats perfect every single time.

Not everything covered in this post will be relevant to every type of voice agent, but applying even a few of these techniques should help transform robotic agents into more natural conversationalists.

Have we missed any tips or discovered any edge cases in your own voice agent development? We'd love to hear what's worked (or hasn't) in your production deployments. Drop us a line or shoot me a message on X or LinkedIn.

Acknowledgments

This guide consolidates our team’s direct experience, recommendations from voice AI builders, and general best practices from across the voice AI industry.

We're particularly grateful to the team at ElevenLabs, Nikhil Ramesh, and Rime CEO Lily Clifford for documenting some terrific, actionable best practices around prompting for voice, and to Deepgram, Rime, and Cartesia for additional prompting and TTS-specific formatting recommendations.

We’d also like to extend special thanks to all of the developers and companies who've shared their direct production experiences with us.


Layercode is the developer platform to easily build production-ready voice AI agents.

Layercode™ is a trademark of Layercode, Inc. All rights reserved

Follow Layercode
