Sep 23, 2025 | Read time 4 min

Non-English TTS still sounds like a Dalek. Here’s why and how to make text to speech sound less robotic.

Why most voices sound natural in English but still robotic in other languages, and how to fix it.
Stuart Wood, Product Manager

If you’ve tried text-to-speech recently, you’ll have noticed that English voices have come a long way in sounding natural. 

But try languages beyond English and the output often still feels robotic: clean pronunciation in short phrases, then flat delivery, odd pauses, and names that land wrong. 

It’s not the same challenge as speech-to-text, but it shares some of the same solutions. In real-time STT, you care about high-accuracy transcription across many different speakers (age, gender, language, accent). 

In TTS, you are generating speech that must feel human across languages, scripts, and local habits. Different outputs, different failure modes. 

At Speechmatics we lead in real-time STT, and we’ve put a lot of energy into building over 55 high-quality languages that handle the nuances of dialects and accents. 

In STT, you have one valid output (the words spoken); everything else is wrong. 

TTS has a different challenge: one set of words can be spoken in many ways, and all of them could be correct, depending on your expectations as a user.

Why TTS often sounds robotic in non-English speech 

Moving from English to other languages turns text-to-speech into a precision task. Dialect, tone, formatting, and pronunciation rules vary widely, and even small slips make speech sound flat or wrong. Here’s what usually breaks in non-English TTS:

Language dialects. Spanish has regional differences in pronunciation and pronoun systems. A Castilian voice reading Latin American text can feel off to a large audience. Brazilian and European Portuguese diverge in phonetics and intonation. 

Tone trouble. In Mandarin, the same syllable ma can mean “mother, hemp, horse or scold,” depending on pitch. Mandarin needs tone plus tone sandhi to sound right. 
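
To make the tone point concrete, here is a toy sketch of the best-known rule, third-tone sandhi, where a third tone before another third tone is realized as a second tone. This illustrates the rule only, not how a production front-end is built:

```python
# Toy illustration of Mandarin third-tone sandhi: a 3rd tone before
# another 3rd tone is realized as a 2nd tone, so "ni3 hao3" is spoken
# "ni2 hao3". Syllables are written with pinyin tone numbers.
def third_tone_sandhi(syllables: list[str]) -> list[str]:
    out = list(syllables)
    for i in range(len(out) - 1):
        if out[i].endswith("3") and out[i + 1].endswith("3"):
            out[i] = out[i][:-1] + "2"
    return out

print(third_tone_sandhi(["ni3", "hao3"]))  # ['ni2', 'hao3']
```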

Numbers, dates, and units. “1.234” is one point two three four in English, but can be one thousand two hundred and thirty-four in Spanish or German. Dates flip order across regions. Acronyms need expansion rules per language. 
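
You can see the problem by parsing the same string under different locale conventions. A minimal sketch using the Babel and num2words libraries (both pip-installable; exact verbalizations depend on library versions):

```python
# The same string "1.234" is a decimal in en_US but a thousands-grouped
# integer in de_DE. A TTS front-end must resolve this before synthesis.
from babel.numbers import parse_decimal   # pip install babel
from num2words import num2words           # pip install num2words

text = "1.234"
for locale, lang in [("en_US", "en"), ("de_DE", "de")]:
    value = parse_decimal(text, locale=locale)
    # en_US parses to Decimal('1.234'); de_DE parses to Decimal('1234')
    n = float(value)
    spoken = num2words(int(n) if n.is_integer() else n, lang=lang)
    print(f"{locale}: {text} -> {value} -> {spoken}")
```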

Names, places, and code-switching. Real users mix languages mid-sentence, then drop a local street name or company acronym. Cross-lingual front-ends struggle with borrowed sounds and local orthography. The result is choppy, with resets between segments. 

Script and normalization. Arabic often arrives without diacritics, Thai without spaces between words, and Chinese and many other languages mix native characters with Latin brand names. This can throw off systems designed to work well in English. 
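
One mitigation is to split text into script runs before any language-specific processing, so a Chinese sentence with a Latin brand name is not forced through a single front-end. A minimal standard-library sketch (real engines use the full Unicode Script property):

```python
import unicodedata

def script_of(ch: str) -> str:
    # Crude script tagging via Unicode character names; real systems
    # use the Unicode Script property instead.
    if ch.isspace():
        return "space"
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "han"
    if "LATIN" in name:
        return "latin"
    return "other"

def script_runs(text: str) -> list[tuple[str, str]]:
    runs: list[tuple[str, str]] = []
    for ch in text:
        tag = script_of(ch)
        if runs and runs[-1][0] == tag:
            runs[-1] = (tag, runs[-1][1] + ch)
        else:
            runs.append((tag, ch))
    return runs

print(script_runs("我在用Speechmatics做语音合成"))
# [('han', '我在用'), ('latin', 'Speechmatics'), ('han', '做语音合成')]
```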

G2P coverage and homographs. Grapheme-to-phoneme models trained on English do not transfer cleanly. Turkish vowel harmony, Hindi schwa deletion, French liaison, and European versus Brazilian Portuguese variants all need explicit handling or a lot more data.
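
The usual shape of that explicit handling is a lexicon-first pipeline: check a per-locale exception dictionary before falling back to the general G2P model. A minimal sketch; the phoneme strings below are illustrative placeholders, not verified transcriptions:

```python
# Per-locale pronunciation overrides win over the general G2P model.
# The IPA-like strings here are illustrative placeholders only.
LEXICON: dict[str, dict[str, str]] = {
    "pt-BR": {"Speechmatics": "spitʃ'matʃiks"},  # placeholder
    "pt-PT": {"Speechmatics": "ʃpitʃ'matikʃ"},   # placeholder
}

def g2p(word: str, locale: str, model) -> str:
    """Lexicon-first grapheme-to-phoneme lookup with model fallback."""
    override = LEXICON.get(locale, {}).get(word)
    return override if override is not None else model(word)

# Usage with a stub fallback model:
print(g2p("Speechmatics", "pt-BR", model=lambda w: f"<g2p:{w}>"))
```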

Why this is different to STT 

Real-time STT is a recognition problem. You optimize for high accuracy, fast transcription, noise robustness, punctuation, and speaker diarization. 

With TTS, you optimize for naturalness, local correctness, and listener comfort. Metrics change too. STT has stable accuracy measures such as word error rate (WER). TTS quality is subjective, with no deterministic tests, so providers lean on A/B tests, mean opinion scores (MOS), and task-based checks.
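
For reference, WER is just word-level edit distance divided by reference length, which is why it is so stable to compute:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[-1][-1] / len(ref)

print(wer("the cat sat on the mat", "the cat sat in the mat"))  # ≈ 0.167
```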

In short, STT is about hearing people. TTS is about being heard by people. 

What good multilingual TTS looks like 

Having seen where systems fall down, what does good look like? Multilingual TTS should sound native to the locale, respect local formatting, and handle code-switching and prosody without artifacts. Here are the essentials to demand.

  • Locale-true voices. Distinct models, voices or controls for es-MX versus es-ES, pt-BR versus pt-PT, en-IN versus en-GB. Not one global “Spanish”. 

  • Reliable text normalization. Dates, money, measurement, and list reading that match local habits. 

  • Robust G2P with dictionaries. Easy override for names and brands, with phoneme or IPA edits that stick. 

  • Code-switching support. Smooth transitions within a sentence without resets, including mixed scripts. 

  • Prosody controls. Simple sliders or SSML-like tags for rate, pitch, pauses, and emphasis that work well across languages, not just English (see the sketch after this list). 

  • Low-latency streaming that stays smooth. No audible joins, no breath jumps, consistent energy over long durations. 
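
Here is a sketch of the prosody controls mentioned above. The tags are standard W3C SSML elements, but support varies by engine, and the synthesis call is a hypothetical placeholder, not a specific vendor's API:

```python
# Standard W3C SSML elements (emphasis, break, prosody); engine support
# varies. The client call below is a hypothetical placeholder.
ssml = """<speak>
  Nuestra oficina abre <emphasis level="strong">mañana</emphasis>,
  <break time="300ms"/>
  <prosody rate="95%" pitch="+2st">a las nueve en punto</prosody>.
</speak>"""

# audio = tts_client.synthesize(ssml, voice="es-MX-standard-1")  # hypothetical
```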

How to make text to speech sound less robotic: A quick evaluation playbook 

If you are buying or integrating TTS, run these short tests for each language (a minimal harness sketch follows the list): 

  1. Find a native speaker. The best way to validate output is with a native speaker of the target language and dialect or accent.

  2. Local names and addresses. Feed five city names, five surnames, and a mixed address from your target market. Listen for stress and vowel quality, not just letter accuracy.

  3. Dates, currency, and units. Generate one paragraph with all three. Check for natural pronunciations. 

  4. Code-switching. Add an English brand into Spanish, or an Arabic proper noun into French. The read should be smooth, not segmented. 

  5. Domain words. Include acronyms and technical terms. Try a custom pronunciation. Make sure the override applies consistently and appropriately for the accent or dialect you are targeting.

  6. Low latency at scale. Create a longer paragraph and stream back thirty seconds or more of audio. Listen for audible joins, pitch drift, and unnatural breathing.
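
These checks are easy to automate so you can rerun them on every engine or voice update. A minimal harness sketch; synthesize() is a hypothetical stand-in for whichever TTS SDK you are evaluating, and a native speaker still has to score the audio:

```python
# Per-locale test sentences covering dates, currency, addresses, units,
# and code-switching. synthesize() is a hypothetical SDK call.
TEST_CASES = {
    "es-MX": [
        "La reunión es el 03/05/2026 a las 14:30.",           # dates and times
        "Cuesta 1,234.56 pesos, unos 70 USD.",                # currency, code-switch
        "Nuestra oficina está en Insurgentes Sur 1602.",      # local address
    ],
    "pt-BR": [
        "O pacote pesa 2,5 kg e chega em 3 dias úteis.",      # units
        "A sede fica na Avenida Paulista, 1578, São Paulo.",  # local address
    ],
}

def run_playbook(synthesize) -> None:
    for locale, sentences in TEST_CASES.items():
        for n, text in enumerate(sentences):
            audio = synthesize(text, locale=locale)  # hypothetical SDK call
            path = f"{locale}_{n}.wav"
            with open(path, "wb") as f:
                f.write(audio)
            print(f"[{locale}] {path}: {text}")
```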

Closing thought 

Global users notice voice quality. If it sounds robotic, they trust it less and convert less. 

Speech-to-text is ready for prime time, and we ship it every day. Text-to-speech will get there for these languages, but it needs language-specific engineering and better data. 

Start by testing real examples, measure what you hear, and have a native speaker validate the output. Your customers will hear the difference if you don’t, and it can make or break a project. 

We’re looking to solve the speech synthesis challenge at scale across languages. You can try out the Preview today and join us for the journey.

FAQs

How to make text to speech sound less robotic? Start with locale-true voices, not a single “global” voice. Make text normalization, robust G2P plus dictionaries, real code-switching, and simple prosody controls non-negotiable. Keep latency low so long reads stay smooth.

Why does non-English TTS often sound robotic? Dialects differ, tone matters in languages like Mandarin, and local formats for numbers, dates, and acronyms can trip models. Names, mixed scripts, and mid-sentence language switches add more failure points.

What makes TTS a different challenge from STT? STT recognizes a single correct word sequence and is judged by stable accuracy metrics like WER. TTS must choose a natural rendition among many valid ones, so quality skews subjective and is often measured via A/B tests and MOS.

What should I test before I buy or integrate a TTS engine? Run a quick playbook per language: local names and addresses, dates and currency, domain acronyms with custom pronunciations, code-switching inside sentences, then a 30-second streaming read to catch joins and breath artifacts. Validate with a native speaker.

What does “good” multilingual TTS look like in practice? It sounds native to the locale, handles code-switching without resets, expands acronyms correctly, and respects local formatting. You should be able to nudge rate, pitch, pauses, and emphasis consistently across languages.

Try Speechmatics TTS in Preview

Experience how natural text-to-speech can sound across languages and test our new voices today.