Jul 15, 2025 | Read time 4 min

Speechmatics launches world’s first bilingual voice AI models for Southeast Asia

Voice AI systems built for the region’s real world of code-switching and multilingual communication.
Yahia Abaza
Senior Product Manager

Here's the thing about conversations in Southeast Asia: they rarely stay in one language. Take Singapore, for example - someone calling into a contact center might start in Malay and pivot to English for a technical explanation, all the while mixing in words and phrases from Mandarin.

Or consider an emergency dispatcher, who needs to handle callers switching naturally between Tamil and English in a high-pressure situation.

This linguistic fluidity is how millions of people in the region communicate. And until now, voice AI has been spectacularly bad at keeping up with this reality.

The bilingual breakthrough

Until now, the industry has supported these languages only individually, without the region-specific subtleties needed for real-world Southeast Asian conversations.

Speechmatics set out to solve this with the world's first bilingual models specifically tailored for the region. Our three new models, Mandarin-English, Malay-English, and Tamil-English, deliver breakthrough performance in real-world scenarios.

The results: a more than 60% accuracy improvement for Singaporean English and a 15% improvement in code-switching scenarios over the nearest competitor.

Why general-purpose models trade accuracy for scale

The problem with general-purpose multilingual models? They may support a multitude of languages, but they treat each one as a separate entity and struggle when speakers blend them naturally.

We took a different approach. Working with local partners, we trained entirely new models on region-specific datasets that capture how people actually speak across the region.

The breakthrough lies in understanding code-switching as natural communication rather than an error to be corrected.

When someone says "other models 纸面上看起来不错 [look good on paper], but in the real world 他们跟不上 [they can't keep up]", our AI follows the conversation seamlessly.

The performance results speak for themselves:

  • 60%+ improvement for Singaporean English

  • 15% better accuracy in code-switching scenarios compared to the nearest competitor

  • Enhanced baseline performance for Malay and Tamil

  • Regional context awareness for Southeast Asian English variants

The result is voice AI that maintains high accuracy precisely because it understands how these language pairs work together.

Our specialized approach also delivers real-time transcription capabilities that general-purpose models, designed for batch processing, simply can't match.

There's no free lunch in machine learning. By focusing on specific language pairs and regional patterns, we achieve accuracy levels that broader models sacrifice for coverage.

Industries ready for change

We've been testing these models with select preview partners across emergency services, call centers, and law enforcement.

Early results show faster resolution times, fewer transcription errors, and improved customer satisfaction. A detailed case study is coming soon.

Building on our proven Spanish-English bilingual model, the Mandarin-English, Malay-English, and Tamil-English models are ready for deployment across key industries:

  • Emergency Services: Accurate transcription regardless of language switches

  • Contact Centers: Natural agent communication without quality loss

  • Media & Broadcasting: Real-time multilingual content production

  • Government: Inclusive citizen engagement across languages

For enterprise deployment, we offer on-premises options for strict data governance, HIPAA compliance for healthcare applications, and zero data retention for sensitive environments.

Available now

The Southeast Asia bilingual models are live on Speechmatics' platform today, with full API documentation and enterprise deployment support.
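For teams integrating via the API, here is a minimal sketch of submitting a batch transcription job with one of the new bilingual models, using the Python requests library. The language code and operating point shown are illustrative assumptions, so check the API documentation for the exact identifiers for the Mandarin-English, Malay-English, and Tamil-English models.

```python
import json
import requests

API_KEY = "YOUR_API_KEY"                # Speechmatics API key
AUDIO_PATH = "contact_center_call.wav"  # placeholder audio file

# Transcription config: the language code below is a placeholder for the
# Malay-English bilingual model - confirm the exact identifier in the
# Speechmatics API documentation.
config = {
    "type": "transcription",
    "transcription_config": {
        "language": "ms",               # assumed code for Malay-English
        "operating_point": "enhanced",  # higher-accuracy operating point
    },
}

# Submit a batch transcription job to the Speechmatics REST API.
response = requests.post(
    "https://asr.api.speechmatics.com/v2/jobs",
    headers={"Authorization": f"Bearer {API_KEY}"},
    files={"data_file": open(AUDIO_PATH, "rb")},
    data={"config": json.dumps(config)},
)
response.raise_for_status()
print("Job ID:", response.json()["id"])
```

For live use cases such as contact centers and emergency dispatch, the real-time API accepts a similar transcription configuration over a WebSocket connection.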

When voice AI starts appreciating local context, users feel much better understood, and agent workflows are much more efficient.

Ready to see what that looks like? Speak to our team today.

