Apr 4, 2025 | Read time 4 min

How to build a conversational agent in less time than Cupid’s arrow takes to strike

Farah Gouda, Data Engineer

What happens when you set out to build a fully functioning AI love guru with very little turnaround time?

You get Eros – a delightfully unhelpful chatbot, with a knack for disastrous matchmaking and over-the-top romantic advice.

In this post, I’ll walk you through how we brought Eros to life faster than a whirlwind romance – and, more importantly, how Speechmatics' unique approach to voice tech makes it possible for anyone (including you) to build voice AI.

Breaking it down into three components

At its core, a conversational AI follows a three-step process:

1) Speech to text (ASR): Capturing user input and converting it into text.

2) Language processing (LLM): Determining the AI’s response based on the input and predefined system behavior.

3) Text to speech (TTS): Converting the response back into speech for the user.

For our project, we designed a Valentine’s-themed chatbot named Eros, an AI persona trained to deliver over-the-top romantic advice and hilariously mismatched love pairings.
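Before unpacking each step, here's the whole loop in miniature. This is just a sketch – `transcribe`, `generate_reply`, and `synthesize` are hypothetical placeholders for the three components described below:

```python
def handle_turn(audio_in: bytes) -> bytes:
    """One conversational turn: speech in, speech out."""
    user_text = transcribe(audio_in)        # Step 1: ASR converts speech to text
    reply_text = generate_reply(user_text)  # Step 2: LLM decides what Eros says
    return synthesize(reply_text)           # Step 3: TTS turns the reply into audio
```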

Step 1: Automatic Speech Recognition (ASR)

The user interacts with the agent – this could be through their phone, a website, or another medium. They might say something like, "Hello, my name is Farah."

This spoken input is processed by our ASR system, which transcribes it into text.

Where traditional ASR systems struggle with accents, background noise, or conversational speech patterns, Speechmatics' ASR – designed from the ground up to handle real-world speech variation – transcribes accurately across accents and environments that challenge conventional systems.

At this stage, we've successfully converted real-time speech into a format the LLM can understand – but with significantly fewer errors than other speech-to-text solutions, creating a foundation for more natural interactions.
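To make this concrete, here's a minimal real-time transcription sketch using the speechmatics Python SDK (`pip install speechmatics-python`). The endpoint URL, API key placeholder, and file name are illustrative – check the current SDK docs for exact signatures:

```python
import speechmatics
from speechmatics.models import (
    AudioSettings,
    ConnectionSettings,
    ServerMessageType,
    TranscriptionConfig,
)

# Illustrative connection details – substitute your own API key.
client = speechmatics.client.WebsocketClient(
    ConnectionSettings(
        url="wss://eu2.rt.speechmatics.com/v2",
        auth_token="YOUR_API_KEY",
    )
)

# Print each finalized chunk of transcript as it arrives.
client.add_event_handler(
    event_name=ServerMessageType.AddTranscript,
    event_handler=lambda msg: print(msg["metadata"]["transcript"]),
)

# Stream an audio file as if it were live speech.
with open("hello_my_name_is_farah.wav", "rb") as audio:
    client.run_synchronously(
        audio,
        TranscriptionConfig(language="en", enable_partials=True),
        AudioSettings(),
    )
```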

Step 2: Large language model (LLM) processing

Once we have the transcribed text, it moves into the LLM. The LLM requires two key inputs:

  • The user query (prompt): The text transcribed by ASR (e.g., "Hello, my name is Farah.")

  • System context: A set of predefined instructions that guide the LLM’s behavior.

The system context is typically stored as a YAML file or embedded directly in the API call. For Eros, we defined instructions like:

"Your name is Eros. You are super cheesy, you give horrible relationship advice, and you make very incompatible matches."

These elements are passed to the LLM through an API call, which can be directed to the LLM of your choice. This step really showcases our "Bring your own LLM" philosophy, where organizations can select the model that best fits their specific industry needs rather than being locked into a one-size-fits-all solution. 

Our architecture is designed to integrate seamlessly with any language model, offering enterprises unprecedented flexibility while maintaining exceptional performance and enabling completely customizable solutions that speak your industry's unique language.
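In code, those two inputs are simply two messages in one API call. Here's a hedged sketch using OpenAI's chat API purely as one example of that "bring your own LLM" flexibility – the model name is illustrative, and any provider with a chat endpoint follows the same shape:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_CONTEXT = (
    "Your name is Eros. You are super cheesy, you give horrible "
    "relationship advice, and you make very incompatible matches."
)

def generate_reply(user_text: str) -> str:
    """Step 2: turn the transcribed user text into Eros's reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative – swap in the LLM of your choice
        messages=[
            {"role": "system", "content": SYSTEM_CONTEXT},
            {"role": "user", "content": user_text},
        ],
    )
    return response.choices[0].message.content
```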

So for Eros, the response might be:

"Hello, I'm the god of love! Let me tell you... you and your worst enemy? A perfect match!"

For this project, we used the simplest LLM setup – no function calls or retrieval-augmented generation (RAG).

Function calling allows an LLM to execute predefined functions in the codebase or interact with external APIs to retrieve live data, such as real-time weather updates. This is useful for AI systems that need to perform actions like booking reservations or fetching internal data. For example, a customer service bot could use function calling to check a user’s account balance.
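Although Eros didn't need it, adding function calling mostly means attaching a tool schema to the same API call. A sketch in the OpenAI tools format, where `get_account_balance` is a hypothetical helper from the customer-service example above:

```python
from openai import OpenAI

client = OpenAI()

# Describe the function so the model knows when and how to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_account_balance",  # hypothetical helper in your codebase
        "description": "Look up the current balance for a user's account.",
        "parameters": {
            "type": "object",
            "properties": {
                "user_id": {"type": "string", "description": "Internal user ID."},
            },
            "required": ["user_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": "What's my balance?"}],
    tools=tools,
)

# Instead of replying in text, the model may return a structured call
# (function name + JSON arguments) for your own code to execute.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```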

RAG, on the other hand, enables the LLM to retrieve information from external sources like databases or documents before generating a response. A customer service bot, for instance, might use RAG to access a user’s recent order history.
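RAG follows the same pattern: fetch the relevant context first, then prepend it to the prompt. A minimal sketch where `fetch_recent_orders` is a hypothetical lookup – in production this is often a database query or a vector-store similarity search:

```python
def answer_with_rag(client, user_id: str, question: str) -> str:
    """Retrieve context, then let the LLM answer with it in view."""
    # Hypothetical retrieval step – e.g. a database query or vector search.
    orders = fetch_recent_orders(user_id)

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[
            {"role": "system", "content": "Answer using the order history provided."},
            {"role": "system", "content": f"Recent orders: {orders}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```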

Since Eros relied solely on static context without function calling or retrieval, its responses were fully determined by the predefined system instructions.

Step 3: Text-to-speech (TTS)

Once the LLM generates a response, the final step is converting it back into speech using TTS.

There are multiple TTS providers available, each offering different voice styles, tones, and accents. The choice of TTS can significantly influence the personality of the agent. For Eros, we selected a voice that exaggerated its ridiculously bad matchmaking skills – something dramatic and over-the-top.
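As a sketch of how simple this step can be – here using OpenAI's TTS endpoint purely as one example of a provider, with an illustrative model and voice:

```python
from openai import OpenAI

client = OpenAI()

reply_text = (
    "Hello, I'm the god of love! Let me tell you... "
    "you and your worst enemy? A perfect match!"
)

# Provider, model, and voice are illustrative – the voice you choose
# does a lot of the agent's characterization for free.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
speech.stream_to_file("eros_reply.mp3")
```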

The generated response is then played back to the user, completing the conversational loop.

The takeaway: conversational AI doesn’t have to be complex

What’s exciting about this project is how quickly you can go from concept to execution using just three core technologies. 

Whether you’re designing an enterprise-wide customer service bot or a playful AI matchmaker that absolutely shouldn’t be trusted, our approach to building voice technology differently means your conversational AI will be fast, accurate and unique.

Ready to hear and see Eros in action? Trust me when I say, take his advice with a grain of salt!

