Apr 4, 2025 | Read time 4 min

How to build a conversational agent in less time than Cupid’s arrow takes to strike

Farah Gouda, Data Engineer

What happens when you set out to build a fully functioning AI love guru with very little turnaround time?

You get Eros – a delightfully unhelpful chatbot, with a knack for disastrous matchmaking and over-the-top romantic advice.

In this post, I’ll walk you through how we brought Eros to life faster than a whirlwind romance – and, more importantly, how Speechmatics' unique approach to voice tech makes it possible for anyone (including you) to build voice AI.

Breaking it down into three components

At its core, a conversational AI follows a three-step process:

1) Speech to text (ASR): Capturing user input and converting it into text.

2) Language processing (LLM): Determining the AI’s response based on the input and predefined system behavior.

3) Text to speech (TTS): Converting the response back into speech for the user.

For our project, we designed a Valentine’s-themed chatbot named Eros, an AI persona trained to deliver over-the-top romantic advice and hilariously mismatched love pairings.
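Before unpacking each step, here's the whole loop in miniature. This is just a sketch – `transcribe`, `generate_reply`, and `synthesize` are hypothetical placeholders for the three components described below:

```python
def handle_turn(audio_in: bytes) -> bytes:
    """One conversational turn: speech in, speech out."""
    user_text = transcribe(audio_in)        # Step 1: ASR converts speech to text
    reply_text = generate_reply(user_text)  # Step 2: LLM decides what Eros says
    return synthesize(reply_text)           # Step 3: TTS turns the reply into audio
```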

Step 1: Automatic Speech Recognition (ASR)

The user interacts with the agent – this could be through their phone, a website, or another medium. They might say something like, "Hello, my name is Farah."

This spoken input is processed by our ASR system, which transcribes it into text.

Where traditional ASR systems struggle with accents, background noise, or conversational speech patterns, Speechmatics' ASR – designed from the ground up to handle real-world speech variation – transcribes accurately across accents and environments that challenge conventional systems.

At this stage, we've successfully converted real-time speech into a format the LLM can understand – but with significantly fewer errors than other speech-to-text solutions, creating a foundation for more natural interactions.
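To make this concrete, here's a minimal real-time transcription sketch using the speechmatics Python SDK (`pip install speechmatics-python`). The endpoint URL, API key placeholder, and file name are illustrative – check the current SDK docs for exact signatures:

```python
import speechmatics
from speechmatics.models import (
    AudioSettings,
    ConnectionSettings,
    ServerMessageType,
    TranscriptionConfig,
)

# Illustrative connection details – substitute your own API key.
client = speechmatics.client.WebsocketClient(
    ConnectionSettings(
        url="wss://eu2.rt.speechmatics.com/v2",
        auth_token="YOUR_API_KEY",
    )
)

# Print each finalized chunk of transcript as it arrives.
client.add_event_handler(
    event_name=ServerMessageType.AddTranscript,
    event_handler=lambda msg: print(msg["metadata"]["transcript"]),
)

# Stream an audio file as if it were live speech.
with open("hello_my_name_is_farah.wav", "rb") as audio:
    client.run_synchronously(
        audio,
        TranscriptionConfig(language="en", enable_partials=True),
        AudioSettings(),
    )
```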

Step 2: Large language model (LLM) processing

Once we have the transcribed text, it moves into the LLM. The LLM requires two key inputs:

  • The user query (prompt): The text transcribed by ASR (e.g., "Hello, my name is Farah.")

  • System context: A set of predefined instructions that guide the LLM’s behavior.

The system context is typically stored as a YAML file or embedded directly in the API call. For Eros, we defined instructions like:

"Your name is Eros. You are super cheesy, you give horrible relationship advice, and you make very incompatible matches."

These elements are passed to the LLM through an API call, which can be directed to the LLM of your choice. This step really showcases our "Bring your own LLM" philosophy, where organizations can select the model that best fits their specific industry needs rather than being locked into a one-size-fits-all solution. 

Our architecture is designed to integrate seamlessly with any language model, offering enterprises unprecedented flexibility while maintaining exceptional performance and enabling completely customizable solutions that speak your industry's unique language.
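In code, those two inputs are simply two messages in one API call. Here's a hedged sketch using OpenAI's chat API purely as one example of that "bring your own LLM" flexibility – the model name is illustrative, and any provider with a chat endpoint follows the same shape:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_CONTEXT = (
    "Your name is Eros. You are super cheesy, you give horrible "
    "relationship advice, and you make very incompatible matches."
)

def generate_reply(user_text: str) -> str:
    """Step 2: turn the transcribed user text into Eros's reply."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative – swap in the LLM of your choice
        messages=[
            {"role": "system", "content": SYSTEM_CONTEXT},
            {"role": "user", "content": user_text},
        ],
    )
    return response.choices[0].message.content
```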

So for Eros, the response might be:

"Hello, I'm the god of love! Let me tell you... you and your worst enemy? A perfect match!"

For this project, we used the simplest LLM setup – no function calls or retrieval-augmented generation (RAG).

Function calling allows an LLM to execute predefined functions in the codebase or interact with external APIs to retrieve live data, such as real-time weather updates. This is useful for AI systems that need to perform actions like booking reservations or fetching internal data. For example, a customer service bot could use function calling to check a user’s account balance.
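Although Eros didn't need it, adding function calling mostly means attaching a tool schema to the same API call. A sketch in the OpenAI tools format, where `get_account_balance` is a hypothetical helper from the customer-service example above:

```python
from openai import OpenAI

client = OpenAI()

# Describe the function so the model knows when and how to call it.
tools = [{
    "type": "function",
    "function": {
        "name": "get_account_balance",  # hypothetical helper in your codebase
        "description": "Look up the current balance for a user's account.",
        "parameters": {
            "type": "object",
            "properties": {
                "user_id": {"type": "string", "description": "Internal user ID."},
            },
            "required": ["user_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative
    messages=[{"role": "user", "content": "What's my balance?"}],
    tools=tools,
)

# Instead of replying in text, the model may return a structured call
# (function name + JSON arguments) for your own code to execute.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```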

RAG, on the other hand, enables the LLM to retrieve information from external sources like databases or documents before generating a response. A customer service bot, for instance, might use RAG to access a user’s recent order history.
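RAG follows the same pattern: fetch the relevant context first, then prepend it to the prompt. A minimal sketch where `fetch_recent_orders` is a hypothetical lookup – in production this is often a database query or a vector-store similarity search:

```python
def answer_with_rag(client, user_id: str, question: str) -> str:
    """Retrieve context, then let the LLM answer with it in view."""
    # Hypothetical retrieval step – e.g. a database query or vector search.
    orders = fetch_recent_orders(user_id)

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[
            {"role": "system", "content": "Answer using the order history provided."},
            {"role": "system", "content": f"Recent orders: {orders}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```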

Since Eros relied solely on static context without function calling or retrieval, its responses were fully determined by the predefined system instructions.

Step 3: Text-to-speech (TTS)

Once the LLM generates a response, the final step is converting it back into speech using TTS.

There are multiple TTS providers available, each offering different voice styles, tones, and accents. The choice of TTS can significantly influence the personality of the agent. For Eros, we selected a voice that exaggerated its ridiculously bad matchmaking skills – something dramatic and over-the-top.
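As a sketch of how simple this step can be – here using OpenAI's TTS endpoint purely as one example of a provider, with an illustrative model and voice:

```python
from openai import OpenAI

client = OpenAI()

reply_text = (
    "Hello, I'm the god of love! Let me tell you... "
    "you and your worst enemy? A perfect match!"
)

# Provider, model, and voice are illustrative – the voice you choose
# does a lot of the agent's characterization for free.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
speech.stream_to_file("eros_reply.mp3")
```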

The generated response is then played back to the user, completing the conversational loop.

The takeaway: conversational AI doesn’t have to be complex

What’s exciting about this project is how quickly you can go from concept to execution using just three core technologies. 

Whether you’re designing an enterprise-wide customer service bot or a playful AI matchmaker that absolutely shouldn’t be trusted, our approach to building voice technology differently means your conversational AI will be fast, accurate and unique.

Ready to hear and see Eros in action? Trust me when I say, take his advice with a grain of salt!

