Apr 4, 2025 | Read time 4 min

How to build a conversational agent in less time than Cupid’s arrow takes to strike

7 ways blog
Farah Gouda
Farah GoudaData Engineer

What happens when you set out to build a fully functioning AI love guru with very little turnaround time?

You get Eros – a delightfully unhelpful chatbot, with a knack for disastrous matchmaking and over-the-top romantic advice.

In this post, I’ll walk you through how we brought Eros to life faster than a whirlwind romance – and, more importantly, how Speechmatics' unique approach to voice tech makes it possible for anyone (including you) to build voice AI.

Breaking it down into three components

At its core, a conversational AI follows a three-step process:

1) Speech to text (ASR): Capturing user input and converting it into text.

2) Language processing (LLM): Determining the AI’s response based on the input and predefined system behavior.

3) Text to speech (TTS): Converting the response back into speech for the user.

For our project, we designed a Valentine’s-themed chatbot named Eros, an AI persona trained to deliver over-the-top romantic advice and hilariously mismatched love pairings.

Step 1: Automatic Speech Recognition (ASR)

The user interacts with the agent – this could be through their phone, a website, or another medium. They might say something like, "Hello, my name is Farah."

This spoken input is processed by our ASR system, which transcribes it into text

Where traditional ASR systems might struggle with accents, background noise, or conversational speech patterns, our purpose-built architecture processes this spoken input differently. Speechmatics' ASR system, designed from the ground up to handle real-world speech variations, transcribes with remarkable accuracy across accents and environments that challenge conventional systems.

At this stage, we've successfully converted real-time speech into a format the LLM can understand – but with significantly fewer errors than other speech-to-text solutions, creating a foundation for more natural interactions.

Step 2: Large language model (LLM) processing

Once we have the transcribed text, it moves into the LLM. The LLM requires two key input:

  • The user query (prompt): The text transcribed by ASR (e.g., "Hello, my name is Farah.")

  • System context: A set of predefined instructions that guide the LLM’s behavior.

The system context is typically structured as a YAML or embedded directly into the API call. For Eros, we defined instructions like:

"Your name is Eros. You are super cheesy, you give horrible relationship advice, and you make very incompatible matches."

These elements are passed to the LLM through an API call, which can be directed to the LLM of your choice. This step really showcases our "Bring your own LLM" philosophy, where organizations can select the model that best fits their specific industry needs rather than being locked into a one-size-fits-all solution. 

Our architecture is designed to integrate seamlessly with any language model, offering enterprises unprecedented flexibility while maintaining exceptional performance and enabling completely customizable solutions that speak your industry's unique language.

So for Eros, the response might be:

"Hello, I'm the god of love! Let me tell you... you and your worst enemy? A perfect match!"

For this project, we used the simplest LLM setup-no function calls or retrieval-augmented generation (RAG).

Function calling allows an LLM to execute predefined functions in the codebase or interact with external APIs to retrieve live data, such as real-time weather updates. This is useful for AI systems that need to perform actions like booking reservations or fetching internal data. For example, a customer service bot could use function calling to check a user’s account balance.

RAG, on the other hand, enables the LLM to retrieve information from external sources like databases or documents before generating a response. A customer service bot, for instance, might use RAG to access a user’s recent order history.

Since Eros relied solely on static context without function calling or retrieval, its responses were fully determined by the predefined system instructions.

Step 3: Text-to-speech (TTS)

Once the LLM generates a response, the final step is converting it back into speech using TTS.

There are multiple TTS providers available, each offering different voice styles, tones, and accents. The choice of TTS can significantly influence the personality of the agent. For Eros, we selected a voice that exaggerated its ridiculously bad matchmaking skills – something dramatic and over-the-top.

The generated response is then played back to the user, completing the conversational loop.

The takeaway: conversational AI doesn’t have to be complex

What’s exciting about this project is how quickly you can go from concept to execution using just three core technologies. 

Whether you’re designing an enterprise-wide customer service bot or a playful AI matchmaker that absolutely shouldn’t be trusted, our approach to building voice technology differently means your conversational AI will be fast, accurate and unique.

Latest Articles

Carousel slide image
Technical

De-risk your voice agent: The 11 best voice agent testing platforms in 2026

Voice agents that pass in demos routinely fail in production. This guide covers the 11 best voice agent testing platforms in 2026, with the Five-Layer Testing Framework, platform deep dives, open-source alternatives, and a decision guide by maturity stage.

Speechmatics
SpeechmaticsEditorial Team
Carousel slide image
Technical

How to build a microbatching workflow with the Speechmatics API

Build a cleaner path between batch and real time. Learn when micro-batching makes sense, how to chunk audio, submit jobs, stitch JSON, and scale safely with the Speechmatics API.

Speechmatics
SpeechmaticsEditorial Team
Carousel slide image
Product

Alphanumeric speech recognition: why voice assistants mangle SKUs (and how to fix it)

A guide for voice AI engineers, ecommerce platforms and warehouse teams on SKU recognition accuracy voice assistant deployments depend on: why speech recognition systems produce transcription errors on product codes, what to measure when error rates matter, and the fixes that move the needle on order picking, voice ordering and customer-facing voice AI.

Speechmatics
SpeechmaticsEditorial Team
Carousel slide image
Technical

The Adobe story: How we made cloud-grade AI work on your laptop

Behind the build: what it takes to make cloud-grade speech recognition work inside Adobe Premiere, and why Whisper raised the stakes.

Andrew Innes
Andrew InnesChief Architect
Carousel slide image
Company

Adobe and Speechmatics deliver cloud-grade speech recognition on-device for Premiere

Adobe Premiere users can run the most accurate on-device transcription locally; efficient enough for a laptop, powerful enough for professional work.

Speechmatics
SpeechmaticsEditorial Team
Speechmatics x Thymia combine medical-grade speech-to-text with clinical-grade voice biomarker intelligence to identify health signals.
News

AI can now understand health signals from 15 seconds of your voice, including fatigue, stress and type 2 diabetes

The joint platform returns transcription and health signals in real time, with no additional hardware required.

Speechmatics
SpeechmaticsEditorial Team