Apr 4, 2025 | Read time 4 min

How to build a conversational agent in less time than Cupid’s arrow takes to strike

Farah Gouda, Data Engineer

What happens when you set out to build a fully functioning AI love guru with very little turnaround time?

You get Eros – a delightfully unhelpful chatbot, with a knack for disastrous matchmaking and over-the-top romantic advice.

In this post, I’ll walk you through how we brought Eros to life faster than a whirlwind romance – and, more importantly, how Speechmatics' unique approach to voice tech makes it possible for anyone (including you) to build voice AI.

Breaking it down into three components

At its core, a conversational AI follows a three-step process:

1) Speech to text (ASR): Capturing user input and converting it into text.

2) Language processing (LLM): Determining the AI’s response based on the input and predefined system behavior.

3) Text to speech (TTS): Converting the response back into speech for the user.
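Before diving into each step, the whole loop can be sketched in a few lines of Python. This is a minimal sketch with stand-in functions (none of these are real provider APIs); in a real build, each stub is replaced by a call to your ASR, LLM, or TTS service of choice:

```python
def transcribe(audio: bytes) -> str:
    """Stand-in for a real ASR call; a live system would stream audio
    to a speech-to-text service and collect the transcript."""
    return audio.decode("utf-8")  # pretend the "audio" is already text


def generate_reply(prompt: str) -> str:
    """Stand-in for an LLM call made with Eros's system context."""
    return f"Hello, I'm the god of love! You said: {prompt}"


def synthesize(reply: str) -> bytes:
    """Stand-in for a TTS call that would return playable audio."""
    return reply.encode("utf-8")


def conversation_turn(audio_in: bytes) -> bytes:
    """One full turn of the loop: ASR -> LLM -> TTS."""
    text = transcribe(audio_in)     # 1) speech to text
    reply = generate_reply(text)    # 2) language processing
    return synthesize(reply)        # 3) text to speech
```

The rest of this post walks through what goes inside each of those three stubs.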

For our project, we designed a Valentine’s-themed chatbot named Eros, an AI persona trained to deliver over-the-top romantic advice and hilariously mismatched love pairings.

Step 1: Automatic Speech Recognition (ASR)

The user interacts with the agent – this could be through their phone, a website, or another medium. They might say something like, "Hello, my name is Farah."

This spoken input is processed by our ASR system, which transcribes it into text.

Where traditional ASR systems might struggle with accents, background noise, or conversational speech patterns, our purpose-built architecture processes this spoken input differently. Speechmatics' ASR system, designed from the ground up to handle real-world speech variations, transcribes with remarkable accuracy across accents and environments that challenge conventional systems.

At this stage, we've successfully converted real-time speech into a format the LLM can understand – but with significantly fewer errors than other speech-to-text solutions, creating a foundation for more natural interactions.
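One practical detail when wiring this up: streaming ASR systems typically emit provisional "partial" transcripts that are later superseded by "final" ones, and the application needs to decide which to keep. Here's a minimal sketch of a consumer for such events; the event shape is hypothetical, so adapt the field names to whatever your ASR client actually delivers:

```python
class TranscriptBuffer:
    """Collects transcripts from a streaming ASR session.

    Partial results arrive while the user is still speaking and are
    later replaced, so only final results are kept (a common pattern
    in real-time speech-to-text APIs).
    """

    def __init__(self) -> None:
        self.finals: list[str] = []

    def on_event(self, event: dict) -> None:
        """Handle one transcription event from the ASR stream."""
        if event["type"] == "final":
            self.finals.append(event["text"])
        # "partial" events could update a live caption, but are not stored

    def text(self) -> str:
        """The stable transcript to hand to the LLM."""
        return " ".join(self.finals)
```

Once the user stops speaking, `text()` gives you the stable utterance to pass downstream.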

Step 2: Large language model (LLM) processing

Once we have the transcribed text, it moves into the LLM. The LLM requires two key inputs:

  • The user query (prompt): The text transcribed by ASR (e.g., "Hello, my name is Farah.")

  • System context: A set of predefined instructions that guide the LLM’s behavior.

The system context is typically structured as a YAML file or embedded directly into the API call. For Eros, we defined instructions like:

"Your name is Eros. You are super cheesy, you give horrible relationship advice, and you make very incompatible matches."

These elements are passed to the LLM through an API call, which can be directed to the LLM of your choice. This step really showcases our "Bring your own LLM" philosophy, where organizations can select the model that best fits their specific industry needs rather than being locked into a one-size-fits-all solution. 

Our architecture is designed to integrate seamlessly with any language model, offering enterprises unprecedented flexibility while maintaining exceptional performance and enabling completely customizable solutions that speak your industry's unique language.

So for Eros, the response might be:

"Hello, I'm the god of love! Let me tell you... you and your worst enemy? A perfect match!"

For this project, we used the simplest LLM setup – no function calls or retrieval-augmented generation (RAG).

Function calling allows an LLM to execute predefined functions in the codebase or interact with external APIs to retrieve live data, such as real-time weather updates. This is useful for AI systems that need to perform actions like booking reservations or fetching internal data. For example, a customer service bot could use function calling to check a user’s account balance.
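To make that concrete, here's a hypothetical `check_balance` tool: a JSON-schema description the LLM could be given, plus the dispatch glue that runs a call the model requests. The schema shape follows common function-calling APIs, but all names here are illustrative:

```python
import json


def check_balance(user_id: str) -> float:
    """Pretend lookup against an internal accounts system."""
    balances = {"farah": 42.50}
    return balances.get(user_id, 0.0)


# Schema handed to the LLM so it knows the tool exists and how to call it.
BALANCE_TOOL = {
    "name": "check_balance",
    "description": "Return a user's current account balance.",
    "parameters": {
        "type": "object",
        "properties": {"user_id": {"type": "string"}},
        "required": ["user_id"],
    },
}


def handle_tool_call(call: dict) -> str:
    """Run a function the LLM asked for and return the result as text,
    which is then fed back to the model for its final answer."""
    if call["name"] == "check_balance":
        args = json.loads(call["arguments"])
        return str(check_balance(args["user_id"]))
    raise ValueError(f"unknown tool: {call['name']}")
```

The model never executes anything itself – it only asks; your code runs the function and returns the result.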

RAG, on the other hand, enables the LLM to retrieve information from external sources like databases or documents before generating a response. A customer service bot, for instance, might use RAG to access a user’s recent order history.
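The retrieval half of RAG can be sketched just as simply. Here a hypothetical in-memory order store stands in for a real database or document index, and the retrieved text is prepended to the prompt before the LLM call:

```python
# Stand-in for an external data source (a real system would query a
# database, search index, or vector store here).
ORDER_HISTORY = {
    "farah": ["heart-shaped chocolates", "a dozen roses"],
}


def retrieve_orders(user_id: str) -> str:
    """Fetch context relevant to the user from the external store."""
    orders = ORDER_HISTORY.get(user_id)
    if not orders:
        return "No recent orders on file."
    return "Recent orders: " + ", ".join(orders)


def augment_prompt(user_id: str, query: str) -> str:
    """Prepend retrieved context so the LLM can ground its answer."""
    return f"{retrieve_orders(user_id)}\n\nUser: {query}"
```

The LLM then answers from the retrieved facts rather than from its training data alone.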

Since Eros relied solely on static context without function calling or retrieval, its responses were fully determined by the predefined system instructions.

Step 3: Text-to-speech (TTS)

Once the LLM generates a response, the final step is converting it back into speech using TTS.

There are multiple TTS providers available, each offering different voice styles, tones, and accents. The choice of TTS can significantly influence the personality of the agent. For Eros, we selected a voice that exaggerated its ridiculously bad matchmaking skills – something dramatic and over-the-top.

The generated response is then played back to the user, completing the conversational loop.
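In code, the TTS step often reduces to one wrapper where the voice parameter carries the personality. A hypothetical sketch (the voice name and byte format are illustrative, not a real provider's API):

```python
def speak(text: str, voice: str = "dramatic-narrator") -> bytes:
    """Stand-in for a real TTS API call returning playable audio bytes.

    A real implementation would send `text` and `voice` to your TTS
    provider and stream the returned audio to the user's device.
    """
    header = f"[voice={voice}] ".encode("utf-8")
    return header + text.encode("utf-8")
```

For Eros, we'd pick the most theatrical voice on offer; a customer service bot would want something calmer, with the same one-line change.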

The takeaway: conversational AI doesn’t have to be complex

What’s exciting about this project is how quickly you can go from concept to execution using just three core technologies. 

Whether you’re designing an enterprise-wide customer service bot or a playful AI matchmaker that absolutely shouldn’t be trusted, our approach to building voice technology differently means your conversational AI will be fast, accurate and unique.
