Content is everywhere. And not just written content. Video and audio too. And not just on social media or in our leisure time. Our work is filled with it. Internal meetings, external meetings, customer calls, the media we create as companies. Show us someone who keeps on top of it all, and we’ll show you a liar. Or at least someone who doesn’t sleep enough.
Speech-to-text technology was a great first step in alleviating some of these headaches. Accurately transcribing the audio in these various formats at least provides a searchable, scannable record, but reading through it all still takes an enormous amount of time.
The answer? Summarization.
We’ve recently launched Summarization into our unified API, our first step in delivering a comprehensive suite of speech understanding features for our customers.
Our Summarization uses abstractive summarization, a powerful technique in AI and natural language processing. With abstractive summarization, we analyze the input, extract the key points, and produce a summary that captures the essence of the content. Unlike extractive summarization, which rearranges (and reduces) existing text, this advanced approach involves comprehending the content and generating new language for a more effective summary.
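To make the contrast concrete, extractive summarization can be sketched very simply. The toy below ranks sentences by word frequency and returns the top ones verbatim – no new language is generated. This is purely illustrative and is not how Speechmatics’ Summarization works:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Toy extractive summarizer: score each sentence by the frequency of
    its words across the whole text, then return the top-scoring sentences
    in their original order. No new text is generated."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document.
    freq = Counter(w.lower() for w in re.findall(r"[a-zA-Z']+", text))
    # A sentence's score is the sum of its word frequencies.
    scores = {i: sum(freq[w.lower()] for w in re.findall(r"[a-zA-Z']+", s))
              for i, s in enumerate(sentences)}
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:n_sentences])
    return " ".join(sentences[i] for i in top)
```

The result is always a subset of the original sentences, which is why extractive output can feel stilted: it cannot rephrase or condense across sentence boundaries the way an abstractive model can.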
We offer flexibility in our Summarization output by letting users choose the content type, the summary type, and the summary length, as depicted below.
Let’s talk through some of the things you can see above:
Auto – The Auto output option detects the style of content and chooses the best summary style to match the audio input.
Conversational – Our conversational summary format is designed to perform well for unstructured dialogues and conversations with multiple people, making it ideal for meetings, sales calls, and contact centers.
Informative – Our informative summary format works best for podcasts, news, and other media content. Its output is more structured, suiting information delivered by one or more speakers.
Brief – provides a succinct summary, condensing the content into just a few sentences.
Detailed – provides a longer, structured summary. For conversational content, it includes key topics and a summary of the entire conversation. For informative content, it logically divides the audio into sections and provides a summary for each.
Pretty self-explanatory this one – you can either choose full sentences and prose or have the output arranged into bullet points.
This gives users a huge amount of flexibility in the type of summaries they want (and find most useful). You can read our examples and tutorials in our full documentation.
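As a sketch of how these options might be combined in a request, the helper below assembles a job configuration from the choices described above. The exact key names (`summarization_config`, `content_type`, `summary_length`, `summary_type`) are assumptions based on the options listed here, so treat the full documentation as the authoritative schema:

```python
# NOTE: the key names below are illustrative assumptions inferred from the
# options described in this post -- check the Speechmatics docs for the
# authoritative schema before using them.

def build_job_config(content_type="auto", length="brief", style="bullets"):
    """Assemble a transcription job config with summarization enabled."""
    assert content_type in {"auto", "conversational", "informative"}
    assert length in {"brief", "detailed"}
    assert style in {"bullets", "paragraphs"}
    return {
        "type": "transcription",
        "transcription_config": {"language": "en"},
        "summarization_config": {
            "content_type": content_type,
            "summary_length": length,
            "summary_type": style,
        },
    }

# e.g. a detailed, bulleted summary of a multi-speaker sales call:
config = build_job_config(content_type="conversational", length="detailed")
```

Keeping the three choices as independent parameters is what gives the mix-and-match flexibility described above: any content type can be paired with any length and either output style.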
So, what's powering all of this?
If you haven’t heard of ChatGPT, where have you been? It’s hard to now think of a world before the endless LinkedIn articles outlining ‘Ten Ways to Start A Side Hustle Using ChatGPT’. Now ChatGPT, and the Large Language Models (LLMs) that power it, seem to be everywhere.
So, what exactly are Large Language Models?
Well, let’s ask ChatGPT (very meta of us):
LLMs have a wide range of applications, including language translation, content creation, text summarization, chatbots & conversational AI, speech understanding, sentiment analysis, and more.
They also exhibit a couple of key characteristics that make them particularly useful for summarization:
Contextual Understanding: this means they can comprehend the meaning of a word or phrase based on the surrounding context.
Creative Text Generation: LLMs can generate coherent and contextually relevant text, making them capable of tasks like writing articles, stories, code, and even engaging in conversations.
This makes them a great fit for taking a long transcript and creating a summary from it. Rather than simply removing words until you’re left with a short (but probably stilted) summary, they can generate new sentences based on their understanding of what was said, given the context.
While LLMs are clearly powerful, they do have certain limitations. One key challenge is the amount of text you can pass in at a given time – sometimes this limit can be as low as 3,000 words. To put this into perspective, a 1-hour transcript might contain as many as 9,000 words. That might seem like a major blocker, since the aim of summarization is to take extremely long transcripts and make them more digestible.
Well, fear not.
Speechmatics’ Summarization enables you to summarize files of any duration. This is particularly useful in scenarios where you’d like to summarize day-long meetings or workshops. So, if you were on a beach, or on a yoga retreat, or simply binge-watching Ted Lasso, you will still be able to catch up on what you missed.
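One common way around a fixed context window – not necessarily how Speechmatics implements it internally – is hierarchical (sometimes called “map-reduce”) summarization: split the transcript into chunks that fit the limit, summarize each chunk, then summarize the combined chunk summaries. A minimal sketch, where `summarize` is a placeholder for any LLM call that maps text to shorter text:

```python
def chunk_words(text, max_words=3000):
    """Split a transcript into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def hierarchical_summary(transcript, summarize, max_words=3000):
    """Summarize each chunk, then summarize the combined chunk summaries.

    `summarize` stands in for a real LLM call (an assumption for this
    sketch: any callable mapping text -> shorter text will do)."""
    chunks = chunk_words(transcript, max_words)
    partials = [summarize(c) for c in chunks]
    combined = "\n".join(partials)
    # If the combined partial summaries still exceed the limit, recurse.
    if len(combined.split()) > max_words and len(chunks) > 1:
        return hierarchical_summary(combined, summarize, max_words)
    return summarize(combined)
```

Because the recursion only bottoms out when the combined summaries fit within the window, the same approach scales from a one-hour call to a day-long workshop.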
Our team have worked hard to incorporate the latest advancements into our speech APIs as they emerge, whilst adding functionality and removing limitations to make them as valuable as possible to our users.
Summarization can give every agent a summary of customer interactions. This not only reduces admin time but also allows other agents to review previous conversations quickly, focusing on dispute resolution and customer experience rather than retreading old ground (and adding to frustration). Summaries can also be used to automate tasks, as well as being used by supervisors and sales enablement teams.
Enhance team communication, whilst also saving time, with accurate notes presented in an easy-to-digest and shareable way. Everyone interested can stay informed, even if they didn’t attend the meeting in question.
Provide key takeaways and highlights for all content created, which can be used for descriptions, recaps, and to make content searchable. Viewers who missed content can stay in the loop and engage with content even when they are tight on time.
Summarization has already launched in our Portal – you can create a Portal account free, right now, and start generating useful summaries to your heart’s content (well, up to 8 hours per month for free).
The amount of content being created isn’t going down, and it won’t. LLMs and APIs like Speechmatics’ give you a powerful new tool to help improve productivity, increase collaboration, and make the most of the time you have. You might even be able to convince your friends that you’re an expert coffee grinder, even if you didn’t make it through those 90 minutes.