Oct 17, 2023 | Read time 5 min

The Future of Media: ASR and Speech Intelligence

Capturing the spoken word - how ASR and AI are transforming media content.
Will Williams, Chief Technology Officer

Automatic speech recognition (ASR) can transcribe and caption video content in real time. This saves human transcribers a huge amount of labor and shows regulators a commitment to accessible content.

But can it be more? Definitely.

Combine ASR with the latest innovations in artificial intelligence (AI) and it delivers value to media companies that far exceeds acceptably accurate captioning.

The changing face of video

Video is fast becoming the preferred format for media consumption. In 2022, demand for video content skyrocketed, accounting for nearly 66% of total internet volume in the first six months of the year – a 24% increase on the same period in 2021.

This growing preference for video content has sparked nuances in the way audiences consume it. Younger cohorts are showing a preference for the short-form videos native to TikTok, which helped the format steal internet traffic volume from more traditional social networking sites last year. Similarly, second screening is on the rise, with viewers consuming media on mobile devices at the same time as watching video on a larger screen.

The result is a devaluing of audio. If you're consuming video on a mobile phone outside of the home, sound might not always be appropriate. And if you're watching two things at once, you can only choose to listen to one.

Audiences that are increasingly used to seeing video content captioned are also more comfortable relying on those captions. Four out of five viewers aged 18-25 use subtitles all or part of the time when watching content, compared to just one in four viewers aged 56-75.

In short, captioning has moved on from an exercise in accessibility to an essential component of video content. Only two-thirds of uncaptioned content is watched until the end, while 91% of videos with captions are viewed in their entirety.

AI media captioning

The behaviors around video content consumption might be changing at pace, but the media industry has been slower to cotton on to the commercial implications of that.

Media companies and independent software vendors have seen AI develop to the point at which ASR and automated captioning have become accessible to even the smallest budgets. This, coupled with the belief that media captioning is little more than a compliance requirement, means that for many product teams, ASR has become a begrudged line on their budget: an item sought at the cheapest possible price per hour, or cost per minute, while still providing acceptable accuracy.

In a market where audiences increasingly expect and rely on captions, it's a short-sighted approach. Videos with captions have an increased reach of 16% compared to those without, and that's before we've even addressed the additional applications and functionality that ASR can support when combined with capabilities powered by large language models (LLMs).

Speech Intelligence for media

The development of LLMs simplified automated captioning, enabling it to achieve better accuracy than human transcribers – saving time and reducing errors in the process. However, the introduction of LLMs and advancements in AI mean transcription is only a small part of a much bigger picture.

This bigger picture – the combination of ASR with AI capabilities – is Speech Intelligence. More than just transcription, Speech Intelligence is the key to unlocking value from the spoken word through a collection of features and capabilities powered by AI. Built on ASR and integrated into media distribution and captioning platforms, it can fuel the growth of both software providers and their end users.

Converting verbal content into text opens up a whole world of audiences to media companies. Translation capabilities mean captions can be provided in multiple languages, in real time, making content accessible to the broadest possible audience with minimal additional work. Speechmatics' Ursa model, for example, can create live captions in both the original spoken language and 69 supported language pairs.
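In practice, requesting a transcript plus translated captions is typically a single job submission. The sketch below builds a job configuration in the general shape of the Speechmatics batch jobs API; the endpoint URL, the `operating_point` option, and the exact field names should be treated as assumptions rather than a definitive integration.

```python
# Minimal sketch: one job config asking for a transcript in the source
# language plus translations for additional caption tracks.
# Field names follow the Speechmatics jobs API shape, but verify them
# against the current API reference before use.
import json

API_URL = "https://asr.api.speechmatics.com/v2/jobs"  # assumed endpoint


def build_job_config(source_language: str, target_languages: list) -> dict:
    """Build a job config requesting a transcript plus translations."""
    return {
        "type": "transcription",
        "transcription_config": {
            "language": source_language,
            "operating_point": "enhanced",  # assumed option name
        },
        "translation_config": {
            "target_languages": target_languages,
        },
    }


config = build_job_config("en", ["fr", "de", "es"])
print(json.dumps(config, indent=2))
```

The point of the shape: translation is an additive config block, so adding French, German, and Spanish caption tracks to an existing transcription workflow is a configuration change, not a new pipeline.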

These capabilities aren't limited to live audiences, either. Media companies can unlock additional value from their back catalogs with Speech Intelligence. Foreign language captions can be automatically applied to existing content, making it internationally accessible for the first time.

Features like summarization work content even harder. This allows media companies to automatically create episode summaries, produce show notes, and highlight key insights – in multiple languages – with the click of a button. Similarly, topic detection enables a huge volume of existing content to be quickly tagged in multiple languages, ensuring back catalogs are easily navigable to staff and audiences around the world. Not only does this save end-users time, but it expands their total addressable market without the need to increase their team and enhances audience engagement.
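Summarization and topic detection can ride along with the same job. Below is a hedged sketch of enriching a base transcription config with both; the `summarization_config` and `topic_detection_config` blocks and their values mirror the Speechmatics jobs API but are assumptions here, so check the current documentation before relying on them.

```python
# Sketch: extend a transcription job so a back-catalog episode comes back
# with a brief summary and topic tags in one pass. Config field names are
# assumed, not guaranteed.
import json


def enrich_job_config(base: dict, topics: list) -> dict:
    """Return a copy of `base` with summarization and topic detection added."""
    enriched = dict(base)  # shallow copy; base config is left untouched
    enriched["summarization_config"] = {
        "content_type": "informative",  # assumed value
        "summary_length": "brief",      # assumed value
    }
    enriched["topic_detection_config"] = {"topics": topics}
    return enriched


base = {
    "type": "transcription",
    "transcription_config": {"language": "en"},
}
job = enrich_job_config(base, ["sport", "politics", "technology"])
print(json.dumps(job, indent=2))
```

Run against an archive, a batch of such jobs would yield per-episode summaries and consistent tags without anyone re-listening to the content.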

Sentiment analysis of recorded speech can further increase reach, engagement and accessibility of content. For example, while captioning and transcription make content accessible to the deaf and hard of hearing, sight-impaired audiences can gain extra insight from audio tags giving information on the emotion and sentiment of the speaker.

With the right speech partner, media captioning platforms can leverage Speech Intelligence to deliver a host of valuable functionality for their customers, from the obvious efficiency gains of reducing manual transcription to more differentiated features that utilize translation, sentiment analysis and summarization.

Foundational accuracy

Speech Intelligence has the potential to increase engagement, platform utility and content reach – but the outputs are only as good as the accuracy of the underlying speech-to-text model these applications are built on. To successfully make use of the spoken word, it needs to be captured accurately and understood fully by ASR models that can reliably record a range of different dialects, accents and demographics. Without highly accurate speech-to-text capabilities, downstream applications will have limited usability or may not work at all for speakers from certain backgrounds.

Stand out with Speech Intelligence

Moving away from a focus on just captioning and transcription gives product leaders scope to create media captioning and distribution platforms that delight their partners and add real value. The spoken word is our primary means of communication, and the applications that can be built on it are infinite.

As video consumption continues to change, Speech Intelligence will be the means by which product teams deliver a platform that is noticeably differentiated. It will ensure your product leads in the market – however that market may change.
