Jul 8, 2023 | Read time 4 min

YouTube’s Captions Represent the Dire Need for Speech-to-Text Innovation

YouTube’s automated captioning service is notoriously unreliable and represents the dire need for innovation within the speech-to-text industry. Find out what we’re doing about it.
Benedetta Cevoli, Senior Machine Learning Engineer

The Problem with YouTube's Captions

When you think of speech-to-text, you likely think of captioning. It's the use of speech-to-text that most people will recognize, and with good reason. Research has shown huge benefits of video captions that go beyond supporting viewers with hearing loss. Captions play a key role in social media engagement and are quickly becoming a must-have for any content creator.

YouTube is one of the most popular sites on the internet, ranking only behind Google (which owns YouTube anyway). The service provides auto-captioning: AI that converts speech to text as quickly and as accurately as possible. Accuracy, however, isn't a guarantee.

And because of that, some people decide to disable auto-captioning completely and use their own captions to make sure that their content is accurately transcribed. This is particularly true for official channels or large channels with big audiences. These channels often have the budget to take care of their own captioning, so you may see differing results on YouTube. For the most part, however, YouTube's auto-captioning is notoriously unreliable. 

This demonstrates the AI industry's constant need for innovation. According to 3Play Media, 80% of viewers use captions for reasons other than hearing loss, highlighting how captioning has grown beyond the need for greater accessibility.

Captions are a necessity – it's time we treat them that way. 

YouTube Is More Than Entertainment Now

Since its inception in 2005, YouTube has grown exponentially, amassing two billion users in 2022. It's clear, then, that YouTube has evolved beyond 'Charlie bit my finger.' Now, it's where you can learn anything and everything in an easy-to-consume, digestible way. Educational and potentially life-saving videos aren't available in every language, so millions of users rely on the captioning service.

According to a study, YouTube's automated captions are 60-70% accurate, which means roughly one in every three words is wrong. The accuracy rate depends greatly on audio quality, but the clear need for accurate captioning means that the AI must be able to cope with background noise, accents, and jargon.
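To make a figure like "60-70% accurate" concrete, here is a minimal sketch of how word-level accuracy is commonly scored. It assumes accuracy is reported as 1 minus the word error rate (WER), where WER is the word-level edit distance between a human reference transcript and the automatic captions, divided by the number of reference words. The study cited above does not state its method, so this is an illustration rather than a reproduction of it. The function names and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference transcript."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from captions
            insertion = curr[j - 1] + 1  # extra word in captions
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER, floored at zero (WER can exceed 1)."""
    return max(0.0, 1.0 - word_error_rate(reference, hypothesis))


# Hypothetical example: three of nine words are wrong.
reference = "the quick brown fox jumps over the lazy dog"
captions = "the quick brown box jumps over a lazy log"

print(f"WER: {word_error_rate(reference, captions):.2f}")      # WER: 0.33
print(f"Accuracy: {word_accuracy(reference, captions):.2f}")   # Accuracy: 0.67
```

In this made-up example, one wrong word in three gives an accuracy of about 67%, which sits inside the 60-70% range quoted above. Real evaluations usually normalize punctuation and number formatting before scoring, which this sketch leaves out.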

Of course, YouTube is a great platform for many reasons, but good-quality, accessible captions are a must. You'll likely still understand most of the text, but the margin for error is too large in this day and age.

Addressing the Problem

At Speechmatics, we know the importance of the accuracy of our speech-to-text engine. Here, we're using YouTube's automated captioning as an example of the dire need for innovation across the speech-to-text board. Good automatic speech recognition (ASR) allows users to save time and money – resources they can use to create enjoyable and helpful content. In that endeavor, we compared our ASR to our significant competitors using 24 YouTube videos, ranging from 'Every Outfit Winnie Harlow Wears in a Week: 7 Days, 7 Looks' to 'Diving World Cup 2021: Men's 10m Final.' We found that our ASR displayed levels of accuracy above 90% for content with multiple speakers and accents, background noise, and challenging vernacular. 

This is only possible due to the introduction of self-supervised learning. Before using it, we trained our ASR on approximately 30,000 hours of labeled audio content. This type of data is very costly and comes with big accessibility issues: some voices are simply left out. Since then, we've taken that number closer to 1,100,000 hours by using a wealth of unlabeled data to improve our engine. Self-supervised learning is helping us to bridge the gap between well-curated, labeled speech representing only a selection of speakers and varied, everyday speech that covers a breadth of voice cohorts.

It's pretty straightforward – you get better results when you put accuracy first. 

Captioning's Prominence Shows No Signs of Slowing Down

We will continually train our ASR with as many voices as possible. We will also continue to ensure we carry out our aim to understand every voice and create genuinely accessible ASR. 

It's a good thing, too, as captioning spreads across the internet. For example, TechTimes reported that Twitter is working on implementing closed captions on the site. The AI is thought to be coming from the company itself, so we can't comment on its potential accuracy. We do know one thing, however: the shift in perspective on captioning is a welcome one. Captioning's move from a mere add-on to a necessity means the market will keep innovating to produce the most accurate and accessible engine.

For us, that has always been the aim of the game. When you prioritize accuracy, you spend less time acknowledging the ASR's mistakes and more time absorbing the content. 

Isn't that what every streaming service or content producer out there wants?

Benedetta Cevoli, Data Science Engineer, Speechmatics
