Jan 19, 2022 | Read time 4 min

Understanding Children’s Voices: How Voice-to-Text Assists eLearning

Read about how the pandemic has had a profound effect on education, how voice-to-text technology misunderstands young voices, and if we can rely on it to help educate the next generation.
Benedetta Cevoli, Senior Machine Learning Engineer


The COVID-19 pandemic has had an enormous impact on the education of children the world over. With school closures and teacher absenteeism, the once-regimented structure of education was disrupted like never before. As we start to piece back together the process of learning, could speech-to-text recognition not only help prevent children slipping further behind, but also help them catch up to pre-pandemic levels?

The acceleration of eLearning tools due to the pandemic has been remarkable. China, for example, sent 250 million children home for online classes – but learning still had to continue. The World Economic Forum reports that Zhejiang University moved more than 5,000 courses online a mere two weeks after the transition began, using a system called DingTalk ZJU.

This home learning was replicated all over the world, with more than a billion students across at least 210 countries and territories affected by the pandemic and stay-at-home orders. This forced teachers and students alike to become familiar with scanning documents, hosting online sessions, and other unfamiliar technologies – transcription software included.

Seen But Not Heard

Far too often, once children reach around the age of two, we bracket them into “those who speak” and “those who don’t”. We take for granted that children continue learning to speak well beyond those early years. While adults have heard most everyday words time and time again – and used them just as much – most children discover new ones practically every day. When they repeat them, they do so in different ways, getting used to the sounds these words make. This process of trial and error is one of many reasons why current voice recognition isn’t as accurate with children as it is with adults.

But children’s voices differ from adults’ in a variety of ways too. It’s not just the obvious difference in pitch, but the speech patterns themselves, which often trip up voice recognition. As reported by TechCrunch, children can stress different parts of words than adults do, over-enunciate, punctuate their speech differently and carry fewer common cadences. All of this has historically led to children’s voices being disproportionately failed by a technology focused primarily on adults.

Engaging Young Voices

Arguments have been made for some time now that subtitles can play a huge role in helping develop literacy skills. After all, the more children get to see words in action, the easier it is for them to understand and repeat them. It stands to reason, then, that live captioning in classrooms would have similar benefits. But when young voices are still often misunderstood by speech-to-text technology, can we really rely on it to help teach our children? The answer is obvious: make the technology more accurate.

And that’s exactly what we’ve done at Speechmatics. We’ve seen our accuracy improve hugely and the gap between adult and child speech shrink dramatically, thanks in large part to the introduction of self-supervised learning (SSL) into our training.

Before SSL, there was a frustrating scarcity of data to train on, especially when it came to young voices. We were limited to around 30,000 hours of audio, most of it adult speech. With SSL, we can now train on 1.1 million hours of audio, dramatically increasing the number of children’s voices represented.
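Why does SSL unlock so much more data? Because it learns from unlabeled audio: instead of needing a human transcript for every recording, the model masks out spans of the audio and learns to predict what was hidden. The sketch below is a minimal, illustrative NumPy example of that span-masking objective (in the style of wav2vec 2.0-type pretraining) – it is not Speechmatics’ actual pipeline, and the span length, masking probability, and the trivial mean-frame “predictor” are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_spans(features, span=10, mask_prob=0.065):
    """Pick random start frames and mask fixed-length spans, as in
    wav2vec 2.0-style self-supervised pretraining. No transcripts needed."""
    num_frames = features.shape[0]
    mask = np.zeros(num_frames, dtype=bool)
    starts = rng.random(num_frames) < mask_prob
    for t in np.flatnonzero(starts):
        mask[t:t + span] = True
    masked = features.copy()
    masked[mask] = 0.0  # a real model substitutes a learned mask embedding
    return masked, mask

# Unlabeled audio represented as 500 frames of 39-dim MFCC-like features.
feats = rng.standard_normal((500, 39))
masked_feats, mask = mask_spans(feats)

# The SSL objective: reconstruct the original content of the masked frames.
# Here a trivial "model" just predicts the mean of the visible frames.
prediction = feats[~mask].mean(axis=0)
loss = float(np.mean((feats[mask] - prediction) ** 2))
```

Because the objective needs only raw audio, any recording of a child’s voice – no transcript attached – becomes usable training signal, which is what pushes the usable corpus from tens of thousands of hours into the millions.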

Now that the majority of children have returned to the classroom, the technology adopted at the height of the pandemic has become part of the everyday. When that “everyday” is a classroom full of noise, speech-to-text software faces yet another obstacle on its path to accuracy: background noise. Here, too, Speechmatics has seen incredible success with our Autonomous Speech Recognition. Whether the challenge is pitch, reverb or volume, our latest round of testing shows we’re head and shoulders above our competitors in accuracy.
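One common way to build robustness to classroom-style noise is data augmentation: perturbing clean training audio with gain changes, additive noise at a random signal-to-noise ratio, and simulated reverberation. The snippet below is a toy illustration of that general technique, not a description of how Speechmatics trains its models; the parameter ranges and the crude one-tap echo are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def augment(audio, sample_rate=16000):
    """Roughly simulate classroom conditions: random gain, additive
    noise at a random SNR, and a single 30 ms echo as a stand-in for reverb."""
    out = audio * rng.uniform(0.5, 1.5)            # volume perturbation

    snr_db = rng.uniform(5.0, 20.0)                # target signal-to-noise ratio
    noise = rng.standard_normal(len(out))
    signal_power = np.mean(out ** 2)
    noise_power = np.mean(noise ** 2)
    noise *= np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    out = out + noise                              # additive background noise

    delay = int(0.03 * sample_rate)                # 30 ms delayed copy
    echo = np.zeros_like(out)
    echo[delay:] = out[:-delay] * 0.3
    return out + echo                              # crude "reverb"

clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s test tone
noisy = augment(clean)
```

Training on many such perturbed copies of each utterance exposes the model to the kinds of acoustic variation a real classroom produces, so accuracy degrades far less when the microphone picks up chatter, echo, or uneven levels.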

Benedetta Cevoli, Data Science Engineer, Speechmatics
