Dec 8, 2020 | Read time 4 min

Solving the problem of accents for speech recognition languages

Global Spanish solves the accent problem for speech recognition languages by supporting all major Spanish accents and dialects for use in transcription.
Speechmatics Editorial Team

The challenge when it comes to global languages

With approximately 500 million speakers globally, Spanish is the second most natively spoken language in the world – and fourth most spoken language overall. But its global appeal and diversity of accents and dialects mean Spanish poses a significant challenge when it comes to providing consistent and accurate speech-to-text transcription.

To get accurate transcripts from Spanish speakers, speech-to-text technology providers usually create multiple Spanish language packs, each specializing in a specific region or speaker profile. But, in the real world, audio files often include more than one speaker from multiple regions, all with different accents, dialects and idiosyncrasies.

Deploying accent-specific language packs requires organizations to make a best guess as to the appropriate language pack to use for each audio file. It also requires them to host and store multiple language packs for one language – adding to operational complexities and costs for what should be an efficient, automated process and workflow.

There is also the costly and time-consuming problem of running a transcription multiple times through accent-specific language packs when a single audio file contains speakers with different Spanish accents. In an interview between a Mexican and a Spaniard, for example, two transcription runs would be needed to get the best accuracy for each speaker – one using a Mexican-Spanish model and one using a Castilian-Spanish model. If only the Mexican-Spanish model were used, the Castilian accent might not be recognized accurately.

A new machine learning approach to dealing with accent and dialect variations

The Speechmatics approach is different – we are the first company to do away with creating multiple language packs for different accents and dialects. Instead, we build each language pack with our unique Automatic Linguist (AL) machine learning framework. AL was a winner in the Innovation category of the 2019 Queen's Awards for Enterprise.

We started by creating a pioneering Global English language pack encompassing all major English accents and dialects. We then turned our attention to Spanish and created a Global Spanish language pack.

The benefits of the Speechmatics Global Spanish language pack

Our unique approach involves using machine learning to create a single, comprehensive language pack, accurately encompassing as many variations of Spanish as possible. For most real-world applications, this gives the most reliable, accurate and efficient performance for our customers and partners.

Our single language pack solution means users do not need to identify which Spanish variant is being spoken. When audio files feature multiple speakers with different accents – or where speaker accents are not known in advance – Global Spanish provides reliable results over a broader range of speakers.
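To make the operational difference concrete, here is a minimal sketch. The pack names, config shape, and helper function are hypothetical illustrations, not the actual Speechmatics SDK or API: the point is simply that a single global pack means one identical job configuration for every file, while accent-specific packs force a per-file best guess.

```python
# Hypothetical accent-specific packs a provider might otherwise maintain.
ACCENT_SPECIFIC_PACKS = {"es-ES", "es-MX", "es-AR"}

def job_config(accent_guess=None, use_global_pack=True):
    """Build an (illustrative) transcription-job config.

    With a single Global Spanish pack, every file gets the same config;
    without it, the caller must supply a best-guess regional pack per file.
    """
    if use_global_pack:
        return {"transcription_config": {"language": "es"}}
    if accent_guess not in ACCENT_SPECIFIC_PACKS:
        raise ValueError(f"unknown accent pack: {accent_guess!r}")
    return {"transcription_config": {"language": accent_guess}}

# The same config works whether the speakers are Mexican, Argentinian,
# or Castilian -- no upfront accent identification is needed.
mixed_accent_files = ["interview_mx_es.wav", "call_ar.wav"]
configs = {f: job_config() for f in mixed_accent_files}
```

Note that in the accent-specific branch a wrong or unknown guess fails outright, which is exactly the per-file decision a global pack removes.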

In addition, by focusing resources on maintaining and updating fewer language packs, Speechmatics can increase quality, improve accuracy and ensure reliability for our customers and partners.

Global Spanish in the real world

A survey we conducted in 2019 found that Spanish and English are the two most important languages for the contact center industry.

As brands look to grow their reach, they also have to meet customer expectations and optimize their experience to drive loyalty and reduce churn. This means delivering localized and personalized services to those customers. The ability to use any-context speech recognition technology to transcribe Spanish accurately enables contact centers to use voice data to improve customer experiences and empower agents.

A 2016 survey by ICMI discovered that 57% of customers expect the service from their contact center to be in their native language – as opposed to the primary language of the contact center.

How we are innovating to deliver better performance across more speech recognition languages

Speech-to-text technology has advanced hugely in recent years, giving step-change improvements in a field used to marginal gains. In particular, modern neural network architectures are capable of generalizing across variations in speech. Deep neural networks, with their many layers between input and output, can learn representations that span a wide range of accents and dialects – effectively giving us the performance of a variety of specialized models, all in one comprehensive language pack.

A single modern server is more powerful than the room-filling supercomputers of the past. This rise in compute power, coupled with the repurposing of GPUs for machine learning, allows Speechmatics to train models on far more data, supporting more variations in a single language pack.

By investing more time in gathering data from a wide range of sources, we have created a huge and diverse training corpus. This allows us to train models with a much wider range of applications than ever before. Speechmatics is already delivering leading levels of accuracy in speech recognition.

We are also investing in research and development to find new ways of solving problems to help our customers and partners innovate with voice. These approaches will deliver even better levels of accuracy across more speech recognition languages while making it easier to operate the Speechmatics solution.
