May 16, 2023 | Read time 7 min

Autoscaling with GPU Transcription models

Speechmatics has recently switched from CPUs to GPUs to run most batch transcription models. Better hardware = increased accuracy. Find out more!
Adam Walford, Senior Site Reliability Engineer

How Do We Measure Performance?

Speechmatics has recently moved to using GPUs to run its new transcription models. The new models are significantly more accurate, and the new hardware running them makes them significantly faster. To ensure our batch transcriptions are performing efficiently, we use the Real-Time Factor (RTF): the ratio of the time taken to transcribe a file to the duration of the audio itself.
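As a quick illustration of the ratio (the numbers here are invented, not our benchmarks), an RTF below 1.0 means a file is transcribed faster than real time:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio file.
    Values below 1.0 mean we transcribe faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

# A 60-minute file transcribed in 6 minutes gives an RTF of 0.1.
print(real_time_factor(6 * 60, 60 * 60))  # 0.1
```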

During the development and testing of our transcription and translation models, we used the RTF value as a guideline for how efficient our new GPU models were, whether that meant speeding up transcription times or running significantly larger models that provide better accuracy at the same RTF.

While this is great, GPU hardware can get costly, and we want to guarantee this improved performance while remaining cost-efficient.

The Challenge

Previously, batch transcriptions were processed in their own isolated Kubernetes pods and only required CPU hardware. We could fit multiple transcribers onto a CPU node, and each transcriber would get its own dedicated resources without impacting other running jobs.

With the move to GPU, we are now leveraging Nvidia’s Triton Inference Server, and transcription jobs run against a shared inference server. While this is much more performant than running on CPU, we lose that isolation: multiple transcription jobs now share GPU resources.

An overloaded Triton server handling a large number of requests or complex audio will start to impact RTFs across all of our transcription jobs. To avoid this, we need enough GPU capacity to process whatever jobs come through. With relatively unpredictable traffic, running GPUs at maximum capacity constantly gets expensive; instead, we need to scale our GPUs based on demand.

GPU Autoscaling on the Right Metrics

Kubernetes horizontal autoscaling is typically based on metrics such as CPU and memory, but these values would not paint a full picture of our transcription performance. Before rolling out to production, we ran numerous load tests against Triton to determine where a single server would begin to fail and to identify any measurable symptoms that would allow us to respond effectively.

We could not simply rely on the previously identified metrics of RTF and job rate. RTF would react too late, as the jobs would already have been processed, and we found early on that Triton’s performance can be affected by factors other than job rate; for instance, more complex audio at the usual job rate can still cause a bottleneck. Additionally, neither of these metrics is strictly indicative of a struggling Triton server, so we needed to find a metric from Triton itself that signals the performance issues which lead to an increase in RTF.

Triton exposes a /metrics endpoint which provides information about the number of inferences executed, inference duration, and much more. A full list of metrics can be found in Triton’s metrics documentation.
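The endpoint serves plain Prometheus text format. A minimal sketch of parsing a scrape into metric values (the metric names below are real Triton metrics, but the sample values are invented for illustration):

```python
import re

def parse_prom_metrics(text: str) -> dict:
    """Parse Prometheus text exposition format into
    {metric_name: [(labels, value), ...]}, skipping HELP/TYPE comments."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = re.match(r'^(\w+)(\{[^}]*\})?\s+(\S+)$', line)
        if m:
            name, labels, value = m.group(1), m.group(2) or "", m.group(3)
            metrics.setdefault(name, []).append((labels, float(value)))
    return metrics

# Example scrape output (values invented):
sample = """
# HELP nv_inference_count Number of inferences performed
nv_inference_count{model="enhanced",version="1"} 512
nv_inference_queue_duration_us{model="enhanced",version="1"} 1048576
"""
parsed = parse_prom_metrics(sample)
print(parsed["nv_inference_count"][0][1])  # 512.0
```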

As Triton receives inference requests, they are placed into a queue. Triton forms a batch of inference requests from the queue and then uses the GPU to process them in parallel. When a Triton server is put under load, we can start to see the relationship between batch size and inference queue duration.
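Both queue duration and inference count are cumulative counters, so the average time a request waited in the queue over a window comes from the deltas between two scrapes. A sketch with invented sample values:

```python
def avg_queue_wait_us(prev_duration_us: float, prev_count: float,
                      cur_duration_us: float, cur_count: float) -> float:
    """Average microseconds an inference request spent in Triton's queue
    between two scrapes of the cumulative counters."""
    inferences = cur_count - prev_count
    if inferences <= 0:
        return 0.0  # no new inferences between scrapes
    return (cur_duration_us - prev_duration_us) / inferences

# Between two scrapes: 2,000,000 us of extra queue time over 400 new inferences.
print(avg_queue_wait_us(1_000_000, 100, 3_000_000, 500))  # 5000.0
```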

These metrics are a much closer representation of Triton’s performance and we can clearly see where our Enhanced model can start to get backed up with inference requests. So how do we tell Kubernetes that we need more GPUs if we start to see these spikes occur?

KEDA + Prometheus

To provide more flexibility and functionality in Kubernetes autoscaling, we looked into Kubernetes Event-Driven Autoscaling (KEDA), an open-source tool that translates various data sources into metrics that Kubernetes autoscaling can understand. KEDA’s Prometheus integration allows us to scale on metrics exposed by Triton Server’s /metrics endpoint once they are scraped by a Prometheus server. We can use PromQL to define the metrics we wish to scale on, and KEDA will translate them into a HorizontalPodAutoscaler (HPA) that Kubernetes understands.

We could have used Prometheus Adapter to parse these metrics, but KEDA has the flexibility of reading metrics from various data sources, which has now come in handy for scaling out other microservices in our Batch SaaS which rely on PostgreSQL.

We deployed the KEDA Helm chart with its CustomResourceDefinitions (CRDs) onto each of our clusters. Each Triton server deployment then has its own ScaledObject telling KEDA where to get the metrics from and which query to scale on.

The ScaledObject query takes Triton’s nv_inference_queue_duration_us and nv_inference_count metrics to infer the maximum amount of time an inference request has spent in the queue, across all of our GPU nodes in the cluster. Additionally, it filters specifically for the Enhanced models so we can scale our Standard and Enhanced models independently, depending on traffic load.
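A sketch of what such a ScaledObject could look like. The names, addresses, label matchers, and threshold here are illustrative assumptions, not our production configuration:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: triton-enhanced            # hypothetical name
spec:
  scaleTargetRef:
    name: triton-enhanced          # the Triton Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # assumed address
        metricName: triton_queue_wait
        # Max (across nodes) of average queue time per inference,
        # filtered to the Enhanced models only.
        query: |
          max(
            rate(nv_inference_queue_duration_us{model=~"enhanced.*"}[2m])
            /
            rate(nv_inference_count{model=~"enhanced.*"}[2m])
          )
        threshold: "100000"        # e.g. 100 ms average queue wait, in microseconds
```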

Once applied, this creates Kubernetes HPAs under the hood.

Additional Scaling Issues

While the elasticity of scaling is great, there are still some limitations we faced with our initial scaling implementation.

We use a full GPU to run our transcription models, and our Kubernetes nodes come with only one GPU, so scaling up Triton also means adding a new node every time. On top of this, every new node needs to pull the Triton container image before it can start processing requests.

While testing, we found that, on average, it takes around 5-7 minutes for a new GPU node to become available in the cluster and up to 2 minutes to pull the Triton image. Watching the autoscaler respond when Triton is under stress, the new pods come online almost 10 minutes after the peak in Triton queue duration!

This slow scale-up would leave Triton struggling for far too long and is of little use for shorter bursts in traffic, so we needed a way to speed up the scale-up.

AKS provides a feature that allows you to set a node pool’s scale-down-mode to Deallocate, which does not fully destroy a VM when it scales down. Instead, it simply stops the VM and keeps its disk ready for when the node is needed again in the future. This way, the node comes online much faster, and the container image is already available on disk.
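On AKS this is configured per node pool. A sketch of the command, with placeholder resource names:

```shell
# Keep deallocated VMs (and their disks, with the Triton image already pulled)
# when the autoscaler removes nodes. Resource names below are placeholders.
az aks nodepool update \
  --resource-group my-rg \
  --cluster-name my-cluster \
  --name gpunodepool \
  --scale-down-mode Deallocate
```

Note that deallocated VMs still incur disk storage costs, so this trades a small ongoing cost for much faster scale-up.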

From testing, we can see the difference in scale-up time. The charts below show a comparison of how long a Triton pod spent in a pending state before it was ready to receive inference requests.

The left column is from our original autoscaling implementation: of the 25 minutes the pod is alive, it spends 40% of that time pending, waiting for a node to become available. The right column shows our scale-up after applying deallocate on scale-down: the pod spends just 12.5% of its 24 minutes alive in pending, more than halving our scale-up time!

Conclusion

Running both transcription and translation models on GPUs can become very costly, and many businesses look to autoscaling for cost efficiency. While this is crucial in the long term, it is important to consider the accuracy of the autoscaling implementation to ensure reliability is maintained under more stressful traffic loads. Autoscaling with Kubernetes has been around for a while, but as applications grow more complex, there are many more factors to consider in when and how we scale. Tools like KEDA are incredibly useful for providing more accurate autoscaling, and features like deallocating nodes on scale-down can help accelerate that flexibility for larger applications.

Author: Adam Walford
Acknowledgements: Alex Wicks & Owen O'Loan
