May 16, 2023 | Read time 7 min

Autoscaling with GPU Transcription models

Speechmatics has recently switched from CPUs to GPUs to run most batch transcription models. Better hardware = increased accuracy. Find out more!
Adam Walford, Senior Site Reliability Engineer

How Do We Measure Performance?

Speechmatics has recently moved to using GPUs to run its new transcription models. The new models are significantly more accurate, and the new hardware running them makes them significantly faster. To ensure our batch transcriptions are performing efficiently, we use the Real Time Factor (RTF). This ratio measures how quickly files are being transcribed with respect to the audio file duration.
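As a rough illustration, and assuming the common definition of RTF as processing time divided by audio duration (so lower is faster), the ratio can be computed as below. The function name and the example figures are hypothetical and are not taken from our production code.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (assumed RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical example: a 60-minute file transcribed in 6 minutes.
rtf = real_time_factor(processing_seconds=360, audio_seconds=3600)
print(f"RTF = {rtf:.2f}")  # prints "RTF = 0.10"
```

Under this definition, an RTF below 1 means a file is transcribed faster than it would take to play back.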

During the development and testing of our transcription and translation models, we used the RTF value as a guideline for how efficient our new GPU models were, whether it be by speeding up transcription times or using significantly larger models to provide better accuracy for the same RTF.

While this is great, running on GPU hardware can get costly, and we want to guarantee this improved performance whilst also remaining cost-efficient.

The Challenge

Previously, each batch transcription was processed in its own isolated Kubernetes pod and only required CPU hardware. We could fit multiple transcribers onto a CPU node, and each transcriber would get its own dedicated resources without impacting other running jobs.

With the move to GPU, we now use Nvidia’s Triton Server, and transcription jobs run against a shared inference server. While this is much more performant than running on CPU, we lose the isolation we had before, meaning multiple transcription jobs share GPU resources.

An overloaded Triton server handling a large number of requests or complex audio will start to impact RTFs across all of our transcription jobs. To avoid this, we need to ensure that we have enough GPU capacity to process whatever jobs come through. With relatively unpredictable traffic, it can get expensive to run GPUs at maximum capacity constantly. Instead, we need to scale our GPUs based on demand.

GPU Autoscaling on the Right Metrics

Kubernetes horizontal autoscaling is typically based on metrics such as CPU and memory, but these values were not going to paint a full picture of our transcription performance. Before rolling out to production, we ran numerous load tests against Triton to determine where a single server would begin to fail and to identify any measurable symptoms that would allow us to respond effectively.

We could not simply rely on the previously identified metrics of RTF and job rate. RTF is only known once a job has already been processed, so it would signal a problem too late, and we found early on that Triton’s performance can be affected by various factors other than job rate. For instance, more complex audio at the usual job rate can still cause a bottleneck. Additionally, neither of these metrics is strictly indicative of a struggling Triton server, so we needed to find a metric from Triton itself that pointed to performance issues likely to result in an increase in RTF.

Triton exposes a /metrics endpoint which provides information about the number of inferences executed, inference duration and much more. A full list of metrics can be found in the Triton documentation.
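The endpoint serves metrics in the Prometheus text format. As a minimal sketch of what a scraper does with that output, the snippet below parses a small hand-written sample. The sample values and the "enhanced" model label are illustrative assumptions rather than real output from our servers, and the parser only handles the simple cases shown here.

```python
import re

# Matches sample lines such as: metric_name{label="value",...} 123
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from Prometheus text output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Hand-written sample in the Prometheus text format (values are made up).
SAMPLE_TEXT = """
# TYPE nv_inference_count counter
nv_inference_count{model="enhanced",version="1"} 1200
# TYPE nv_inference_queue_duration_us counter
nv_inference_queue_duration_us{model="enhanced",version="1"} 4800000
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```

In practice a Prometheus server does this scraping for you; the sketch is only meant to show the shape of the data that later queries operate on.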

As Triton receives inference requests, they are placed into a queue. Triton forms a batch of inference requests from the queue and then uses the GPU to process them in parallel. When a Triton server is put under load, we can start to see the relationship between the size of the batches and the inference queue duration.

These metrics are a much closer representation of Triton’s performance and we can clearly see where our Enhanced model can start to get backed up with inference requests. So how do we tell Kubernetes that we need more GPUs if we start to see these spikes occur?

KEDA + Prometheus

To provide more flexibility and functionality in Kubernetes autoscaling, we looked into Kubernetes Event-driven Autoscaling (KEDA), an open-source tool that translates various data sources into metrics that Kubernetes autoscaling can understand. KEDA’s Prometheus integration allows us to scale on metrics provided by Triton Server’s /metrics endpoint as it is scraped by a Prometheus server. We can use PromQL to define the metrics that we wish to scale on, and KEDA will translate them into a Horizontal Pod Autoscaler (HPA) that Kubernetes understands.

We could have used Prometheus Adapter to parse these metrics, but KEDA has the flexibility of reading metrics from various data sources, which has now come in handy for scaling out other microservices in our Batch SaaS which rely on PostgreSQL.

We deployed the KEDA Helm chart with its CustomResourceDefinitions (CRDs) onto each of our clusters. Each Triton server deployment then has its own ScaledObject to tell KEDA where to get the metrics from and which query to scale on.
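To give a feel for the shape of such a resource, here is a hedged sketch that builds a ScaledObject manifest with a Prometheus trigger and prints it as JSON, which kubectl accepts alongside YAML. It is not our production manifest: the resource names, Prometheus address, query, threshold and replica bounds are all hypothetical placeholders.

```python
import json


def build_scaled_object(name, deployment, server_address, query, threshold,
                        min_replicas=1, max_replicas=4):
    """Build a KEDA ScaledObject manifest (as a dict) with one Prometheus trigger."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"name": deployment},  # the Deployment to scale
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "triggers": [
                {
                    "type": "prometheus",
                    "metadata": {
                        "serverAddress": server_address,
                        "query": query,
                        "threshold": str(threshold),  # KEDA expects a string
                    },
                }
            ],
        },
    }


# All values below are illustrative assumptions, not taken from a real cluster.
EXAMPLE_QUERY = (
    'max(rate(nv_inference_queue_duration_us{model="enhanced"}[1m])'
    ' / rate(nv_inference_count{model="enhanced"}[1m]))'
)
scaled_object = build_scaled_object(
    name="triton-enhanced",
    deployment="triton-enhanced",
    server_address="http://prometheus.monitoring.svc:9090",
    query=EXAMPLE_QUERY,
    threshold=5000,
)
print(json.dumps(scaled_object, indent=2))
```

Keeping one ScaledObject per Triton deployment means each model tier can carry its own query and threshold.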

The ScaledObject query uses Triton’s nv_inference_queue_duration_us and nv_inference_count metrics to infer the maximum amount of time an inference request spent in the queue, taken across all of the GPU nodes in the cluster. Additionally, it filters specifically for the Enhanced models so we can scale our Standard and Enhanced models independently depending on traffic load.
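Both metrics are cumulative counters, so one plausible way to turn them into a queue time is to divide the increase in queue duration by the increase in inference count over a window, then take the maximum across servers. The sketch below shows that arithmetic in plain Python. It is an assumption about the general approach rather than a copy of the actual query, and the server names and counter values are made up.

```python
def avg_queue_us(prev, curr):
    """Average queue time per inference between two counter samples.

    prev and curr are (queue_duration_us, inference_count) tuples.
    """
    delta_queue = curr[0] - prev[0]
    delta_count = curr[1] - prev[1]
    if delta_count <= 0:  # no new inferences in the window
        return 0.0
    return delta_queue / delta_count


def max_avg_queue_us(windows):
    """Return the worst per-server average queue time, in microseconds."""
    return max((avg_queue_us(p, c) for p, c in windows.values()), default=0.0)


# Hypothetical counter samples taken at the start and end of a window.
windows = {
    "triton-0": ((1_000_000, 500), (1_600_000, 700)),    # 600,000 us / 200
    "triton-1": ((2_000_000, 900), (4_000_000, 1_100)),  # 2,000,000 us / 200
}
print(max_avg_queue_us(windows))  # prints 10000.0
```

Taking the maximum rather than the mean means a single saturated server is enough to trigger a scale-up.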

Once applied, this creates Kubernetes HPAs under the hood.
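For intuition about what the HPA then does with the metric, the Kubernetes documentation describes its core calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies that formula with replica bounds. It ignores the tolerance band and stabilisation windows that the real controller applies, and the numbers are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Apply the documented core HPA formula, clamped to the replica bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical: 2 Triton pods, queue time of 9000 us against a 5000 us target.
print(desired_replicas(2, 9000, 5000))  # prints 4
```

The further the metric rises above its target, the more replicas are requested in a single step, up to the configured maximum.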

Additional Scaling Issues

While the elasticity of scaling is great, there were still some limitations with our initial scaling implementation.

We use a full GPU to run our transcription models and the Kubernetes nodes only come with 1 GPU, so scaling up Triton also means adding a new node every time. On top of this, every node will need to re-pull the Triton container image before it can start processing requests.

While testing, we found that, on average, it takes around 5-7 minutes for a new GPU node to become available in the cluster and up to 2 minutes to pull the Triton image. We can see the response time of autoscaling when Triton is under stress: the pods come online almost 10 minutes after the peak of the Triton queue duration!

This slow scale-up time would leave Triton struggling for far too long and is of little use for a shorter burst in traffic, so we needed a way to speed up the scale-up.

AKS provides a feature that allows you to set the node pool's scale-down mode to Deallocate, which does not fully destroy a VM when it scales down. Instead, it simply stops the VM and keeps the disk ready for when the node is needed again in future. This way, the node comes online much faster, and the image is already available on disk.

From testing, we can see the difference in the scale-up time. The charts below compare the time a Triton pod spent in a pending state before it was ready to receive inference requests.

The left column is from our original autoscaling implementation. Out of the 25 minutes that the pod is alive, it spends 40% of that time (about 10 minutes) in pending, waiting for a node to be available. The charts in the right column show our scale-up after applying deallocate on scale-down. Here the pod spends just 12.5% of the 24 minutes that it was up (about 3 minutes) in pending, which more than halves our scale-up time!
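The percentages above translate into absolute pending times as follows. This is simple arithmetic on the figures quoted in the text, and the helper name is ours for illustration.

```python
def pending_minutes(lifetime_minutes, pending_fraction):
    """Minutes a pod spent pending, given its lifetime and pending fraction."""
    return lifetime_minutes * pending_fraction


before = pending_minutes(25, 0.40)   # original autoscaling implementation
after = pending_minutes(24, 0.125)   # with deallocate on scale-down
reduction = 1 - after / before

print(f"before: {before:.1f} min, after: {after:.1f} min, "
      f"reduction: {reduction:.0%}")
# prints "before: 10.0 min, after: 3.0 min, reduction: 70%"
```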


Running both transcription and translation models on GPUs can become very costly. Many businesses will look to autoscaling to help with cost efficiency. While this is a crucial feature for the long term, it is important to consider the accuracy of the autoscaling implementation to ensure reliability is maintained under more stressful traffic loads. Autoscaling with Kubernetes has been around for a while now, but as the complexity of applications increases, there are many more factors to consider with regard to when and how we scale. Tools like KEDA are incredibly useful for providing more accurate autoscaling, and features like deallocating nodes on scale-down can help speed up scaling for larger applications.

Author: Adam Walford
Acknowledgements: Alex Wicks & Owen O'Loan
