How Do We Measure Performance?
Speechmatics has recently moved to running its new transcription models on GPUs. The new models are significantly more accurate, and the new hardware running them makes them much faster. To ensure our batch transcriptions are performing efficiently, we use the Real Time Factor (RTF), a ratio that measures how quickly a file is transcribed relative to the duration of the audio itself.
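As a concrete illustration, the sketch below computes RTF using the common convention of processing time divided by audio duration, so lower values mean faster-than-real-time transcription. The function name and example figures are illustrative, not taken from our pipeline.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF as the ratio of transcription time to audio duration (lower is faster)."""
    return processing_seconds / audio_seconds

# e.g. a 60-minute file transcribed in 6 minutes gives an RTF of 0.1
print(real_time_factor(6 * 60, 60 * 60))  # 0.1
```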
During the development and testing of our transcription and translation models, we used the RTF value as a guide to how efficient our new GPU models were, whether that meant speeding up transcription or running significantly larger models that deliver better accuracy at the same RTF.
While this is great, running on GPU hardware can get costly, and we want to guarantee this improved performance whilst also remaining cost-efficient.
The Challenge
Previously, batch transcriptions were processed in their own isolated Kubernetes pods and only required CPU hardware. We could fit multiple transcribers onto a CPU node, and each transcriber got its own dedicated resources without impacting other running jobs.
With the move to GPU, we now leverage NVIDIA's Triton Inference Server, and transcription jobs run against a shared inference server. While this is much more performant than running on CPU, we lose that isolation: multiple transcription jobs now share GPU resources.
An overloaded Triton server handling a large number of requests or complex audio will start to impact RTFs across all of our transcription jobs. To avoid this, we need to ensure we have enough GPU capacity to process whatever jobs come through. With relatively unpredictable traffic, running GPUs at maximum capacity constantly gets expensive; instead, we need to scale our GPUs based on demand.
GPU Autoscaling on the Right Metrics
Kubernetes horizontal autoscaling is typically based on metrics such as CPU and memory, but these values would not paint a full picture of our transcription performance. Before rolling out to production, we ran numerous load tests against Triton to determine where a single server would begin to fail and to identify any measurable symptoms that would allow us to respond effectively.
We could not simply rely on the previously identified metrics of RTF and job rate. RTF would be too late a signal, as the jobs would have already been processed, and we found early on that Triton's performance can be affected by factors other than job rate; for instance, more complex audio at the usual job rate can still cause a bottleneck. Additionally, neither of these metrics is strictly indicative of a struggling Triton server, so we needed a metric from Triton itself that flagged performance issues likely to result in an increased RTF.
Triton exposes a /metrics endpoint, which provides information about the number of inferences executed, inference durations and much more. A full list of metrics can be found here.
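As a rough sketch of what consuming that endpoint looks like, the snippet below scrapes the Prometheus-format output and sums a named counter across its model labels. The counter names nv_inference_exec_count and nv_inference_queue_duration_us come from Triton's metrics documentation; the host, port (Triton serves metrics on 8002 by default) and parsing approach are assumptions, not our production setup.

```python
import requests

# Triton serves Prometheus-format metrics on port 8002 by default;
# the host below is a placeholder for your own deployment.
TRITON_METRICS_URL = "http://localhost:8002/metrics"

def scrape_counter(name: str) -> float:
    """Sum a Triton counter across all of its model/version label sets."""
    text = requests.get(TRITON_METRICS_URL, timeout=5).text
    total = 0.0
    for line in text.splitlines():
        # Skip HELP/TYPE comments; metric lines look like: name{labels} value
        if not line.startswith("#") and line.startswith(name):
            total += float(line.rsplit(" ", 1)[-1])
    return total

print(scrape_counter("nv_inference_exec_count"))         # batches executed
print(scrape_counter("nv_inference_queue_duration_us"))  # cumulative queue time (µs)
```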
As Triton receives inference requests, they are placed into a queue. Triton forms a batch of inference requests from the queue and then uses the GPU to process them in parallel. When a Triton server is put under load, we can start to see the relationship between the size of these batches and the inference queue duration.
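One way to surface that relationship is to turn the cumulative counters into averages over a sampling window: nv_inference_count / nv_inference_exec_count gives the average batch size, and nv_inference_queue_duration_us / nv_inference_request_success gives the average time a request spends queued. These counter names are from Triton's metrics documentation; the sampling helper below is a sketch that reuses the hypothetical scrape_counter function from the previous snippet, not our production monitoring code.

```python
import time

def sample_batching_stats(interval_s: float = 30.0) -> dict:
    """Average batch size and per-request queue time over a sampling window."""
    names = [
        "nv_inference_count",              # individual inferences executed
        "nv_inference_exec_count",         # batch executions
        "nv_inference_request_success",    # successful inference requests
        "nv_inference_queue_duration_us",  # cumulative time requests spent queued
    ]
    before = {n: scrape_counter(n) for n in names}
    time.sleep(interval_s)
    after = {n: scrape_counter(n) for n in names}

    delta = {n: after[n] - before[n] for n in names}
    batches = delta["nv_inference_exec_count"] or 1    # avoid dividing by zero when idle
    requests_ = delta["nv_inference_request_success"] or 1
    return {
        # Rises as Triton packs more requests into each batch under load.
        "avg_batch_size": delta["nv_inference_count"] / batches,
        # Rises when requests sit in the queue waiting for GPU time.
        "avg_queue_ms_per_request": delta["nv_inference_queue_duration_us"] / requests_ / 1000,
    }

print(sample_batching_stats())
```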