
Speechmatics retained its position as Adobe Premiere's speech-to-text engine by outperforming OpenAI's free Whisper model on the same hardware
On consumer devices you can't throw hardware at the problem: we had to compress a cloud-grade model to run frugally on a laptop
Quantization was the key lever, but making the full optimization chain cooperate around it took as much work as the quantization itself
In Part One, "The Adobe Story", we looked at the challenge of getting STT working to Adobe's high standards on macOS and Windows, and pushing the boundaries of laptop GPU frameworks to support audio inference acceleration.
In this part, we look at the curve-ball thrown by OpenAI's launch of Whisper as an open-weights model that anyone can use "for free".
Starting with a top-tier cloud speech model that is more accurate than Whisper, the challenge this time was compressing the model to run frugally on laptop-class hardware.
What I'll cover in this post:
Why quantization was the central technique for bringing a cloud-grade speech model onto consumer hardware
Where the LLM-era quantization schemes transfer to other models like speech, and the parts of the surrounding optimization chain that make the work hard
The four things we did to ship
How our STT compares on a Dell XPS 16 and M1 MacBook Pro with the best Whisper apps
All figures below come from running a file with 1,038 seconds of clean audio on:
Dell XPS 16 9640 — Intel Core Ultra 7 155H, integrated Intel Arc GPU, and discrete NVIDIA RTX 4050 GPU
2020 M1 Apple MacBook Pro — integrated GPU and Neural Engine

| Configuration | Runtime | Throughput | RTF | RAM | GPU memory |
|---|---|---|---|---|---|
| ESL on CPU, 1 ORT thread | 281s | 3.7 s/s | 0.271 | 0.94 GB | — |
| ESL on CPU, 4 ORT threads | 185s | 5.6 s/s | 0.178 | 1.1 GB | — |
| ESL on Intel Arc GPU, single-threaded | 92s | 11.3 s/s | 0.0886 | 1.1 GB | 0.3 GB |
| ESL on RTX 4050, single-threaded | 64s | 16.2 s/s | 0.0617 | 1.3 GB | 0.4 GB |
| ESL on RTX 4050, multi-threaded | 41s | 25.3 s/s | 0.0395 | 1.3 GB | 0.4 GB |

For comparison on the same machine:

| Configuration | Runtime | Throughput | RTF | RAM | GPU memory |
|---|---|---|---|---|---|
| EasyWhisperUI, Whisper medium.en, CPU | ~1500s | 0.7 s/s | 1.65 | 2.7 GB | — |
| EasyWhisperUI, Whisper medium.en, Intel Arc | 253s | 4.1 s/s | 0.244 | 2.9 GB | — |
| EasyWhisperUI, Whisper medium.en, RTX 4050 (4 threads) | 106s | 9.8 s/s | 0.102 | 2.8 GB | 2.1 GB |
| EasyWhisperUI, Whisper large-v3-turbo, RTX 4050 (4 threads) | 79s | 13.1 s/s | 0.0761 | 2.5 GB | 1.9 GB |
| OfflineTranscribe (whisper.cpp), large-turbo, RTX 4050 | 47s | — | — | 1.9 GB | 1.3 GB |
| OfflineTranscribe (faster-whisper), large-turbo, CPU | 824s | 1.3 s/s | 0.794 | 4.4 GB | — |

Note: for the OfflineTranscribe (whisper.cpp) row, only runtime and memory are available; s/s and RTF are not.
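For readers who want to check the tables, the throughput and RTF columns follow directly from the runtime and the 1,038-second test file. The short sketch below shows that arithmetic; the function names are ours, chosen for illustration only.

```python
AUDIO_SECONDS = 1038  # duration of the test file quoted above


def rtf(runtime_s, audio_s=AUDIO_SECONDS):
    """Real-time factor: processing seconds per second of audio (lower is faster)."""
    return runtime_s / audio_s


def throughput(runtime_s, audio_s=AUDIO_SECONDS):
    """Seconds of audio transcribed per wall-clock second (higher is faster)."""
    return audio_s / runtime_s


# Two rows from the ESL table: (label, runtime in seconds)
for label, runtime in [("ESL, CPU, 1 ORT thread", 281),
                       ("ESL, RTX 4050, multi-threaded", 41)]:
    print(f"{label}: RTF {rtf(runtime):.3g}, {throughput(runtime):.1f} s/s")
```

An RTF below 1.0 means the audio is transcribed faster than it plays back.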

The RTF numbers are worth reading twice.
ESL on a single CPU thread transcribes clean audio at RTF 0.271, well under real-time with no GPU acceleration. With multi-threaded RTX 4050, ESL hits RTF 0.0395 in just 1.7 GB of total memory (1.3 GB RAM + 0.4 GB GPU).
The closest Whisper build on runtime is the large-turbo whisper.cpp build on RTX 4050 (47s vs ESL's 41s), and it uses 3.2 GB of total memory: nearly double ESL's footprint. On every like-for-like compute target, ESL is faster than Whisper medium.en:
CPU 4-thread: 5.6 s/s vs 0.7 s/s
Intel Arc: 11.3 s/s vs 4.1 s/s
RTX 4050 multi-threaded: 25.3 s/s vs 9.8 s/s
One caveat: Adobe does not enable the full set of speed optimizations available in our library. Premiere has to allow for dozens or hundreds of concurrent tasks on a min-spec machine, so the deployment leaves headroom for the rest of the application. With the full set of optimizations enabled, the library goes faster still.
In 2022, OpenAI's release of Whisper transformed the STT world. Here was an open-weights model anyone could use, trained on 680,000 hours of multilingual audio, ready to transcribe in 99 languages. Architecturally it was a single end-to-end model that could run almost entirely on GPU.
Following its release, multiple organisations pitched in to improve speed and memory usage, and dozens of products sprouted up to offer more usable integrations for different use cases.
By the time we started the 2025 rebuild of our Adobe Premiere integration, apps such as MacWhisper, EasyWhisperUI and OfflineTranscribe (powered by WhisperKit, whisper.cpp and faster-whisper) had set a new bar for on-device accuracy, speed and footprint.
We had been Adobe's speech-to-text partner since 2021. To keep our partnership going, the 2025 brief was specific: bring the latest generation of our cloud models onto consumer laptops to outperform Whisper on accuracy, and match or beat it on speed, while staying inside Adobe's resource budget so that Premiere has headroom to smoothly run the dozens of other tasks it handles even on minimum-spec hardware.
"Adobe's global creator community speaks hundreds of languages and dialects. Since 2021, our partnership has focused on making sure speech technology works for everyone - whether you're editing in Scottish English, Mexican Spanish, or Cantonese. Today, millions of users can benefit from accurate transcription that works anywhere - on-device for privacy, and in the cloud for scale - without compromising performance.
"As Adobe builds toward LLM-powered creative workflows, having a speech foundation that truly understands diverse voices becomes even more critical. We're proud to be part of that future."
Katy Wigdahl, CEO, Speechmatics
Like Whisper, we have used the transformer model architecture as the foundation of our STT cloud products since 2022. Unlike Whisper, our acoustic foundation model is trained with self-supervision, so we can pretrain with millions of hours of audio-only data that has no transcript. This gives us on the order of 100x data efficiency when training using audio with matching transcripts.
This data scale and efficiency advantage, combined with model size increases allowed by GPU inference, has enabled us to comfortably beat Whisper with our cloud service accuracy. Those are the models we now needed on a laptop.
The GenAI explosion started by ChatGPT has been driving a huge industry-wide investment in optimizing transformers to fit bigger models on the same servers, and to fit existing models on smaller devices. Quantization has become a central technique in that body of work as it helps with both cases.
Applied effectively, it can reduce the memory a model needs by a factor of eight while sacrificing an amount of accuracy that is barely perceptible, with the memory saving often resulting in a speed-up as well. Applied badly, the results are catastrophic: horribly wrong output or even complete inference failure.
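To make the factor of eight concrete, here is a back-of-envelope sketch. It assumes a float32 baseline and 4-bit weights, which is one common way to arrive at 8x; the parameter count, block size and scale width are hypothetical values picked for illustration, not figures from our models.

```python
def weight_bytes(num_params, bits_per_weight):
    """Storage needed for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


def blockwise_bits(bits, block_size, scale_bits=16):
    """Effective bits per weight when every block also stores one scale value."""
    return bits + scale_bits / block_size


params = 1_000_000_000                      # hypothetical model size
fp32 = weight_bytes(params, 32)
int4 = weight_bytes(params, 4)
int4_blocked = weight_bytes(params, blockwise_bits(4, block_size=32))

print(f"float32: {fp32 / 1e9:.2f} GB")
print(f"4-bit:   {int4 / 1e9:.2f} GB ({fp32 / int4:.1f}x smaller)")
print(f"4-bit + per-block scales: {int4_blocked / 1e9:.2f} GB "
      f"({fp32 / int4_blocked:.1f}x smaller)")
```

The last line is a reminder that block-wise schemes carry a small overhead for their scales, so the real saving lands slightly below the headline ratio.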
For us, it quickly became clear that quantization was the lever that would get a cloud-grade model to fit on consumer hardware at accuracy close to the cloud version.
The real trick, though, was getting everything else in the optimization chain to behave so that quantization actually delivered its full potential.
We set out to apply the latest generation of quantization schemes developed for LLMs to speech processing. On the face of it, this sounds easy: just put the model through one of the various popular tools and out will pop a perfectly formed gem!
The reality was not so simple. Yes, the weights were easily quantized to make a small model, but wrangling the surrounding optimization chain so as not to break the model is where the real time went.
Three specifics worth calling out.
Model optimization, particularly for GPUs and other hardware accelerators, is a chain of steps applied at different points by different components:
Export from PyTorch to an intermediate portable format (ONNX)
Generic optimizations in ONNX such as quantization, shape inference and operator fusion
Hardware-specific optimization steps such as memory layout and kernel choices (CoreML on macOS; DirectML via ONNX Runtime on Windows)
Each step is written to do the best job it can, and the companies involved are motivated to show off good results. In practice, this means ensuring that popular reference models like ResNet-50 or DistilBERT work really well in benchmarks.
The problem is that most optimizations work by recognising specific patterns in the model's inference graph in order to replace small groups of operations with a single fused operation that is mathematically equivalent but optimized for the target framework and hardware. Unfortunately, specific patterns may only appear exactly for popular reference models, making the pattern-matching logic brittle and increasing the risk of errors.
This brittleness compounds as the inference graph is rewritten by the various optimization steps. Changes to one step risk confusing later steps and preventing optimizations from being applied, or from being applied correctly.
Even seemingly small changes in a model's definition, such as a subtle tweak to the transformer layers to work better with audio, can make the final inference graph look surprisingly different. This is a real problem since some of the final optimization stages happen inside the GPU driver where only a tiny bit of the graph is visible at any one moment.
This risk is not theoretical. We found that some of the most important optimizations would be silently skipped, while others could actively break the model due to bugs in the tooling or simply because of the inherently risky nature of some optimization techniques.
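A toy example shows how easily a fusion can be silently skipped. The sketch below is not any real optimizer: the graph is just a list of operator names, and the pattern and fused operator name are hypothetical. It only illustrates why an exact-match pattern stops firing when an earlier step inserts one extra node, and why it is worth asserting on the result.

```python
PATTERN = ["MatMul", "Add", "Gelu"]   # hypothetical pattern a fusion pass looks for
FUSED = "FusedMatMulAddGelu"          # hypothetical fused operator name


def fuse(ops):
    """Replace every exact occurrence of PATTERN with a single fused op."""
    out, i = [], 0
    while i < len(ops):
        if ops[i:i + len(PATTERN)] == PATTERN:
            out.append(FUSED)
            i += len(PATTERN)
        else:
            out.append(ops[i])
            i += 1
    return out


def count_fused(ops):
    """How many fused ops ended up in the optimized graph."""
    return fuse(ops).count(FUSED)


reference = ["MatMul", "Add", "Gelu", "MatMul", "Add"]
# Same computation, but an earlier step inserted a Cast between Add and Gelu.
variant = ["MatMul", "Add", "Cast", "Gelu", "MatMul", "Add"]

print(fuse(reference))   # fusion applied
print(fuse(variant))     # graph unchanged, and no error is raised

# Checking the count after optimization turns a silent skip into a visible failure.
for name, graph in [("reference", reference), ("variant", variant)]:
    status = "ok" if count_fused(graph) == 1 else "FUSION SKIPPED"
    print(name, status)
```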
It's worth noting that the experience here is very different to normal programming language compilers, where the optimizer is conservative about preserving semantics unless you explicitly give it permission to relax key assumptions. In ML optimization, it can feel like surprises are intended: popular tools change semantics by design and potentially quite drastically.
PyTorch V2 export using torch.compile is the clearest example: it deliberately only preserves the tensor operations it saw when tracing one reference input, and will discard conditional logic and ancillary calculations on the assumption that the only thing that matters is the sequence of tensor operations.
Using torch.compile is a bit like hiring a moving company and finding out they rebuilt a new house with everything fixed in the exact positions they saw during the tour. Doors glued shut, oven shelves welded in place, whole rooms missing because they were not part of the walk-through. Useful, until it is not!
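The moving-company effect is easy to reproduce with a toy tracer. The sketch below does not use PyTorch and is not how its export is implemented; `Traced`, `trace` and `replay` are names invented for this illustration. It records the arithmetic applied to one reference input and then replays that recording on a different input, where the original function would have taken another branch.

```python
class Traced:
    """A number that appends every arithmetic op applied to it onto a tape."""

    def __init__(self, value, tape):
        self.value, self.tape = value, tape

    def __mul__(self, k):
        self.tape.append(("mul", k))
        return Traced(self.value * k, self.tape)

    def __add__(self, k):
        self.tape.append(("add", k))
        return Traced(self.value + k, self.tape)


def model(x):
    # This Python-level branch is invisible to the tape: only the ops on
    # whichever path the reference input takes get recorded.
    if x.value >= 0:
        return x * 2
    return x * -1 + 10


def trace(fn, reference_input):
    """Run fn once on a reference input and return the recorded ops."""
    tape = []
    fn(Traced(reference_input, tape))
    return tape


def replay(tape, value):
    """Apply the recorded ops to a new input, ignoring the original branching."""
    for op, k in tape:
        value = value * k if op == "mul" else value + k
    return value


def eager(fn, value):
    """Run the original function directly, branches included."""
    return fn(Traced(value, [])).value


tape = trace(model, 3.0)
print("recorded ops:", tape)
print("input  3.0:", eager(model, 3.0), "vs replay", replay(tape, 3.0))
print("input -4.0:", eager(model, -4.0), "vs replay", replay(tape, -4.0))
```

The replay agrees with the original on inputs that follow the traced path and silently diverges on the rest, which is why exported graphs need testing on inputs unlike the reference one.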
Quantization works by shrinking the precision and range of the weights that encode the model's knowledge and thus determine its accuracy. In many respects, this is as risky as it sounds: every bit removed halves the number of representable values, sharply increasing the chance of overflow or underflow in individual operations. Either one can cause a complete collapse in accuracy.
Even if that is avoided, precision loss risks corrupting the calculations in subtle ways that might degrade accuracy, but only in certain situations.
Despite these risks, the technique is often workable in practice because many model layers end up with nicely balanced weights (if model training went well), meaning they have a small range which can be captured with fewer bits, enabling significant savings in memory and potentially in compute time.
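The difference between balanced weights and a tensor with one outlier can be seen in a few lines. This is a minimal sketch of symmetric per-tensor quantization with made-up weight values; production schemes are more elaborate, but the failure mode is the same: one large value stretches the scale, and everything small rounds to zero.

```python
def quantize_dequantize(weights, bits):
    """Symmetric per-tensor quantization: snap to a signed integer grid, then map back."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax     # one scale for the whole tensor
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Made-up example values, not taken from a real model.
balanced = [-0.30, -0.12, 0.05, 0.18, 0.27, -0.21, 0.09, 0.30]
with_outlier = balanced[:-1] + [12.0]               # one large weight

for name, weights in [("balanced", balanced), ("with outlier", with_outlier)]:
    restored = quantize_dequantize(weights, bits=4)
    print(f"{name}: mean abs error {mean_abs_error(weights, restored):.4f}")
    print("  restored:", [round(v, 3) for v in restored])
```

In the outlier case every small weight underflows to zero, which is the kind of collapse that block-wise scales or leaving a layer in higher precision are meant to prevent.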
Tiny errors at each layer injected by precision reduction can be mitigated during training with quantization-aware training techniques. New techniques that do a better job are frequently being developed, and advanced block data types have been introduced with hardware support to try to keep slightly more precision at equivalent compression.
No wonder: the stakes are high, as quantization with minimal accuracy loss is a potent lever for improving the economics of AI model deployment.
For the AI model developer, this doesn't happen for free. It is essential to:
Identify every section of the model where precision and range loss cannot be tolerated, and instruct the quantization tool to leave those alone
Consider the cumulative effect across many layers: even where per-layer error is individually survivable, it can push final accuracy past the edge of acceptable
Ensure the quantization changes to the inference graph don't trip up the other optimization steps
In our experience, this process is highly specific to the model and framework involved, and takes real time because of the iterations and debugging needed to reach a good place. You need engineers who deeply understand the model's architecture and can dig in to inspect layers and ops, check the inference graph for correctness and optimality, and run good tests that will spot subtle changes in the distribution of model predictions. You may need to write scripts that systematically compare the unquantized and quantized activations layer-by-layer to spot where errors creep in.
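As a rough idea of what such a comparison script can look like, the sketch below scores each layer by signal-to-quantization-noise ratio (SQNR) and reports the first layer that falls below a threshold. It assumes the activations have already been captured as plain lists of floats keyed by layer name; the layer names, values and the 20 dB threshold are all hypothetical.

```python
import math


def sqnr_db(reference, candidate):
    """Signal-to-quantization-noise ratio in dB; higher means closer agreement."""
    signal = sum(r * r for r in reference)
    noise = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    if noise == 0:
        return math.inf
    if signal == 0:
        return -math.inf
    return 10 * math.log10(signal / noise)


def first_divergent_layer(ref_acts, quant_acts, min_sqnr_db=20.0):
    """Walk layers in execution order; return (name, score) for the first bad one."""
    for name, ref in ref_acts.items():          # dicts keep insertion order
        score = sqnr_db(ref, quant_acts[name])
        if score < min_sqnr_db:
            return name, score
    return None


# Hypothetical activations captured from the float and quantized models.
float_acts = {
    "encoder.0": [1.0, -2.0, 0.5],
    "encoder.1": [0.2, 0.4, -0.1],
    "encoder.2": [3.0, 1.0, -1.0],
}
quant_acts = {
    "encoder.0": [1.01, -1.98, 0.5],
    "encoder.1": [0.3, 0.1, -0.4],
    "encoder.2": [2.0, 2.0, 0.0],
}

print(first_divergent_layer(float_acts, quant_acts))
```

Reporting the first divergent layer matters because errors propagate: every layer after the culprit will usually look bad too.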

There were four key decisions we took that are worth exploring here.
The state of the art for ML optimization and quantization is constantly evolving as developers and companies scramble to find an edge. There is no single approach that works universally or every time.
In our case, Apple and Windows devices and frameworks are different enough that we needed separate efforts to do each one justice. We spent several months optimizing our models for CoreML in order to efficiently leverage both the GPU and the Neural Engine on Apple devices. WhisperKit had already achieved excellent results by applying insights from Apple's ANE Transformer paper, which we were able to replicate.
It took a similar amount of effort to do the equivalent for Windows PCs, this time relying on DirectML and the ONNX Runtime to support any GPU with appropriate Windows drivers. Both framework choices had been stipulated by Adobe to ensure the best possible compatibility and reliability, based on their years of experience running ML models on millions of consumer devices.
Early on, we decided to quantize our cloud service models rather than train models with fewer parameters and apply basic float16 quantization. Smaller models would struggle to achieve similar accuracy, even using distillation to transfer knowledge from our biggest models.
We investigated the latest quantization schemes developed for on-device LLMs like Llama, Gemma, Phi, Qwen and GPT-OSS as the starting point. There is a bewildering array of tools offering different techniques for quantization (often specific to particular inference frameworks), making it tricky to figure out which one was most appropriate for our models.
In the end, we used 6-bit palettization in CoreML for macOS, while on Windows we used INT4 weight-only quantization in DirectML/ORT for decent compression with good hardware support.
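Palettization may be less familiar than integer quantization, so here is the idea in miniature: cluster the weights into a small palette (at most 64 entries for 6 bits) and store only a palette index per weight. The sketch uses a plain 1-D k-means on random stand-in weights; it is an illustration of the concept, not the CoreML tooling or the procedure we used.

```python
import random


def palettize(weights, bits, iters=10):
    """Cluster weights into at most 2**bits centroids; return (palette, indices)."""
    ordered = sorted(set(weights))
    k = min(2 ** bits, len(ordered))
    # Start centroids at evenly spaced positions in the sorted unique values.
    palette = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]

    def nearest(w):
        return min(range(k), key=lambda j: abs(w - palette[j]))

    for _ in range(iters):
        indices = [nearest(w) for w in weights]
        for j in range(k):
            members = [w for w, idx in zip(weights, indices) if idx == j]
            if members:                      # empty clusters keep their old centroid
                palette[j] = sum(members) / len(members)
    return palette, [nearest(w) for w in weights]


def mean_error(weights, bits):
    palette, indices = palettize(weights, bits)
    return sum(abs(w - palette[i]) for w, i in zip(weights, indices)) / len(weights)


random.seed(0)
weights = [random.gauss(0.0, 0.1) for _ in range(500)]   # stand-in weights

palette, indices = palettize(weights, bits=6)
err_2bit = mean_error(weights, 2)
err_6bit = mean_error(weights, 6)
print(f"palette entries: {len(palette)}")
print(f"mean abs error, 2-bit palette: {err_2bit:.4f}")
print(f"mean abs error, 6-bit palette: {err_6bit:.4f}")
```

Each weight now costs 6 bits plus a share of the lookup table, and the palette adapts to wherever the weights are actually concentrated rather than using a uniform grid.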
The essential step in quantization is finding and isolating the sections of the model where a loss of precision or range cannot be tolerated, and telling the quantization tool not to modify those layers.
Even where the reduction of precision and range can be tolerated for an individual group of operations, the cumulative effect can unacceptably degrade the accuracy of the model's final output. The work is very model-specific and can be quite intensive and time-consuming to get to good compression with minimal accuracy degradation.
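One simple way to reason about per-layer sensitivity and cumulative error together is a greedy budget: keep the most sensitive layers in high precision until the estimated error from the remaining layers fits under a target. The sketch below is a heuristic for illustration, not our actual procedure. It assumes per-layer errors simply add, which real models do not guarantee, and the layer names and error values are hypothetical.

```python
def layers_to_skip(per_layer_error, budget):
    """Greedily exclude the worst layers from quantization until the summed error fits."""
    remaining = dict(per_layer_error)
    skipped = []
    while remaining and sum(remaining.values()) > budget:
        worst = max(remaining, key=remaining.get)
        skipped.append(worst)
        del remaining[worst]
    return skipped


# Hypothetical error contribution of quantizing each layer on its own.
errors = {
    "conv_frontend": 0.40,
    "encoder.0.attn": 0.05,
    "encoder.0.ffn": 0.03,
    "encoder.1.attn": 0.06,
    "encoder.1.ffn": 0.02,
    "output_proj": 0.30,
}

skip = layers_to_skip(errors, budget=0.20)
print("keep in high precision:", skip)
print("quantize:", [name for name in errors if name not in skip])
```

The resulting list is what would be passed to a quantization tool as its set of excluded nodes, and the estimate should always be confirmed by measuring end-to-end accuracy.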
To deal with the various quirks, bugs and limitations of the multi-vendor toolchains, we built our own export and optimization scripts to ensure that specific fused operations were always used in the final optimized graph, without leaving it to the mercy of the pattern-recognition process across the various optimization phases in different components.
These directed optimization steps were crucial for delivering models that reliably achieved the desired accuracy, speed and memory use on the diverse range of devices that matter to Adobe.
We accepted one trade-off explicitly: some very old, low-powered GPUs do not have the hardware support needed to run quantized models efficiently, so on those devices the new model is slower than the previous one, though still much more accurate.
In short, there was no way around it: we needed to master each tech stack at pretty much all levels and steps of the process to get these results.
To illustrate how much difference it can make to the model inference graph when optimization steps are controlled properly, here is a tiny section of our exported acoustic foundation model graph: as seen early in our development cycle, compared to a similar section after we had learned to control the export, quantization and optimization steps effectively.
Before. The inference graph we obtained from our original approach to ONNX export was frankly a mess: polluted with lots of low-level data manipulation operators and clumsy constructions to implement key elements like Multi-Head Attention (a core mechanism in most Transformer models) in terms of discrete operations.
In this snippet, the generic version of the MatMul operator is used to perform the heavy compute steps using float16 values (the simplest form of quantization that can be applied with little risk). The graph captures very little knowledge of data shapes, even though shape information is critical for picking the right MatMul kernels for best performance. As a result, data shapes have to be determined at runtime by scanning the entire graph when the input data is provided, and in some cases shape computations are left to be performed explicitly as part of inference.

After. The inference graph is now fully annotated with data shapes, since input shape sizes have been fixed, allowing all shape computations to be performed during export and optimization.
A recently added ONNX operator is now used for Multi-Head Attention, replacing the inelegant and inefficient discrete structure in the before case, enabling the GPU driver to invoke a fully optimized kernel that fuses the relevant operators.
Similarly, a quantized MatMulNBits operator is now invoked to compute directly using quantized weights, ensuring the full memory reduction effect is achieved with optimal compute speed by allowing the GPU driver to map to the best available GPU hardware capability. We ended up not being able to rely on the optimization chain to achieve this, and resorted to explicitly forcing this operator to be used when exporting from PyTorch.
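To show what computing directly on quantized weights means, the sketch below packs two 4-bit codes into each byte, stores one scale per block, and dequantizes on the fly inside a dot product. It is a pure-Python illustration of block-wise weight-only quantization in the spirit of such operators; the real operator's data layout and GPU kernels are not reproduced here, and the block size and values are made up.

```python
def quantize_blockwise(row, block_size=4):
    """Return (packed_bytes, scales): unsigned 4-bit codes with a zero point of 8."""
    codes, scales = [], []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        scale = max(abs(w) for w in block) / 7 or 1.0   # avoid a zero scale
        scales.append(scale)
        codes += [max(0, min(15, round(w / scale) + 8)) for w in block]
    # Two codes per byte: low nibble first, then high nibble (row length must be even).
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scales


def dot_quantized(x, packed, scales, block_size=4):
    """Dot product of activations x with a packed 4-bit row, dequantizing as it goes."""
    total = 0.0
    for i, xi in enumerate(x):
        byte = packed[i // 2]
        code = (byte >> 4) if i % 2 else (byte & 0x0F)
        total += xi * (code - 8) * scales[i // block_size]
    return total


row = [0.7, -0.35, 0.1, 0.0, 2.1, -1.4, 0.7, 0.3]     # made-up weights
x = [1.0] * len(row)                                   # made-up activations

packed, scales = quantize_blockwise(row)
exact = sum(w * xi for w, xi in zip(row, x))
approx = dot_quantized(x, packed, scales)
print(f"{len(row)} weights stored in {len(packed)} bytes plus {len(scales)} scales")
print(f"exact {exact:.3f} vs quantized {approx:.3f}")
```

Because each block has its own scale, the large weights in the second block do not flatten the small ones in the first, and the weights never need to be expanded back to float16 in memory.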

This much better optimized inference graph is mirrored by much better runtime GPU utilization with a corresponding reduction in CPU graph processing.
To help understand how well GPU inference is working, it is invaluable to use a profiling tool like Nsight Systems from NVIDIA. In our case we are relying on DirectX 12 via the DirectML API layer to access the GPU. Nsight Systems supports a drop-in DLL that surfaces DML operations as PIX markers, allowing drill-down into how long each individual operation takes during inference.
Of particular interest is the split between CPU and GPU processing for a single model inference request, as shown by the highlighted PIX marker rows.
Early unquantized model (default settings). From the trace we saw that inference processing of our primary GPU model was rather inefficient. The area circled in red near the bottom shows an extended period of CPU processing which happens on every inference request. Drilling into the details, this indicates a slow path is taken in DirectML where the inference graph is scanned and basic optimizations performed on every request (mostly shape inference, now that the input tensor shapes are known). The majority of the actual GPU execution takes place in a burst at the end, with a few tiny sections interleaved during the CPU section.

Incomplete optimization. This profiling run gives an insight into the problems we faced even towards the end of our Windows DirectML optimization work. At this point we had managed to get quantization working to good accuracy, solved problems with inference crashing, and even had okay inference speed. However, the Nsight trace shows there is still a significant inefficiency hampering our speed.
Two problems are immediately visible:
Red (bottom left): A significant phase of DML activity on CPU. DirectML was unable to build an optimized GPU execution plan when the model was loaded, so a graph scan is still happening on each inference. (This also happens for the secondary model.)
Green (middle): A phase of intermittent GPU and CPU activity towards the end of the main GPU inference phase. This is caused by certain graph operations falling back to CPU implementations, which run more slowly and require data to be copied back and forth between GPU and CPU. The PIX markers indicate these are tensor data manipulation operations, linked to where our model is preparing output tensors after the main inference steps.
After full optimization. This profiling run shows the performance of the final model after quantization and the full set of optimization steps implemented with the help of our custom scripts. At this point we had also recoded the PyTorch steps for building the output state tensor that was identified as inefficient in the previous trace.
In this case, the CPU processing phase circled in red is extremely quick: it is submitting a pre-prepared DirectML inference execution plan that was generated once when the model was first loaded. In the GPU PIX marker row circled in green, we see that DML is explicitly processing an execution plan made up of individual operations running back-to-back with no gaps. During this period, there is no further CPU activity (thread utilization drops to zero), indicating there were no cases where execution had to fall back to a CPU operator implementation.

Future potential. There is one final piece of the story which becomes clear when we zoom out from a single request to see a sequence of requests and the gaps between them.
This longer trace shows cycles of the primary and secondary model inference happening very efficiently on GPU (circled in green), but with short bursts of CPU activity between individual cycles, and longer bursts after a series of cycles (shown on the left side before the start of a new series).
The bigger bursts (circled in red) reflect work done in other areas of our STT engine: implementing sophisticated algorithms that take the core model predictions and build multiple transcript hypotheses, which are then scored using an additional language model and sifted to find the most likely sequence of words. To address this, we added a multi-threading option to our engine that allows the GPU inference cycles and CPU rescoring work to be overlapped on different threads, giving faster turnaround though with occasional periods where both threads are busy on CPU.
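The overlap described above is a classic two-stage pipeline. The sketch below shows the general shape using Python threads and a bounded queue; it is not our engine's code, and `fake_infer` and `fake_rescore` are stand-ins that just sleep to simulate GPU inference and CPU rescoring.

```python
import queue
import threading
import time

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(chunks, infer, rescore):
    """Run inference on one thread while the caller's thread rescores earlier results."""
    handoff = queue.Queue(maxsize=2)      # small buffer keeps memory bounded

    def producer():
        for chunk in chunks:
            handoff.put(infer(chunk))     # blocks if the consumer falls behind
        handoff.put(_DONE)

    worker = threading.Thread(target=producer)
    worker.start()
    results = []
    while (item := handoff.get()) is not _DONE:
        results.append(rescore(item))
    worker.join()
    return results


def fake_infer(chunk):
    time.sleep(0.01)                      # stand-in for a GPU inference cycle
    return chunk * 2


def fake_rescore(prediction):
    time.sleep(0.01)                      # stand-in for CPU hypothesis rescoring
    return prediction + 1


start = time.perf_counter()
transcript = run_pipeline(range(5), fake_infer, fake_rescore)
elapsed = time.perf_counter() - start
print(transcript, f"in {elapsed:.3f}s")
```

With the two stages overlapped, total time approaches that of the slower stage rather than the sum of both, while the queue preserves the order of the audio chunks.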
The smaller bursts of CPU activity (circled in turquoise) have an unexpected cause: these are memory copies performed by the CPU. It turns out that GPU inference of a 250k parameter quantized model can run so fast that it takes about as long for the CPU to copy the intermediate outputs from the primary model and provide them as input to the secondary model as it did for the GPU to process an entire audio chunk!
This inefficiency could be addressed in various ways: for instance, by using an advanced ONNX Runtime mechanism to keep the intermediate outputs on GPU, or by combining the models into a single model.

If you are shipping a non-LLM transformer model onto consumer hardware, five things from this project are worth internalising before you start.
Treat quantization as one stage in a chain. Export, quantization, generic graph optimization and hardware-specific optimization all have to cooperate for the quantization gain to land on the target hardware.
The LLM quantization schemes transfer; the defaults often do not. You can reuse the machinery from Llama, Gemma, Phi, Qwen and friends, but you may still need directed, model-specific work on top of it, as we did for our speech model.
Isolate the precision-sensitive sections first. Find the parts of the model where overflow or underflow would collapse accuracy and tell the quantization tool to leave them alone.
Be careful with tools that do not preserve semantics. torch.compile is a useful example: it is designed to preserve the tensor operations it saw on a reference input and can discard conditional logic and ancillary calculations in the process.
If you ship to non-datacentre hardware, expect to write custom export scripts. Most optimization tooling is pattern-matched against reference models. Audio-transformer architectures will not always match, and some of the most important optimizations silently fail to apply; others can break the model outright.

