
Winning the speech-to-text provider seat in Adobe Premiere is no mean feat; fending off the free OpenAI Whisper model to keep it is next level. In the cloud you can throw hardware at the problem, but on-device speech recognition running on millions of consumer devices around the world offers no such luxury.
Please indulge me: I’ve waited five years for the opportunity to tell the world about our achievements in becoming – and remaining – a core part of Adobe’s hugely popular video editing product, Premiere. There’s a lot of ground to cover.
In late 2020 Adobe was pressing ahead with an ambitious vision for transcript-driven video editing, one that promised to transform editors' productivity. They were already locked into a 2021 launch date with another STT vendor, who would deliver a library for in-app transcription on macOS and Windows.
Astonishingly, our chief sales maestro had persuaded them to give us a chance to impress at the 11th hour. They were satisfied our cloud product had the highest accuracy, but could we make it work on macOS and Windows and meet all of their integration requirements?
Answering that question convincingly took months of the toughest negotiations I’ve ever faced, with the immovable launch date getting ever closer. The negotiations were ultimately about two things: proving to Adobe that we could deliver on time, while ensuring we didn’t derail our own ambitious innovation plans.
Spoiler alert: we succeeded on both fronts. Fast-forward to 2025 and a new gauntlet was thrown down: beat Whisper to remain relevant. Those ambitious innovation plans from 2021 were the key to winning again.
This is how we did it the first time, and then did it again.
An astute and demanding customer
Shrinking a cloud product to run on laptops
Leveraging laptop GPUs
Whisper changes the game
Challenge accepted and achievement unlocked!
An astute and demanding customer
Adobe is famous for dominating multiple global businesses, reinventing its technology and products to stay relevant for more than 40 years: first as desktop publishing became a thing, then through the onslaught of successive Internet waves. Today a product like Premiere is as relevant as ever, as video content production has exploded. It has stayed that way by constantly pushing forward on technical capabilities, including the use of machine learning models for manipulating video, images, audio and text.
This makes the Adobe engineering team adept at understanding ML components and how to integrate dozens, perhaps hundreds, of ML models into a sophisticated program that is a workhorse for millions of creative industry professionals. Keeping Premiere responsive and fluid even on the minimum spec hardware is something that Adobe takes very seriously – so much so that their min-spec requirements are now permanently inscribed on my brain.
In a nutshell, Adobe needed our STT library to run using a fraction of the resources of a min-spec machine, with the ability to be instantly paused and resumed, while delivering best-in-class transcript accuracy: professional formatting of numbers, dates, addresses, brand names and more, with lip-sync accuracy for word timings. The aim was to minimize the amount of post-transcription editing work falling to the customer – a metric that Adobe monitors closely.
A critical technical requirement Adobe imposed at the outset was complete control over model loading and inference. With Premiere juggling dozens of models running background tasks, precise control is essential to free the GPU in milliseconds when the user presses play to see their latest edits.
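To make that concrete, here is a minimal sketch of the kind of contract this implies, assuming a hypothetical chunk-based engine API (all names here are illustrative, not our actual interface). Pausing simply stops new work being issued to the GPU, so it frees up within one chunk's worth of latency.

```python
import threading

class PausableTranscriber:
    """Illustrative wrapper around a hypothetical chunk-based STT engine."""

    def __init__(self, engine):
        self.engine = engine             # hypothetical engine object
        self._resume = threading.Event()
        self._resume.set()               # start in the running state

    def pause(self):
        # Called from the UI thread, e.g. when the user presses play.
        # No new GPU work is issued once the current chunk completes.
        self._resume.clear()

    def resume(self):
        self._resume.set()

    def transcribe(self, audio_chunks):
        for chunk in audio_chunks:
            self._resume.wait()          # parks here while paused
            yield self.engine.process_chunk(chunk)  # hypothetical call
```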
A final requirement was hardware compatibility – Premiere runs on millions of machines of many different vintages and is expected to work flawlessly. At this scale, you have to be prepared to deal with bugs even in the PC hardware itself!
Shrinking a cloud product to run on laptops
Our biggest technical challenge in 2021 was that we were starting with a Linux product designed to run on cloud servers; we needed to both port it to macOS and Windows and shrink its memory use by 75% (to 1GB instead of the 4GB available in cloud). Disk footprint was equally important; Adobe’s customers would see the download and storage cost of each language pack they selected.
Porting turned out to be the easy bit, as the core of the product was plain C++ code that recompiled easily on different platforms. Reducing the memory footprint was trickier, though achievable precisely because memory had simply never been a concern before, so there were savings waiting to be taken. In the end, about half a dozen component changes were needed to switch to memory-frugal approaches.
Re-exporting all of our models to run on the ONNX Runtime embedded in Premiere itself helped, and switching our engine internally to work in streaming mode (intended for realtime operation) avoided the big memory spikes seen in batch mode. Thankfully, our engine and models had been designed from the start to support both modes, and the accuracy differences were modest.
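As an illustration of the general workflow (with a toy model standing in for the real acoustic model; this doesn't mirror our actual export pipeline): export once with a dynamic time axis, then feed short chunks so peak memory is bounded by the chunk size rather than the file length. A real streaming engine also carries state between chunks, which this stateless toy omits.

```python
import numpy as np
import torch
import onnxruntime as ort

# Toy stand-in for an acoustic model; the real thing is far larger.
model = torch.nn.Sequential(torch.nn.Conv1d(80, 256, 3, padding=1), torch.nn.ReLU())
dummy = torch.randn(1, 80, 100)  # (batch, features, time)

# Export once, marking the time axis dynamic so chunks of any length work.
torch.onnx.export(model, dummy, "acoustic.onnx",
                  input_names=["feats"], output_names=["enc"],
                  dynamic_axes={"feats": {2: "time"}, "enc": {2: "time"}})

session = ort.InferenceSession("acoustic.onnx",
                               providers=["CPUExecutionProvider"])

# Streaming: run short chunks as audio arrives instead of one huge batch,
# so peak memory scales with the chunk, not the whole recording.
audio = np.random.randn(1, 80, 3000).astype(np.float32)
for chunk in np.split(audio, 30, axis=2):          # 30 chunks of 100 frames
    (enc,) = session.run(None, {"feats": chunk})
```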
To help minimize differences from our cloud server product, we incorporated most of the changes into the common code shared with the Linux products. It meant dealing with some painful consequences of having to export models to ONNX format, rather than just running our PyTorch models using TorchScript with all the convenience that offers.
Although it was a very intense experience we hope not to repeat, we were able to achieve an acceptable footprint (along with all of the other accuracy and formatting criteria) despite having a development phase of only four months from contract signature to final acceptance. Adobe told us later they were astonished that we managed to deliver on time; they had assumed we would overrun significantly.
Leveraging laptop GPUs
One part of Adobe's ideal requirements hadn't made the cut for initial delivery, but was still top of their wish list: the ability to use the GPU (or the Neural Engine on M-class Macs) for STT model inference. The expectation was that GPU inference would improve throughput and offload work from the CPU, making it easier for other parts of Premiere to get the CPU power they need to keep the application responsive and fluid.
We agreed to take on this second challenge, not appreciating at first just how difficult it would be. I remember vividly the first time I tried “just turning on the GPU option” in the ONNX Runtime, to discover the model ran significantly slower than on CPU (while using a ton of extra memory). On macOS, my recollection is that export to the CoreML framework didn’t work at all.
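For context, the "GPU option" in the ONNX Runtime amounts to requesting an execution provider when creating a session. A minimal illustration (provider names vary by platform, and CPU remains the fallback):

```python
import onnxruntime as ort

# Request a GPU execution provider, keeping CPU as the fallback.
# Typical names: "CUDAExecutionProvider", "DmlExecutionProvider"
# (DirectML on Windows) or "CoreMLExecutionProvider" (macOS).
session = ort.InferenceSession(
    "acoustic.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # which providers the session actually engaged
```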
In hindsight, I see now this was a foretaste of what was to come.
The details of eventually getting there are boring but instructive: it took about a year of chasing Adobe’s hardware and software partners to get various enhancements and bug fixes into the ML inference frameworks and GPU drivers that Adobe uses on both platforms. Even once the models were able to export and run without errors or crashes, it took weeks to tweak our models to avoid the dreaded condition of CPU fallback – where a step in the model won’t run on GPU and has to switch to CPU and back again, tanking performance.
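One generic way to spot that fallback with the ONNX Runtime (not necessarily what we did) is to turn on verbose logging, which reports each node's placement, and profiling, whose per-node trace exposes nodes stranded on the CPU provider:

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0    # verbose: logs which provider each node landed on
so.enable_profiling = True   # per-node timings reveal costly CPU<->GPU hops

session = ort.InferenceSession(
    "acoustic.onnx", sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# ... run some inference here ...
trace = session.end_profiling()  # JSON trace file: search it for CPU-placed nodes
print("profile written to", trace)
```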
The lesson: audio is not the same as images or text when it comes to ML models. A framework may work great for optimizing, say, image processing models (one of the earliest success stories for ML), but fail miserably for models that work a bit differently.
It turns out that the corollary to this lesson holds true even now: it takes hard work to get state-of-the-art results for STT models, seemingly on every different platform we try.
Stay tuned for part two: "Faster than Whisper: how Adobe Premiere's on-device speech engine got rebuilt".