Mar 28, 2023 | Read time 8 min

How to Accurately Time CUDA Kernels in PyTorch

In a world of increasingly costly machine learning model deployments, ensuring accurate GPU operation timing is key to resource optimization. In this blog post, we explore best practices to achieve this in PyTorch.
Lawrence Atkins, Machine Learning Engineer
David MacLeod, Machine Learning Architect

Table of Contents

1) Introduction
2) Host-Device Synchronization
3) CUDA events
4) Warmup steps
5) Fixed clocks
6) Cache flush
7) Sleep / CUDA graphs

Introduction

If we know anything about machine learning in 2023, it is this: bigger is better. Give your model more data, parameters, and compute, and success is (somewhat) guaranteed[1].

However, larger models are both memory-hungry and slow. To combat this, a range of techniques exist that minimise training and inference compute, thus lowering costs. Two examples are FlashAttention[2] and ZeroQuant[3]. Regardless of the approach, the ability to accurately time individual operations in a computational graph is essential.

Doing so isn't trivial when GPUs are involved. In this blog, we present a comprehensive guide to the tips & tricks required to get accurate and repeatable results. Most are specific to PyTorch, but the principles discussed apply to CUDA programming in general.

Host-Device Synchronization

Our starting point is host-device synchronization.

PyTorch executes GPU kernels asynchronously. While a CUDA kernel runs on the GPU, the CPU continues to queue up further kernels behind it. This avoids being bottlenecked by general overhead costs such as kernel launches and those associated with the Python interpreter.

It also has implications for timing GPU operations. A naïve approach may end up timing the kernel launch instead of kernel execution, like so:
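As a concrete sketch (the matmul workload, shapes, and step count here are our own illustration, not from the original post):

```python
import time

import torch


def naive_timing(steps: int = 10) -> float:
    """Naive measurement: times kernel *launch*, not kernel *execution*."""
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")

    start = time.perf_counter()
    for _ in range(steps):
        a @ b  # enqueued asynchronously; the CPU does not wait for the GPU
    stop = time.perf_counter()  # taken while the GPU may still be working
    return (stop - start) / steps


if torch.cuda.is_available():
    # The reported number mostly reflects launch overhead, not matmul latency
    print(f"Per-step time: {naive_timing() * 1e6:.1f} us")
```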

The common solution is to call torch.cuda.synchronize() before taking a timing measurement. This waits for all kernels in all CUDA streams to complete:

Here's an example in PyTorch:
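A sketch of this pattern (again, the matmul workload and step count are our own illustration):

```python
import time

import torch


def timed_with_sync(fn, steps: int = 10) -> float:
    """Time a GPU operation with explicit host-device synchronization."""
    torch.cuda.synchronize()  # ensure no pending work skews the start time
    start = time.perf_counter()
    for _ in range(steps):
        fn()
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    stop = time.perf_counter()
    return (stop - start) / steps


if torch.cuda.is_available():
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    print(f"Latency per matmul: {timed_with_sync(lambda: a @ b) * 1e3:.2f} ms")
```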

CUDA events

When combining explicit synchronization points with perf_counter, we don't just time kernel execution. We also include some overhead associated with kernel launch. Furthermore, using synchronization points may not be desirable when profiling a performance-critical workload due to slowdowns incurred.

CUDA Events are a neat way to avoid unnecessary synchronization points and hide kernel launch overhead. Here's an example:
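A sketch of event-based timing (the helper name and step count are our own; note that `enable_timing=True` is required for `elapsed_time`):

```python
import torch


def time_with_events(fn, steps: int = 10) -> float:
    """Time a GPU operation using CUDA events recorded into the stream."""
    start_events = [torch.cuda.Event(enable_timing=True) for _ in range(steps)]
    end_events = [torch.cuda.Event(enable_timing=True) for _ in range(steps)]

    for s, e in zip(start_events, end_events):
        s.record()  # enqueue a timestamp in the stream of kernel execution
        fn()
        e.record()

    # Events must have completed before we can read them on the host
    torch.cuda.synchronize()
    times_ms = [s.elapsed_time(e) for s, e in zip(start_events, end_events)]
    return sum(times_ms) / steps


if torch.cuda.is_available():
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    print(f"Mean latency per matmul: {time_with_events(lambda: a @ b):.2f} ms")
```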

We begin by creating two lists of torch.cuda.Event() objects. The record() method essentially puts a timestamp in the stream of kernel execution. We do so before and after the operations that we wish to time. At the end of the for loop, we must include a synchronize() statement before calling s.elapsed_time(e). If we omit it, the CPU may attempt to calculate the elapsed time before the GPU has finished its work, yielding a RuntimeError.

This image illustrates these ideas for steps = 2.

Warm-Up Steps

A further improvement we can make to our above examples is to include warmup steps prior to timed runs. This is needed to discard the overheads only incurred at the start of a training or inference run. Examples include:

  • Optimization passes / codegen applied by PyTorch's JIT fuser after the first few input tensors are encountered.
  • On-the-fly microbenchmarking carried out by torch.backends.cudnn.benchmark when selecting the optimal convolution kernel for a given input shape.
  • Lazy loading of kernels into the CUDA context with CUDA_MODULE_LOADING=LAZY & CUDA 11.7+
  • Overhead of cudaMalloc calls by PyTorch's caching allocator to initially grow the memory pool, ready for later re-use.

Here's an example:
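A sketch combining warmup steps with CUDA-event timing (the warmup and step counts are arbitrary choices; tune them for your workload):

```python
import torch


def benchmark(fn, warmup: int = 10, steps: int = 10) -> float:
    """Time a GPU operation, discarding one-off startup costs first."""
    # Warmup: absorb JIT fusion, cuDNN autotuning, lazy kernel loading,
    # and initial growth of the caching allocator's memory pool
    for _ in range(warmup):
        fn()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(steps):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / steps


if torch.cuda.is_available():
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    print(f"Mean latency per matmul: {benchmark(lambda: a @ b):.2f} ms")
```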

Fixed Clocks

So far, we have focused on making our profiling results accurate. But how can we make them reproducible? GPU clock speed can vary significantly according to limits on temperature and power consumption. As such, fixing the clock enables consistent and reproducible benchmarking.

Here's an example of how to implement this in Python. We use a similar approach to that of OpenAI's Triton Domain-Specific Language (DSL) and Deepmind's AlphaTensor[4] repositories.
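A minimal sketch of such a clock-locking helper, built around nvidia-smi's `--lock-gpu-clocks` / `--reset-gpu-clocks` options (the clock value is an assumption — pick one comfortably below your GPU's maximum; running nvidia-smi this way requires administrator privileges):

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def fixed_clocks(gpu: int = 0, clock_mhz: int = 1350):
    """Lock the SM clock for reproducible benchmarking, restoring it on exit."""
    try:
        subprocess.run(
            ["nvidia-smi", "-i", str(gpu),
             f"--lock-gpu-clocks={clock_mhz},{clock_mhz}"],
            check=True,
        )
        yield
    finally:
        # Always hand control of the clocks back to the driver
        subprocess.run(
            ["nvidia-smi", "-i", str(gpu), "--reset-gpu-clocks"],
            check=True,
        )
```

Usage would be `with fixed_clocks(): run_benchmark()`, so the clocks are reset even if the benchmark raises.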

One caveat is that selecting a clock speed with nvidia-smi doesn't guarantee that your GPU will run at the requested speed. The GPU always retains the ability to decrease the clock rate (throttling) to prevent damage to the hardware. But by setting the clock speed to a value sufficiently below the maximum, we can ensure that throttling is less severe.

Cache Flush

Another important consideration is ensuring that the GPU memory caches are cleared between timing calls. This avoids the possibility of repeated kernel executions exploiting cache hits and artificially reducing latency. One simple solution is to pass different input data for each pass, but we need to be careful that we are covering all bases. For example, when profiling a torch.nn.Linear it may be insufficient to swap out the input data as some of the (static) weights could still persist in the cache across runs.

If the input tensors are large, constantly recreating them can also slow down the development loop. A more robust solution is to explicitly flush the cache between passes. The example below is based on Triton DSL. It works by writing enough data to overwrite any existing cache lines: because the L2 cache on Nvidia GPUs uses a write-back policy, the zeros are initially written to the L2 cache itself, evicting its previous contents.

Sleep / CUDA Graphs

We previously saw that CUDA events hide the overhead of launching a kernel (the fixed time between the host launching a kernel and it being executed on the GPU). However, this is not a silver bullet, as it assumes that there is no time gap between the kernel in question and the surrounding CUDA events in the command queue. That is, it assumes the preceding CUDA event completes immediately before the kernel is due to be executed, and the following CUDA event starts as soon as the kernel is complete.

When we are timing lightweight kernels that are fast to execute, this assumption can break down. Kernel execution may be quicker than kernel launch, meaning the GPU "outruns" the CPU. This can cause spurious results which contain launch overhead in the CUDA events delta, as illustrated here:

Luckily, there are solutions. The simplest is to saturate the command queue prior to launching the target kernel. This ensures that the kernel and its events are enqueued together rather than being executed before the next command has a chance to make it onto the queue:

How should we actually do this? A naïve approach is to launch a sufficiently expensive kernel prior to the operations we are interested in, thus creating a backlog. A cleaner solution is to ask the GPU to wait for a fixed number of instruction cycles, either by using CUDA's __nanosleep or torch.cuda._sleep():
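A sketch using torch.cuda._sleep (a private PyTorch helper; the cycle count below is an arbitrary assumption — choose it so the sleep comfortably exceeds the total launch time of the timed region):

```python
import torch


def time_with_sleep(fn, steps: int = 10) -> float:
    """Time a lightweight kernel, saturating the command queue first."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Queue a GPU busy-wait so the events and kernels below are all enqueued
    # before the GPU reaches them, keeping launch overhead out of the timing
    torch.cuda._sleep(1_000_000)

    start.record()
    for _ in range(steps):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / steps


if torch.cuda.is_available():
    x = torch.randn(1024, device="cuda")
    print(f"Mean latency: {time_with_sleep(lambda: x + 1):.4f} ms")
```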

A second solution is to use CUDA graphs. This minimizes launch overhead by capturing a series of independent kernel launches and replaying them as a single launch. Note that we execute the target kernel multiple times within the graph capture to amortize the remaining launch overhead:
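A sketch using torch.cuda.graph (the side-stream warmup is required by the capture API; the timed operation must be capturable, i.e. use static tensors and avoid host synchronization):

```python
import torch


def time_with_cuda_graph(fn, steps: int = 10) -> float:
    """Time a lightweight kernel by replaying it from a captured CUDA graph."""
    # Warm up on a side stream before capture, as the capture API requires
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            fn()
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for _ in range(steps):  # replicate the kernel to amortize launch cost
            fn()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    g.replay()  # one launch for all captured kernels
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / steps


if torch.cuda.is_available():
    x = torch.randn(1024, device="cuda")
    print(f"Mean latency: {time_with_cuda_graph(lambda: x.add_(1)):.4f} ms")
```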

References

[1] Hoffmann, Jordan, et al. "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556 (2022).

[2] Dao, Tri, et al. "Flashattention: Fast and memory-efficient exact attention with io-awareness." Advances in Neural Information Processing Systems 35 (2022): 16344-16359.

[3] Yao, Zhewei, et al. "ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers." Advances in Neural Information Processing Systems 35 (2022): 27168-27183.

[4] Fawzi, Alhussein, et al. "Discovering faster matrix multiplication algorithms with reinforcement learning." Nature 610.7930 (2022): 47-53.
Authors: Lawrence Atkins & David MacLeod
Acknowledgements: Caroline Dockes, Ed Rees, Ellena Reid & Markus Hennerbichler
