Nov 17, 2022 | Read time 5 min

How to Successfully Achieve Multinode Training in PyTorch

Join Speechmatics Machine Learning Engineer, Ellena Reid, as she looks at the best way to turn a single node training setup into a robust, platform-agnostic, multinode one.
Ellena Reid, Machine Learning Engineer

Training large neural networks requires plenty of time – and plenty of compute. The temptation for many is to speed things up simply by throwing more compute at the problem. But without careful implementation this can do more harm than good. What’s more, most companies don’t own enough GPUs to train these large models on internal infrastructure. Instead, they rely on the cloud.

In this post, we’ll discuss how to turn a single node training setup into a robust, platform-agnostic, multinode one.

A Question of Quantity

The first question we can ask is, how many nodes is too many nodes? To begin, we need to consider when and why training across multiple nodes leads to faster training. Assuming we can fit our entire model on a single GPU (a valid assumption for models up to ~5B parameters on an A100), we can use the extra compute to increase our effective batch size, where

effective_batch_size = local_batch_size * grad_acc_steps * n_gpus_per_node * n_nodes.

If we want to train a model with a batch size of 128, but can only fit a batch size of 4 on an individual GPU, we have three choices:

1. Increase the number of gradient accumulation (grad_acc) steps

e.g. 128 = 4 * 32 * 1 * 1

2. Increase the number of GPUs per node

e.g. 128 = 4 * 4 * 8 * 1

3. Increase the number of nodes

e.g. 128 = 4 * 1 * 8 * 4

Algorithmically, the above approaches are equivalent and will produce the same losses. Option 3 will simply get there faster (assuming our network bandwidth is high enough, see below). Notice that we always use the maximum local batch size we can fit in memory, then use the extra compute to decrease the number of gradient accumulation steps.
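As a quick check of the arithmetic, the formula can be written as a small Python function and applied to the three options. This is only an illustrative sketch; the function name and the option labels are ours, not part of any library.

```python
def effective_batch_size(local_batch_size, grad_acc_steps, n_gpus_per_node, n_nodes):
    """Number of samples that contribute to one optimizer step."""
    return local_batch_size * grad_acc_steps * n_gpus_per_node * n_nodes


# (local_batch_size, grad_acc_steps, n_gpus_per_node, n_nodes) for each option
options = {
    "1. more gradient accumulation": (4, 32, 1, 1),
    "2. more GPUs per node": (4, 4, 8, 1),
    "3. more nodes": (4, 1, 8, 4),
}

for name, config in options.items():
    print(f"{name}: effective batch size = {effective_batch_size(*config)}")
```

All three configurations give an effective batch size of 128; only the number of sequential accumulation steps per optimizer step differs (32, 4 and 1).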

In Option 3, we’re using 32 GPUs (4 nodes, 8 GPUs per node). At this point, increasing the number of nodes further would be futile, as the number of gradient accumulation steps cannot be decreased further. It is also worth noting that even increasing the effective batch size beyond a certain point (the critical batch size) does not improve convergence; in this regime you gain nothing by increasing the number of nodes further.
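The limit on useful nodes can be made concrete with a short sketch. It assumes, as in the text, that the local batch size is fixed at the maximum that fits in memory and that the target effective batch size must be hit exactly; the helper name is hypothetical.

```python
def required_grad_acc_steps(target_batch_size, local_batch_size, n_gpus_per_node, n_nodes):
    """Gradient accumulation steps needed to reach the target effective batch size."""
    samples_per_step = local_batch_size * n_gpus_per_node * n_nodes
    if target_batch_size % samples_per_step != 0:
        raise ValueError(
            f"{samples_per_step} samples per step cannot be accumulated "
            f"to exactly {target_batch_size}"
        )
    return target_batch_size // samples_per_step


for n_nodes in (1, 2, 4, 8):
    try:
        steps = required_grad_acc_steps(128, 4, 8, n_nodes)
        print(f"{n_nodes} node(s): {steps} gradient accumulation step(s)")
    except ValueError as error:
        print(f"{n_nodes} node(s): {error}")
```

With 8 GPUs per node and a local batch size of 4, the accumulation steps fall from 4 to 2 to 1 as nodes are added. At 8 nodes a single step already processes 256 samples, so the batch size of 128 can no longer be kept without shrinking the local batch size.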

Trivial as this may seem, note that varying the number of nodes does not require varying the effective batch size. This point will be crucial in a future post, where we will discuss the use of distributed checkpoints with elastic jobs.

Networking and Environment

While multinode training may be algorithmically the same as single node training, it poses some engineering challenges. During the backward pass, the gradients must sync across all nodes. In single node training, GPUs are typically connected via NVLinks/NVSwitches to ensure fast communication. In multinode training, however, they must communicate over a network. You could simply use Ethernet, but this will severely bottleneck your training throughput.
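To see why the gradient sync keeps multinode training algorithmically equivalent, here is a toy simulation in plain Python (no PyTorch required). It makes several simplifying assumptions: each sample contributes a single scalar gradient, the loss is a mean over samples, and the sync averages gradients across workers, which is how data-parallel training typically behaves. All names are illustrative.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulated_step(per_sample_grads, local_batch_size, grad_acc_steps, world_size):
    """Gradient seen by the optimizer after accumulation and a cross-worker average."""
    samples = iter(per_sample_grads)
    worker_grads = []
    for _ in range(world_size):
        accumulated = 0.0
        for _ in range(grad_acc_steps):
            micro_batch = [next(samples) for _ in range(local_batch_size)]
            # Each micro-batch contributes its mean gradient, scaled by 1 / grad_acc_steps.
            accumulated += mean(micro_batch) / grad_acc_steps
        worker_grads.append(accumulated)
    # Stand-in for the all-reduce: every worker ends up with the average.
    return mean(worker_grads)


random.seed(0)
grads = [random.gauss(0.0, 1.0) for _ in range(128)]

single_gpu = simulated_step(grads, local_batch_size=4, grad_acc_steps=32, world_size=1)
four_nodes = simulated_step(grads, local_batch_size=4, grad_acc_steps=1, world_size=32)
print(single_gpu, four_nodes, mean(grads))
```

The three printed values agree up to floating-point rounding: whether the 128 samples are processed sequentially on one GPU or spread over 32 GPUs, the optimizer sees the same averaged gradient. What changes is that the multinode version has to move gradients over the network on every step.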

Alternatively, if you use the combination of InfiniBand and RDMA, you can achieve near-linear scaling across nodes. Remote Direct Memory Access (RDMA) lets one computer access the main memory of another without involving an operating system, cache, or storage. (InfiniBand refers both to the physical link-layer protocol for InfiniBand networks and to the InfiniBand Verbs API, an implementation of RDMA.)

While fast, InfiniBand is notoriously fussy about software version compatibility and can be challenging to debug. To address the software versioning issue, we recommend using one of the battle-tested PyTorch containers provided by NVIDIA.

Make sure you choose a container whose CUDA version matches the drivers installed on the machines. To install additional dependencies, build on top of it with a simple Dockerfile.

Once the container is built, we recommend converting it into a Singularity image.
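The exact conversion command is not reproduced here. As one possible sketch, the snippet below assembles a `singularity build` invocation from Python, assuming the Docker image has been pushed to a registry so it can be referenced with a `docker://` URI. The registry, image name, and output path are hypothetical placeholders.

```python
import shlex


def singularity_build_command(sif_path, docker_image):
    """Argument list that converts a registry-hosted Docker image into a .sif file."""
    return ["singularity", "build", sif_path, f"docker://{docker_image}"]


# Hypothetical image name and output path, for illustration only.
command = singularity_build_command(
    "pytorch_train.sif", "registry.example.com/team/pytorch-train:latest"
)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would run the build on a machine where Singularity is installed. An image that exists only in the local Docker daemon needs a different source URI.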

Notable benefits of Singularity images are:

Ease of deployment: no daemon runs as root on each node; a container is simply an executable.

Ability to mount local filesystems or do bind mappings so that file paths on cloud nodes can appear to match local ones.

To execute your training command inside the Singularity container, simply run:
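The original command is not shown here, so the following is a sketch under stated assumptions rather than the author's invocation. It builds a `singularity exec` command with the `--nv` flag (which exposes the host's NVIDIA GPUs and driver libraries) and `--bind` mappings, so that paths on a cloud node appear at the locations the training script expects. The image name, paths, and training script are hypothetical.

```python
import shlex


def singularity_exec_command(image, train_command, binds=None, use_gpu=True):
    """Argument list that runs `train_command` inside a Singularity image."""
    command = ["singularity", "exec"]
    if use_gpu:
        command.append("--nv")  # make the host NVIDIA driver and GPUs visible
    for host_path, container_path in (binds or {}).items():
        command += ["--bind", f"{host_path}:{container_path}"]
    command.append(image)
    command += list(train_command)
    return command


# Hypothetical paths and script name, for illustration only.
command = singularity_exec_command(
    "pytorch_train.sif",
    ["python", "train.py"],
    binds={"/mnt/cloud/data": "/data"},
)
print(shlex.join(command))
```

The same list can be handed to `subprocess.run(command, check=True)` from a launcher script, or the printed line can be pasted into a shell on each node.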

To address the challenge of debugging networking issues, we highly recommend:

1) Using bandwidth tests (for example, ib_write_bw from the perftest package) to check InfiniBand is correctly configured.

2) Setting debug environment variables in your Python script for more verbose logging.
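The specific variables used in the original post are not listed here. A minimal sketch, assuming NCCL is the communication backend, might set a few commonly used debug variables before the process group is initialized. The selection and values below are our assumption, not necessarily the author's.

```python
import os

# Commonly used debug settings; this particular selection is an assumption.
DEBUG_ENV = {
    "NCCL_DEBUG": "INFO",  # NCCL prints initialization and transport details
    "NCCL_DEBUG_SUBSYS": "INIT,NET",  # restrict NCCL output to setup and networking
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",  # extra consistency checks and logging in torch.distributed
}


def enable_verbose_distributed_logging(env=os.environ):
    """Set the debug variables; call this before initializing the process group."""
    for name, value in DEBUG_ENV.items():
        env[name] = value
    return {name: env[name] for name in DEBUG_ENV}


print(enable_verbose_distributed_logging())
```

With `NCCL_DEBUG=INFO`, the NCCL log typically reports which network transport was selected, which is a quick way to spot a job that has silently fallen back from InfiniBand to plain sockets.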

For more information on environment variables, see the PyTorch docs and NVIDIA docs.

Error Handling

Even with careful implementation, it is unlikely your job will run the first time with no errors. It's important to be notified if errors occur so they can be handled efficiently, and training can be resumed as soon as possible. To achieve this, set up a webhook to send a message if the process exits. Use the trap command in bash to catch signals, then execute a simple function to clean up hanging processes on the nodes and send an alert.
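The text describes a bash `trap`; the sketch below is a Python analogue of the same idea using the standard `signal` module, not the author's script. The webhook URL, the payload format, and the function names are hypothetical, and the cleanup routine is supplied by the caller.

```python
import json
import signal
import urllib.request

WEBHOOK_URL = "https://example.com/hypothetical-webhook"  # placeholder, not a real endpoint


def build_alert(signum, job_name):
    """JSON payload describing which signal stopped the job."""
    signal_name = signal.Signals(signum).name
    return json.dumps({"text": f"Job {job_name} received {signal_name}"}).encode("utf-8")


def post_alert(payload, url=WEBHOOK_URL):
    """POST the payload to the webhook (the expected payload shape is an assumption)."""
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}, method="POST"
    )
    urllib.request.urlopen(request, timeout=10)


def install_alert_handlers(job_name, cleanup, send=post_alert,
                           signals=(signal.SIGTERM, signal.SIGINT)):
    """On each signal: run cleanup, send an alert, then exit with a non-zero code."""
    def handler(signum, frame):
        try:
            cleanup()  # e.g. terminate hanging worker processes
        finally:
            send(build_alert(signum, job_name))
        raise SystemExit(128 + signum)

    for sig in signals:
        signal.signal(sig, handler)
```

A launcher would call `install_alert_handlers("my-job", cleanup=kill_workers)` once at startup, where `kill_workers` is whatever routine terminates stray processes on the node. Like `trap`, this only covers signals that can be caught; a process killed with SIGKILL needs an external watchdog.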

In Summary

Today we’ve discussed the motivation for multinode training and how to overcome some of the challenges in implementing it. Specifically, we have seen that networking setup and containerization are critical to performant multinode training. Additionally, we observe that errors can still occur, but robust handling of them can increase the uptime of the training run to 99%.

In a future post, we’ll take an in-depth look at resuming training of a stateful model. We’ll see how distributed checkpoints can be used to pause and resume training smoothly, even across a varying number of GPUs/nodes.

Ellena Reid, Machine Learning Engineer, Speechmatics
