Add support for Horovod as a distributed backend #1518
Labels: feature (is an improvement or enhancement), help wanted (open to be worked on), priority: 0 (high priority task)
🚀 Feature
Horovod is a framework for performing data-parallel distributed training for PyTorch (in addition to other frameworks like TensorFlow and MXNet). It uses the allreduce technique to synchronously aggregate gradients across workers, similar to PyTorch's DDP API.
The goal of this feature is to implement support for Horovod as another `distributed_backend` option for PyTorch Lightning, providing an abstraction layer over the Horovod API so users don't need to modify their training code when scaling from one GPU to many.

Motivation
At Uber, many of our researchers are interested in adopting PyTorch Lightning as a standard platform-level API. Because our infrastructure is highly integrated with Horovod, one of the prerequisites for adoption is to be able to run PyTorch Lightning using Horovod for distributed training.
We considered making this an internal layer built on top of PyTorch Lightning, but because Horovod is a popular API used by other companies in industry, we thought this would make the most sense as a contribution to PyTorch Lightning.
Pitch
With this change, all a user would need to do to add Horovod support is make the following change to their Trainer to run on GPU (single or multiple):
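A sketch of what that might look like, based on this proposal (the exact argument names are part of the proposal and could change before merging; `model` stands in for the user's existing, unmodified LightningModule):

```python
from pytorch_lightning import Trainer

# Select Horovod as the distributed backend. With Horovod, each launched
# process drives a single GPU, so `gpus=1` here; the total number of
# processes/GPUs is controlled by horovodrun at launch time, not in code.
trainer = Trainer(distributed_backend='horovod', gpus=1)
trainer.fit(model)  # `model` is the user's existing LightningModule, unchanged
```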
Or to run on CPU:
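Again a sketch of the proposed API: the same backend selection, with no GPUs requested, and Horovod still coordinating the worker processes:

```python
from pytorch_lightning import Trainer

# CPU-only: same backend flag, no `gpus` argument; Horovod handles the
# allreduce across however many CPU processes horovodrun launches.
trainer = Trainer(distributed_backend='horovod')
trainer.fit(model)
```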
Then the training script can be launched via the `horovodrun` command-line tool, where the host/GPU allocation is specified (e.g. `horovodrun -np 4 -H host1:2,host2:2 python train.py`).

Alternatives
Build Horovod support outside of PyTorch Lightning. This has been done by some users in the past, but it requires building a separate abstraction over Lightning. It'll be difficult to keep such solutions in sync as Lightning continues to add new features, or to make them fully compatible with user LightningModules (if we need to use the same methods/hooks to implement the required functionality).
Launch Horovod in-process as opposed to from a driver application. Horovod supports launching programmatically via the `horovod.run` API. However, this requires pickling code, which is prone to serialization errors for some models. Most Horovod users are accustomed to using `horovodrun` / `mpirun` to launch their jobs. Also, using `horovodrun` allows us to decouple the training code from the resource requirements (num_gpus, etc.), which is useful for our users.

Additional context
A proof of concept has been implemented here: master...tgaddair:horovod
Docs and unit tests are forthcoming. But before creating a full PR, I wanted to get the thoughts of the PyTorch Lightning devs to see if this implementation aligns with your goals for the project.
cc @alsrgv