What is Horovod?
Horovod is a distributed training framework for TensorFlow. Its goal is to make distributed deep learning fast and easy to use.
Why not traditional Distributed TensorFlow?
The primary motivation for this project is to make it easy to take a single-GPU TensorFlow program and train it faster on many GPUs. This has two aspects:
- How many modifications does one have to make to a program to make it distributed, and how easy is it to run?
- How much faster would it run in distributed mode?
Internally at Uber we found that it's much easier for people to understand an MPI model that requires minimal changes to source code than to understand how to set up regular Distributed TensorFlow.
To give some perspective on that, this commit into our fork of TF Benchmarks shows how much code can be removed if one doesn't need to worry about towers, manually averaging gradients across them, tf.train.replica_device_setter(), and so on. If none of these things make sense to you, don't worry: you don't have to learn them if you use Horovod.
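For readers who have not seen it, "averaging gradients" across towers or workers can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not Horovod's implementation: Horovod performs the equivalent step with an allreduce over MPI/NCCL. The function name average_gradients, the Gradients type, and the sample values are all hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: variable name -> flat list of gradient values.
Gradients = Dict[str, List[float]]


def average_gradients(per_worker: List[Gradients]) -> Gradients:
    """Average each variable's gradient element-wise across all workers."""
    if not per_worker:
        raise ValueError("need gradients from at least one worker")
    n = len(per_worker)
    averaged: Gradients = {}
    for name in per_worker[0]:
        # Group the i-th element of this variable's gradient from every worker.
        columns = zip(*(worker[name] for worker in per_worker))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged


# Made-up gradients from two workers for the same two variables.
worker_grads = [
    {"w": [1.0, 2.0], "b": [0.5]},
    {"w": [3.0, 4.0], "b": [1.5]},
]
avg = average_gradients(worker_grads)
print(avg)  # {'w': [2.0, 3.0], 'b': [1.0]}
```

Every worker then applies the same averaged gradients, so all model replicas stay in sync. This is the bookkeeping that no longer has to be written by hand.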
In addition to being easy to use, Horovod is fast. The chart below shows a benchmark run on 32 servers, each with 4 Pascal GPUs, connected by a RoCE-capable 25 Gbit/s network:
Horovod achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 79% scaling efficiency for VGG-16.
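To put these percentages in perspective, the sketch below assumes the common definition of scaling efficiency: measured throughput on N GPUs divided by N times the single-GPU throughput. The text above does not state this definition, so treat it as an assumption. Under it, the efficiencies quoted above imply the following speedups on the 32 x 4 = 128 GPUs of the benchmark. The function names are hypothetical.

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """Measured throughput on n_gpus relative to ideal linear scaling."""
    return throughput_n / (throughput_1 * n_gpus)


def effective_speedup(efficiency: float, n_gpus: int) -> float:
    """Speedup over a single GPU implied by a given scaling efficiency."""
    return efficiency * n_gpus


n_gpus = 32 * 4  # 32 servers with 4 GPUs each

# Efficiencies as quoted in the text; speedups are derived, not measured.
quoted = [("Inception V3", 0.90), ("ResNet-101", 0.90), ("VGG-16", 0.79)]
for model, efficiency in quoted:
    speedup = effective_speedup(efficiency, n_gpus)
    print(f"{model}: about {speedup:.1f}x over one GPU on {n_gpus} GPUs")
```

Under that assumption, 90% efficiency on 128 GPUs corresponds to roughly a 115x speedup over a single GPU, and 79% to roughly 101x.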
Installing MPI and NCCL may seem like an extra hassle, but it only needs to be done once by the team that manages infrastructure. Everyone else in the company who builds models can then enjoy the simplicity of training them at scale.
Learn more on GitHub.