What is Horovod?

A review of the distributed training framework for TensorFlow developed by Uber
20 October 2017


Horovod is a distributed training framework for TensorFlow. The goal of Horovod is to make distributed Deep Learning fast and easy to use.

Why not traditional Distributed TensorFlow?

The primary motivation for this project is to make it easy to take a single-GPU TensorFlow program and successfully train it on many GPUs faster. This has two aspects:

  1. How many modifications does one have to make to a program to make it distributed, and how easy is it to run?
  2. How much faster would it run in distributed mode?

Internally at Uber we found that it's much easier for people to understand an MPI model that requires minimal changes to source code than to understand how to set up regular Distributed TensorFlow.

To give some perspective on that, this commit into our fork of TF Benchmarks shows how much code can be removed if one doesn't need to worry about towers and manually averaging gradients across them: tf.Server(), tf.ClusterSpec(), tf.train.SyncReplicasOptimizer(), tf.train.replica_device_setter() and so on. If none of these things make sense to you, don't worry: you don't have to learn them if you use Horovod.
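To illustrate how few changes that amounts to, below is a minimal sketch of the Horovod pattern applied to a toy TensorFlow (1.x-era) training loop. The model and hyperparameters are placeholders, not code from the article, but the hvd.* calls are the public horovod.tensorflow API:

    import tensorflow as tf
    import horovod.tensorflow as hvd

    # Initialize Horovod: one process is launched per GPU.
    hvd.init()

    # Pin this process to a single GPU based on its local rank.
    config = tf.ConfigProto()
    config.gpu_options.visible_device_list = str(hvd.local_rank())

    # A toy model standing in for an existing single-GPU graph.
    x = tf.random_normal([32, 10])
    w = tf.Variable(tf.zeros([10, 1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - 1.0))

    # Scale the learning rate by the number of workers and wrap the
    # optimizer so gradients are averaged across workers via allreduce.
    opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)

    global_step = tf.train.get_or_create_global_step()
    train_op = opt.minimize(loss, global_step=global_step)

    # Broadcast initial variable states from rank 0 so all workers
    # start from the same weights.
    hooks = [hvd.BroadcastGlobalVariablesHook(0),
             tf.train.StopAtStepHook(last_step=1000)]

    with tf.train.MonitoredTrainingSession(config=config,
                                           hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)

A script like this is then launched with MPI, one process per GPU, e.g. mpirun -np 4 python train.py.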

In addition to being easy to use, Horovod is fast. Below is a chart of a benchmark that was run on 32 servers with 4 Pascal GPUs each, connected by a RoCE-capable 25 Gbit/s network:

[Figure: Horovod benchmark]

Horovod achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 79% scaling efficiency for VGG-16.

While installing MPI and NCCL may seem like an extra hassle, it only needs to be done once by the team dealing with infrastructure, while everyone else in the company who builds models can enjoy the simplicity of training them at scale.

Learn more at GitHub.

Facebook Releases PyTorch 1.0

The release adds support for major cloud platforms, a C++ frontend, and a JIT compiler
10 December 2018

Facebook has released the stable version of its machine learning library, PyTorch 1.0. This iteration adds support for major cloud platforms, a C++ frontend, a JIT compiler, and various improvements.

The stable version ships with a JIT compiler that removes the code's dependence on the Python interpreter. Model code is converted into TorchScript, a statically analyzable subset of Python. While the model can still be worked with in a Python environment, it can also be loaded into projects unrelated to that language; the PyTorch developers state that code processed this way can be used from the C++ API.
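As a rough illustration of that workflow (the model here is a made-up toy, but torch.jit.trace, save and torch.jit.load are the public API), converting a module to TorchScript looks like this:

    import torch

    class TinyNet(torch.nn.Module):
        def __init__(self):
            super(TinyNet, self).__init__()
            self.fc = torch.nn.Linear(10, 2)

        def forward(self, x):
            return torch.relu(self.fc(x))

    # Tracing runs the module once on an example input and records the
    # executed operations as a TorchScript graph.
    traced = torch.jit.trace(TinyNet(), torch.randn(1, 10))

    # The serialized module no longer needs the Python class definition;
    # it can be reloaded here, or from C++ via torch::jit::load.
    traced.save("tiny_net.pt")
    loaded = torch.jit.load("tiny_net.pt")
    print(loaded(torch.randn(1, 10)))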

The torch.distributed package and the torch.nn.parallel.DistributedDataParallel module have been completely redesigned. torch.distributed now offers better performance and works asynchronously with the Gloo, NCCL and MPI backends.
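A minimal sketch of how the redesigned package is used; the single-process setup below, with a hard-coded address and the gloo backend, is only for illustration, since in practice a launcher sets the rank and world size for each process:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    # Join the process group; rank and world_size would normally come
    # from the launcher's environment rather than be hard-coded.
    dist.init_process_group(backend="gloo",
                            init_method="tcp://127.0.0.1:23456",
                            rank=0, world_size=1)

    model = torch.nn.Linear(10, 2)
    ddp_model = DistributedDataParallel(model)

    # backward() on the wrapped model averages gradients across all
    # processes in the group.
    out = ddp_model(torch.randn(4, 10))
    out.sum().backward()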

The developers also added a C++ frontend to PyTorch 1.0. It contains analogs of the Python interface components, exposed in C++ as torch::nn, torch::optim and torch::data. According to the creators, the new interface should provide high performance for C++ applications. The C++ API is still experimental, but it can already be used in projects.

To make working with PyTorch 1.0 more convenient, a Torch Hub repository has been created that stores pre-trained neural network models. You can publish your own model by adding a hubconf.py file to a repository, after which the model becomes available for download by any user via the torch.hub.load API.
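A hedged sketch of what such a hubconf.py entry point might look like; the repository and model names below are hypothetical, not from the release:

    # hubconf.py at the root of a (hypothetical) GitHub repository.
    dependencies = ['torch']  # packages required to load the model

    import torch

    def tiny_net(pretrained=False, **kwargs):
        """Entry point; callable via torch.hub.load."""
        model = torch.nn.Linear(10, 2)
        if pretrained:
            # Weights would be fetched from a checkpoint URL here.
            pass
        return model

A consumer would then fetch it with torch.hub.load('user/repo', 'tiny_net'), where 'user/repo' is the hypothetical GitHub repository containing the file.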

Support for C extensions and the torch.utils.trainer module has been removed from the library.

Facebook released a preview version of PyTorch 1.0 at the beginning of October 2018, and within two months the developers brought the framework to a stable release.