What is Horovod?

A review of Horovod, a distributed training framework for TensorFlow developed by Uber
20 October 2017

What is Horovod?

Horovod is a distributed training framework for TensorFlow. The goal of Horovod is to make distributed Deep Learning fast and easy to use.

Why not traditional Distributed TensorFlow?

The primary motivation for this project is to make it easy to take a single-GPU TensorFlow program and successfully train it on many GPUs faster. This has two aspects:

  1. How many modifications does one have to make to a program to make it distributed, and how easy is it to run?
  2. How much faster would it run in distributed mode?

Internally at Uber we found that it's much easier for people to understand an MPI model that requires minimal changes to source code than to understand how to set up regular Distributed TensorFlow.

To give some perspective on that, this commit into our fork of TF Benchmarks shows how much code can be removed if one doesn't need to worry about towers and manually averaging gradients across them, tf.train.Server(), tf.train.ClusterSpec(), tf.train.SyncReplicasOptimizer(), tf.train.replica_device_setter() and so on. If none of these things make sense to you, don't worry: you don't have to learn them if you use Horovod.
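
To make "averaging gradients" concrete, here is a toy sketch in plain Python. It is not Horovod or TensorFlow code: the helpers average_gradients and sgd_step are hypothetical names, and plain lists stand in for per-GPU gradient tensors. It only illustrates the bookkeeping that a framework can take off your hands.

```python
# Toy illustration only: plain Python lists stand in for per-GPU gradients.
# average_gradients and sgd_step are hypothetical helpers, not library APIs.


def average_gradients(per_worker_grads):
    """Average gradients element-wise across workers.

    per_worker_grads holds one list of floats per worker,
    with one float per model parameter.
    """
    num_workers = len(per_worker_grads)
    if num_workers == 0:
        raise ValueError("need at least one worker")
    num_params = len(per_worker_grads[0])
    if any(len(grads) != num_params for grads in per_worker_grads):
        raise ValueError("all workers must report the same number of gradients")
    return [
        sum(grads[i] for grads in per_worker_grads) / num_workers
        for i in range(num_params)
    ]


def sgd_step(params, grads, learning_rate):
    """Apply one plain gradient-descent update with the averaged gradients."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Three made-up workers, each holding gradients for two parameters.
worker_grads = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
averaged = average_gradients(worker_grads)
new_params = sgd_step([10.0, 10.0], averaged, learning_rate=0.5)
print(averaged, new_params)
```

Every worker applies the same averaged update, so all model copies stay in sync after each step.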

In addition to being easy to use, Horovod is fast. Below is a chart showing a benchmark run on 32 servers with 4 Pascal GPUs each, connected by a RoCE-capable 25 Gbit/s network:

Horovod Benchmark

Horovod achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 79% scaling efficiency for VGG-16.
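
The article does not define scaling efficiency. The sketch below assumes the common definition: measured multi-GPU throughput divided by the ideal throughput (single-GPU throughput times the number of GPUs). The function name and the throughput figures are made up for illustration and are not taken from the benchmark; only the GPU count (32 servers with 4 GPUs each) comes from the text.

```python
def scaling_efficiency(single_gpu_throughput, num_gpus, measured_throughput):
    """Return measured throughput as a fraction of ideal linear scaling.

    Assumed definition: ideal throughput = single-GPU throughput * number of GPUs.
    """
    if single_gpu_throughput <= 0 or num_gpus <= 0:
        raise ValueError("throughput and GPU count must be positive")
    ideal_throughput = single_gpu_throughput * num_gpus
    return measured_throughput / ideal_throughput


# 32 servers with 4 GPUs each, as in the benchmark setup described above.
num_gpus = 32 * 4

# Hypothetical numbers: 100 images/sec on one GPU, 11520 images/sec on all GPUs.
efficiency = scaling_efficiency(100.0, num_gpus, 11520.0)
print(f"{efficiency:.0%}")
```

Under this definition, 90% efficiency on 128 GPUs means the cluster does the work of about 115 perfectly scaled GPUs.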

While installing MPI and NCCL may seem like an extra hassle, it only needs to be done once by the team dealing with infrastructure, while everyone else in the company who builds the models can enjoy the simplicity of training them at scale.

Learn more at GitHub.

Python News Digest 8 - 14.02

Learn about the best Python tools, why sys.getsizeof is not what you need, how to call await on multiple functions and more
14 February 2020

Greetings! I hope your week went great! Here's a new Python news digest.

Learn how parallelism can slow down your Python code, how to implement an interface in Python, how to check if a file is a valid image with Python, and other interesting things that await you in this digest.

Articles

  • Understanding Best Practice Python Tooling by Comparing Popular Project Templates

The author reviews and compares the most popular Python tools in this long article

  • The Parallelism Blues: when faster code is slower

Learn when, why and how parallelism can slow down your Python app

  • sys.getsizeof is not what you want

Learn why sys.getsizeof does not count all the bytes, and sometimes counts the wrong ones
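
A quick way to see the sys.getsizeof behaviour for yourself, assuming CPython: the size reported for a list covers the list object only, not the objects it refers to. The helper one_level_sizeof is a hypothetical name used for illustration; it is not taken from the linked article and only looks one level deep.

```python
import sys

payload = "x" * 10_000
container = [payload]

# Size of the list object itself; the string it refers to is not included.
shallow_size = sys.getsizeof(container)
payload_size = sys.getsizeof(payload)


def one_level_sizeof(items):
    """Hypothetical helper: list size plus the size of each distinct element.

    It does not descend into nested containers.
    """
    seen_ids = set()
    total = sys.getsizeof(items)
    for item in items:
        if id(item) not in seen_ids:
            seen_ids.add(id(item))
            total += sys.getsizeof(item)
    return total


print(shallow_size, payload_size, one_level_sizeof(container))
```

The list reports a size far smaller than the string it holds, which is why sys.getsizeof alone is a poor estimate of how much memory a data structure really uses.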

Guides

  • Implementing an Interface in Python

Tutorial for beginners on how to use a Python interface; understand why interfaces are so useful and learn how to implement formal and informal interfaces in Python

  • Python asyncio and await'ing multiple functions

In this tiny tutorial, you'll learn how to call await on multiple functions in Python using the asyncio package

  • How to Check if a File is a Valid Image with Python

A really small but useful tutorial that shows you how to check if a certain file is a valid image using Python

  • Understand Group by in Django with SQL

Learn and understand what GROUP BY in Django ORM is by comparing QuerySets and SQL
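
For the asyncio guide listed above, here is a minimal sketch of awaiting several coroutines at once with asyncio.gather. The coroutine fetch and its delays are made up for illustration and are not taken from the linked tutorial.

```python
import asyncio


async def fetch(name, delay):
    """Stand-in for an I/O-bound call; sleeps instead of doing real work."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def main():
    # gather runs both coroutines concurrently and returns their results
    # in the order the awaitables were passed in, not in completion order.
    return await asyncio.gather(fetch("a", 0.02), fetch("b", 0.01))


results = asyncio.run(main())
print(results)
```

The second call finishes first, but the result list still follows the argument order.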

Updates

  • virtualenv

A virtual environment builder for Python