What is Horovod?

A review of a distributed training framework for TensorFlow, developed by Uber
20 October 2017

Horovod is a distributed training framework for TensorFlow. The goal of Horovod is to make distributed Deep Learning fast and easy to use.

Why not traditional Distributed TensorFlow?

The primary motivation for this project is to make it easy to take a single-GPU TensorFlow program and successfully train it on many GPUs faster. This has two aspects:

  1. How many modifications does one have to make to a program to make it distributed, and how easy is it to run?
  2. How much faster would it run in distributed mode?

Internally at Uber we found that it's much easier for people to understand an MPI model that requires minimal changes to source code than to understand how to set up regular Distributed TensorFlow.

To give some perspective on that, this commit into our fork of TF Benchmarks shows how much code can be removed if one doesn't need to worry about towers and manually averaging gradients across them, tf.train.Server(), tf.train.ClusterSpec(), tf.train.SyncReplicasOptimizer(), tf.train.replica_device_setter() and so on. If none of these things makes sense to you, don't worry: you don't have to learn them if you use Horovod.
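To make "manually averaging gradients" concrete, here is a minimal conceptual sketch in plain Python. It is not Horovod's API or implementation: the function names and the gradient values are hypothetical, and the averaging is done in a single process. It only illustrates the kind of result an MPI-style averaging step hands back to every worker before the weights are updated.

```python
from typing import List


def average_gradients(per_worker_grads: List[List[float]]) -> List[float]:
    """Average gradients element-wise across workers.

    Conceptually, this is the value every worker ends up with after a
    collective averaging step. Hypothetical helper, for illustration only.
    """
    if not per_worker_grads:
        raise ValueError("need gradients from at least one worker")
    num_workers = len(per_worker_grads)
    length = len(per_worker_grads[0])
    if any(len(g) != length for g in per_worker_grads):
        raise ValueError("all workers must report gradients of the same length")
    return [
        sum(g[i] for g in per_worker_grads) / num_workers
        for i in range(length)
    ]


def sgd_step(weights: List[float], grads: List[float], lr: float) -> List[float]:
    """Apply one plain gradient-descent update with the averaged gradients."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Hypothetical gradients computed by 4 workers on different data batches.
worker_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
averaged = average_gradients(worker_grads)
new_weights = sgd_step([0.0, 0.0], averaged, lr=0.5)
print(averaged, new_weights)
```

Because every worker applies the same averaged gradient, all model replicas stay in sync after each step. This is the bookkeeping the paragraph above says one no longer has to write by hand.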

In addition to being easy to use, Horovod is fast. Below is a chart of a benchmark run on 32 servers with 4 Pascal GPUs each, connected by a RoCE-capable 25 Gbit/s network:

Horovod Benchmark

Horovod achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 79% scaling efficiency for VGG-16.
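The article does not define scaling efficiency. A common reading, assumed here, is the measured multi-GPU throughput divided by the ideal throughput of perfect linear scaling from a single GPU. The sketch below applies that definition to the 32 x 4 = 128 GPU setup described above. The throughput figures are hypothetical placeholders, not benchmark results.

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, num_gpus: int) -> float:
    """Ratio of measured throughput to ideal linear-scaling throughput.

    Assumed definition: ideal throughput = single-GPU throughput * number of GPUs.
    """
    if num_gpus <= 0 or throughput_1 <= 0:
        raise ValueError("num_gpus and throughput_1 must be positive")
    return throughput_n / (throughput_1 * num_gpus)


num_gpus = 32 * 4            # 32 servers with 4 GPUs each, as in the benchmark setup
single_gpu_rate = 100.0      # hypothetical images/sec on one GPU
measured_rate = 11520.0      # hypothetical images/sec on all 128 GPUs

efficiency = scaling_efficiency(measured_rate, single_gpu_rate, num_gpus)
print(f"{efficiency:.0%} scaling efficiency on {num_gpus} GPUs")
```

Under this definition, 90% efficiency on 128 GPUs means the cluster does the work of roughly 115 perfectly scaled GPUs.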

While installing MPI and NCCL may seem like an extra hassle, it only needs to be done once by the team dealing with infrastructure, while everyone else in the company who builds models can enjoy the simplicity of training them at scale.

Learn more at GitHub.

Canonical Presents Minimal Ubuntu

A new version of Ubuntu built for servers, Docker-based isolated containers and cloud systems
12 July 2018

The Ubuntu team has presented a stripped-down version of the base image, Minimal Ubuntu. It is designed for servers, Docker-based isolated containers and cloud systems. The release is aimed at high performance, minimal boot time and automated deployment of applications in the cloud.

The small footprint of Minimal Ubuntu, when deployed with fast VM provisioning from GCE, helps deliver drastically improved boot times, making them a great choice for developers looking to build their applications on Google Cloud Platform.
 

Paul Nash

Group Product Manager, Google Cloud

The authors of the project emphasize the size of the distribution, which "weighs" 157 MB. It supports the main cloud and virtualization platforms, including Amazon EC2, Google Compute Engine (GCE), LXD and KVM / OpenStack, each of which gets its own optimized version of the image. In addition, an image based on the OS is available for running Docker containers, and it is compatible with Kubernetes.

Minimal Ubuntu is designed for automated execution, so it includes only a minimal set of tools. The distribution can be upgraded to the Ubuntu Server package set with the special utility "unminimize", which restores the components that are convenient for interactive management.

According to Canonical representatives, removing the interactive management tools made boot 40% faster and cut the occupied disk space by 50%. At the same time, the release remains fully compatible with all packages from the standard Ubuntu repositories. Required packages can be installed with the standard package manager apt or with snapd, both of which are included in the distribution by default.

Two builds are available for download, based on Ubuntu 16.04 LTS and 18.04 LTS. You can download them from the official website.