Horovod on GPU

To use Horovod on GPU, read the options below and choose the one that best matches your setup.

Have GPUs?

In most situations, using NCCL 2 will significantly improve performance over the CPU version. NCCL 2 provides the allreduce operation optimized for NVIDIA GPUs and a variety of networking devices, such as RoCE or InfiniBand.

  1. Install NCCL 2.

Steps to install NCCL 2 are listed here.

If you have installed NCCL 2 using the nccl-<version>.txz package, you should add the library path to the LD_LIBRARY_PATH environment variable or register it in /etc/ld.so.conf.

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nccl-<version>/lib

  2. Install Open MPI or another MPI implementation.

Steps to install Open MPI are listed here.

  3. Install the horovod pip package.

If you have installed NCCL 2 using the nccl-<version>.txz package, you should specify the path to NCCL 2 using the HOROVOD_NCCL_HOME environment variable.

$ HOROVOD_NCCL_HOME=/usr/local/nccl-<version> HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod

If you have installed NCCL 2 using the Ubuntu package, you can simply run:

$ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
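Before running the pip command, it can help to confirm that the dynamic loader actually finds NCCL with the current LD_LIBRARY_PATH or ld.so.conf settings. The following is a minimal sketch, not part of Horovod: the helper name `can_load_shared_library` is hypothetical, and the soname `libnccl.so.2` is an assumption about how NCCL 2 names its shared library on Linux.

```python
import ctypes


def can_load_shared_library(soname):
    """Return True if the dynamic loader can open `soname`, else False.

    The loader consults LD_LIBRARY_PATH as it was set when this Python
    process started, plus the system cache built from /etc/ld.so.conf.
    """
    try:
        ctypes.CDLL(soname)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # Assumption: NCCL 2 installs its shared library as libnccl.so.2.
    print("libnccl.so.2 loadable:", can_load_shared_library("libnccl.so.2"))
```

If this prints `False`, re-check the export line from step 1 in the same shell session before installing Horovod.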

Note: Some models with a high computation-to-communication ratio benefit from doing allreduce on CPU, even if a GPU version is available. To force allreduce to happen on CPU, pass device_dense='/cpu:0' to hvd.DistributedOptimizer:

opt = hvd.DistributedOptimizer(opt, device_dense='/cpu:0')

Advanced: Have GPUs and networking with GPUDirect?

GPUDirect allows GPUs to transfer memory among each other without CPU involvement, which significantly reduces latency and load on the CPU. NCCL 2 uses GPUDirect automatically for the allreduce operation when it detects it.

Additionally, Horovod uses allgather and broadcast operations from MPI. They are used for averaging sparse tensors that are typically used for embeddings, and for broadcasting initial state. To speed these operations up with GPUDirect, make sure your MPI implementation supports CUDA and add HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI to the pip command.

  1. Install NCCL 2.

Steps to install NCCL 2 are listed here.

If you have installed NCCL 2 using the nccl-<version>.txz package, you should add the library path to the LD_LIBRARY_PATH environment variable or register it in /etc/ld.so.conf.

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nccl-<version>/lib

  2. Install nv_peer_memory driver.

Follow the instructions on that page, and make sure to run /etc/init.d/nv_peer_mem start at the end.

  3. Install Open MPI or another MPI implementation with CUDA support.

Steps to install Open MPI are listed here. You should make sure you build it with CUDA support.

  4. Install the horovod pip package.

If you have installed NCCL 2 using the nccl-<version>.txz package, you should specify the path to NCCL 2 using the HOROVOD_NCCL_HOME environment variable.

$ HOROVOD_NCCL_HOME=/usr/local/nccl-<version> HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip install --no-cache-dir horovod

If you have installed NCCL 2 using the Ubuntu package, you can simply run:

$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip install --no-cache-dir horovod
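The install commands in this document differ only in which build-time environment variables they set. As a rough sketch of how the variants relate, the helper below assembles that environment for a scripted install. The function name `horovod_build_env` and its parameters are hypothetical; only the variable names and the pip arguments come from the commands in this document.

```python
import os
import sys


def horovod_build_env(nccl_home=None, allreduce="NCCL",
                      allgather=None, broadcast=None, base_env=None):
    """Return a copy of the environment with Horovod build variables set.

    Arguments left as None are not added, which mirrors leaving the
    variable off the pip command line.
    """
    env = dict(os.environ if base_env is None else base_env)
    if nccl_home is not None:
        # Only needed when NCCL 2 was installed from the .txz package.
        env["HOROVOD_NCCL_HOME"] = nccl_home
    settings = (
        ("HOROVOD_GPU_ALLREDUCE", allreduce),
        ("HOROVOD_GPU_ALLGATHER", allgather),
        ("HOROVOD_GPU_BROADCAST", broadcast),
    )
    for name, value in settings:
        if value is not None:
            env[name] = value
    return env


# Same arguments as the pip commands in this document.
PIP_COMMAND = [sys.executable, "-m", "pip", "install", "--no-cache-dir", "horovod"]

if __name__ == "__main__":
    # GPUDirect variant: NCCL allreduce, MPI allgather and broadcast.
    env = horovod_build_env(allgather="MPI", broadcast="MPI", base_env={})
    for name in sorted(env):
        print(f"{name}={env[name]}")
    # To actually install, one could run:
    #   import subprocess
    #   subprocess.run(PIP_COMMAND, env=horovod_build_env(allgather="MPI", broadcast="MPI"), check=True)
```

Calling `horovod_build_env()` with no arguments corresponds to the basic NCCL install; passing `allreduce="MPI", allgather="MPI", broadcast="MPI"` corresponds to a pure MPI build.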

Note: Allgather allocates an output tensor whose size is proportional to the number of processes participating in the training. If you find yourself running out of GPU memory, you can force allgather to happen on CPU by passing device_sparse='/cpu:0' to hvd.DistributedOptimizer:

opt = hvd.DistributedOptimizer(opt, device_sparse='/cpu:0')
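To get a feel for how the allgather output grows with the number of processes, here is a back-of-the-envelope sketch. It assumes each process contributes a 2-D slice of sparse-gradient rows and that the gathered output simply concatenates those slices. The row count, row width, and 4-byte element size are hypothetical illustration values, not measurements from Horovod.

```python
def allgather_output_bytes(rows_per_rank, row_width, bytes_per_element=4):
    """Estimate the size of the gathered tensor in bytes.

    rows_per_rank: list with the number of rows each process contributes.
    The output holds every process's rows, so it grows with the process count.
    """
    total_rows = sum(rows_per_rank)
    return total_rows * row_width * bytes_per_element


if __name__ == "__main__":
    per_rank_rows = 10_000  # hypothetical rows contributed by each process
    width = 512             # hypothetical embedding width
    for num_procs in (8, 64, 512):
        size = allgather_output_bytes([per_rank_rows] * num_procs, width)
        print(f"{num_procs:4d} processes -> {size / 2**20:.1f} MiB per process")
```

Because every process allocates the full gathered tensor, doubling the number of processes doubles this allocation on each GPU, which is where moving allgather to CPU can help.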

Advanced: Have MPI optimized for your network?

If you happen to have network hardware not supported by NCCL 2 or your MPI vendor's implementation on GPU is faster, you can also use the pure MPI version of allreduce, allgather and broadcast on GPU.

  1. Make sure your MPI implementation is installed.

  2. Install the horovod pip package.

$ HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip install --no-cache-dir horovod
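After any of the installs above, it is worth checking which backends the installed package was built with. The sketch below assumes a Horovod release that exposes the `mpi_built()` and `nccl_built()` helpers on the framework module. Older releases may not have them, so the code falls back to `None` instead of failing. The function name `horovod_build_report` is hypothetical.

```python
import importlib


def horovod_build_report(module_name="horovod.tensorflow"):
    """Return a dict of build flags, or None if Horovod cannot be imported.

    Assumption: the module exposes mpi_built() and nccl_built(); a helper
    that is missing is reported as None rather than raising.
    """
    try:
        hvd = importlib.import_module(module_name)
    except ImportError:
        return None
    report = {}
    for name in ("mpi_built", "nccl_built"):
        check = getattr(hvd, name, None)  # may be absent in older releases
        report[name] = check() if callable(check) else None
    return report


if __name__ == "__main__":
    report = horovod_build_report()
    if report is None:
        print("Horovod is not importable in this environment.")
    else:
        for name, value in report.items():
            print(f"{name}: {value}")
```

For a pure MPI build, one would expect the MPI flag to be truthy. The exact return types can differ between releases, so treat the values as truthy or falsy rather than comparing them to specific constants.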