RCCL

ROCm Communication Collectives Library

Introduction

RCCL (pronounced "rickle") is an implementation of MPI-style communication APIs on ROCm-enabled GPUs. It is a collective communication library aimed at providing low-latency, high-bandwidth communication on dense GPU systems. RCCL launches special-purpose compute kernels that perform parallel, overlapping transfers, distributing the work and exchanging data among participating peer-accessible GPUs arranged in a logical ring within a single multi-GPU node.

Supported APIs

  1. AllReduce
  2. Broadcast
  3. Reduce

Requirements

  1. ROCm-supported GPUs
  2. ROCm stack installed on the system (HCC)

Build

RCCL depends directly on the HIP runtime and the HCC C++ compiler, both of which are part of the ROCm software stack.

git clone https://github.com/ROCmSoftwarePlatform/rccl.git
mkdir rccl_build
cd rccl_build
CXX=/opt/rocm/bin/hcc cmake ../rccl
make package
sudo dpkg -i *.deb

Installing the package requires sudo/root access because it creates a directory called "rccl" under /opt/rocm/. Installation is optional: RCCL can also be used directly from the build tree by pointing the compiler and loader at the directory containing librccl.so, as sketched below.
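For example, an application can be built and run directly against the build tree without installing. This is a minimal sketch; the application name and relative paths are illustrative and assume the public header lives in ../rccl/inc and librccl.so was produced in the current rccl_build directory:

# Compile and link against the freshly built library (illustrative paths).
/opt/rocm/bin/hipcc my_app.cpp -I../rccl/inc -L. -lrccl -o my_app

# Point the loader at the build directory so it can find librccl.so.
LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH ./my_app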

Run

The RCCL library install directory should be added to LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=/opt/rocm/rccl/lib:$LD_LIBRARY_PATH

Usage

See rccl/samples/allreduce.cpp for a complete example.
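A minimal single-process, multi-GPU all-reduce sketch follows. It is modeled on the NCCL-style interface the sample suggests; the RCCL identifiers used here (rcclComm_t, rcclCommInitAll, rcclAllReduce, rcclFloat, rcclSum, rcclCommDestroy) are assumptions, so check rccl.h for the exact names and signatures.

// Minimal single-process, multi-GPU all-reduce sketch.
// NOTE: the rccl* identifiers below are assumed from an NCCL-style API and
// may differ from the shipped rccl.h; the HIP calls are standard.
#include <hip/hip_runtime.h>
#include <rccl.h>
#include <vector>

int main() {
    int ndev = 0;
    hipGetDeviceCount(&ndev);

    std::vector<int> devs(ndev);
    for (int i = 0; i < ndev; ++i) devs[i] = i;

    // One communicator per participating GPU in the logical ring.
    std::vector<rcclComm_t> comms(ndev);
    rcclCommInitAll(comms.data(), ndev, devs.data());

    const int count = 1 << 20;
    std::vector<float*> send(ndev), recv(ndev);
    std::vector<hipStream_t> streams(ndev);

    for (int i = 0; i < ndev; ++i) {
        hipSetDevice(i);
        hipMalloc(reinterpret_cast<void**>(&send[i]), count * sizeof(float));
        hipMalloc(reinterpret_cast<void**>(&recv[i]), count * sizeof(float));
        hipStreamCreate(&streams[i]);
        // ... fill send[i] with this GPU's contribution ...
    }

    // Launch the all-reduce on every device; the library performs the
    // overlapping ring transfers on the given streams.
    for (int i = 0; i < ndev; ++i) {
        hipSetDevice(i);
        rcclAllReduce(send[i], recv[i], count, rcclFloat, rcclSum,
                      comms[i], streams[i]);
    }

    // Wait for completion and clean up.
    for (int i = 0; i < ndev; ++i) {
        hipSetDevice(i);
        hipStreamSynchronize(streams[i]);
        hipFree(send[i]);
        hipFree(recv[i]);
        hipStreamDestroy(streams[i]);
        rcclCommDestroy(comms[i]);
    }
    return 0;
}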

Source Layout

  • inc - contains the public RCCL header exposing the RCCL interfaces
  • src - contains source code for the implementation of the RCCL APIs
  • tests - contains unit test cases to validate RCCL

Source Naming

The RCCL library consists of two primary layers:

Interface layer

  • rccl.h - C99 APIs as defined by the RCCL library.
  • rccl.cpp - The interface layer implementation; each public API wraps a call to the corresponding primitive-specific C++ template function (see the toy sketch below).
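Purely as a toy, self-contained illustration of that split (none of the names below appear in the RCCL sources), the pattern is a C-style entry point that inspects the runtime data type and forwards to a type-templated worker:

// Toy illustration of a C interface dispatching to C++ templates; it mimics
// the rccl.cpp / rccl{Primitive}Runtime.h split but is not RCCL source.
#include <cstdio>

enum ToyDataType { toyFloat, toyInt32 };  // stands in for the RCCL data-type enum

template <typename T>
int toyInternalReduceSum(const T* in, T* out, int count) {
    // Stands in for a primitive-specific template that would enqueue kernels.
    T acc = T(0);
    for (int i = 0; i < count; ++i) acc += in[i];
    *out = acc;
    return 0;
}

// Stands in for the C99-facing wrapper: it only switches on the runtime data
// type and forwards to the matching template instantiation.
int toyReduceSum(const void* in, void* out, int count, ToyDataType type) {
    switch (type) {
    case toyFloat:
        return toyInternalReduceSum(static_cast<const float*>(in),
                                    static_cast<float*>(out), count);
    case toyInt32:
        return toyInternalReduceSum(static_cast<const int*>(in),
                                    static_cast<int*>(out), count);
    }
    return -1;  // unsupported type
}

int main() {
    float in[4] = {1.f, 2.f, 3.f, 4.f}, out = 0.f;
    toyReduceSum(in, &out, 4, toyFloat);
    std::printf("sum = %g\n", out);  // prints: sum = 10
    return 0;
}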

RCCL primitive-specific implementations and kernels

  • rcclDataTypes.h
  • rcclTracker.h
  • rccl{Primitive}Runtime.h
  • rccl{Primitive}Kernels.h
  • rcclKernelHelper.h

Caveats

The initial implementations of the distributed broadcast and all-reduce designs focus on functionality and correctness and are not yet tuned for optimal performance at any specific input size or GPU count. Better strategies for choosing chunk sizes that allow transfers to overlap and pipeline more effectively are being explored.