To use Horovod with the Intel(R) oneAPI Collective Communications Library (oneCCL), follow the steps below.
- Install oneCCL.
To install oneCCL, follow these steps.
Source setvars.sh to start using oneCCL:
source <install_dir>/env/setvars.sh
- Set the HOROVOD_CPU_OPERATIONS variable:
export HOROVOD_CPU_OPERATIONS=CCL
- Install Horovod from source code
python setup.py build
python setup.py install
or via pip
pip install horovod
You can specify the affinity for the Horovod background thread with the HOROVOD_THREAD_AFFINITY environment variable. See the instructions below.
Set the Horovod background thread affinity according to the following rule: if there are N Horovod processes per node, this variable should contain one value for each local process, separated by commas:
export HOROVOD_THREAD_AFFINITY=c0,c1,...,c(N-1)
where c0,...,c(N-1) are the core IDs to which the background threads of the local processes are pinned.
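As an illustration of the rule above, the following sketch joins a list of core IDs with commas and checks that there is exactly one ID per local process. The helper name horovod_thread_affinity and the core IDs 53 and 71 are hypothetical and chosen only for this example; the helper is not part of Horovod.

```shell
# Hypothetical helper (not part of Horovod): build a HOROVOD_THREAD_AFFINITY
# value from one core ID per local Horovod process.
# Usage: horovod_thread_affinity <processes_per_node> <core_id>...
horovod_thread_affinity() {
    local ppn="$1"
    shift
    # The rule requires exactly one core ID per local process.
    if [ "$#" -ne "$ppn" ]; then
        echo "expected $ppn core IDs, got $#" >&2
        return 1
    fi
    # "$*" joins the remaining arguments with the first character of IFS.
    local IFS=,
    echo "$*"
}

# Example: 2 processes per node, background threads pinned to cores 53 and 71.
HOROVOD_THREAD_AFFINITY=$(horovod_thread_affinity 2 53 71)
export HOROVOD_THREAD_AFFINITY
echo "HOROVOD_THREAD_AFFINITY=$HOROVOD_THREAD_AFFINITY"
```

Validating the count before exporting catches the common mistake of changing the number of processes per node without updating the affinity list.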
Set the number of oneCCL workers:
export CCL_WORKER_COUNT=X
where X is the number of oneCCL worker threads (workers) per process you'd like to dedicate to drive communication.
Set oneCCL workers affinity automatically:
export CCL_WORKER_AFFINITY=auto
This is the default mode. The exact core IDs depend on the process launcher used.
Set oneCCL workers affinity explicitly:
export CCL_WORKER_AFFINITY=c0,c1,..,c(X-1)
where c0,c1,..,c(X-1) are the core IDs dedicated to local oneCCL workers. Here X is the total number of oneCCL workers on the node, i.e. X = CCL_WORKER_COUNT * number of processes per node.
Please refer to Execution of Communication Operations for more information.
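The sizing rule for an explicit worker affinity list can be checked before launching a job. The sketch below is a minimal illustration, assuming the list is a plain comma-separated set of core IDs; the helper name check_worker_affinity is hypothetical and is not provided by oneCCL or Horovod.

```shell
# Hypothetical check (not part of oneCCL): verify that an explicit
# CCL_WORKER_AFFINITY list holds CCL_WORKER_COUNT * processes-per-node core IDs.
# Usage: check_worker_affinity <worker_count> <processes_per_node> <affinity_list>
check_worker_affinity() {
    local worker_count="$1"
    local ppn="$2"
    local affinity="$3"
    local expected=$((worker_count * ppn))
    # Split the list on commas and count the resulting entries.
    local IFS=,
    set -- $affinity
    if [ "$#" -ne "$expected" ]; then
        echo "CCL_WORKER_AFFINITY has $# core IDs, expected $expected" >&2
        return 1
    fi
    echo "ok: $# core IDs"
}

# Example: 2 workers per process and 2 processes per node need 4 core IDs.
check_worker_affinity 2 2 "16,17,34,35"
```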
For example, suppose we have 2 nodes and each node has 2 sockets: socket0 with CPUs 0-17,36-53 and socket1 with CPUs 18-35,54-71. We dedicate the last two cores of each socket to 2 oneCCL workers and pin the Horovod background thread to one of the hyper-thread siblings of the oneCCL workers' cores. All these cores are excluded from Intel MPI pinning using I_MPI_PIN_PROCESSOR_EXCLUDE_LIST to dedicate them to oneCCL and Horovod tasks only, thus avoiding conflicts with the framework's computational threads:
export CCL_WORKER_COUNT=2
export CCL_WORKER_AFFINITY="16,17,34,35"
export HOROVOD_THREAD_AFFINITY="53,71"
export I_MPI_PIN_DOMAIN=socket
export I_MPI_PIN_PROCESSOR_EXCLUDE_LIST="16,17,34,35,52,53,70,71"
mpirun -n 4 -ppn 2 -hostfile hosts python ./run_example.py
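The core lists in this example can be derived rather than typed by hand. The sketch below assumes, based on the CPU ranges given for this example topology, that the hyper-thread sibling of core c is c + 36; that offset is an assumption and differs between machines. The helper name ht_siblings and the variables HT_OFFSET, workers, siblings and exclude are hypothetical.

```shell
# Assumption from the example topology (socket0: 0-17,36-53; socket1: 18-35,54-71):
# the hyper-thread sibling of core c is c + 36. Adjust for other machines.
HT_OFFSET=36

# Hypothetical helper: print the hyper-thread siblings of a comma-separated core list.
# Usage: ht_siblings <offset> <core_list>
ht_siblings() {
    local offset="$1"
    local out=""
    local c
    local IFS=,
    for c in $2; do
        # Append with a comma only when out is already non-empty.
        out="${out:+$out,}$((c + offset))"
    done
    echo "$out"
}

workers="16,17,34,35"                            # last two cores of each socket
siblings=$(ht_siblings "$HT_OFFSET" "$workers")  # hyper-thread siblings of the worker cores
exclude="$workers,$siblings"                     # candidate I_MPI_PIN_PROCESSOR_EXCLUDE_LIST value

echo "siblings: $siblings"
echo "exclude:  $exclude"
```

For the topology above this reproduces the exclude list used in the example, and the two HOROVOD_THREAD_AFFINITY cores (53 and 71) are the siblings of the last worker core on each socket.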
Set cache hint for oneCCL operations:
export HOROVOD_CCL_CACHE=0|1
Currently available for allreduce only. Disabled by default.
Please refer to Caching of Communication Operations for more information.