Habana® Gaudi® AI Processor (HPU) training processors are built on a heterogeneous architecture comprising a cluster of fully programmable Tensor Processing Cores (TPC) and a configurable Matrix Math engine, along with associated development tools and libraries.
The Graphcore Intelligence Processing Unit (IPU), built for Artificial Intelligence and Machine Learning, consists of many individual cores, called tiles, allowing highly parallel computation. Thanks to the high bandwidth between tiles, IPUs suit machine learning workloads where parallelization is essential. Because computation is heavily parallelized, the IPU performs best on workloads that can be broken into many small, independent tasks.
Colossal-AI implements ZeRO-DP with chunk-based memory management. With this chunk mechanism, very large models can be trained with a small number of GPUs.
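To make the chunk idea concrete, here is a minimal, hypothetical sketch (plain Python, not the real Colossal-AI API): many small parameter tensors are greedily packed into fixed-size chunks, so data movement between GPU and CPU can then happen at chunk granularity instead of per-tensor. The names and sizes are illustrative assumptions.

```python
CHUNK_SIZE = 1024  # elements per chunk (illustrative value)

class Chunk:
    """A fixed-size buffer holding several small tensors back to back."""
    def __init__(self, size):
        self.size = size
        self.used = 0
        self.tensors = {}    # name -> (offset, num_elements)
        self.device = "cpu"  # chunks start offloaded; moved as a unit

    def can_fit(self, length):
        return self.used + length <= self.size

    def add(self, name, length):
        self.tensors[name] = (self.used, length)
        self.used += length

def pack(params, chunk_size=CHUNK_SIZE):
    """Greedily pack named tensors (name, num_elements) into chunks."""
    chunks = [Chunk(chunk_size)]
    for name, length in params:
        if not chunks[-1].can_fit(length):
            chunks.append(Chunk(chunk_size))
        chunks[-1].add(name, length)
    return chunks

# Hypothetical model: four parameter tensors of varying size.
params = [("w1", 600), ("b1", 300), ("w2", 700), ("b2", 200)]
chunks = pack(params)
# w1 and b1 share chunk 0 (900 <= 1024); w2 opens chunk 1 and b2 joins it.
```

Managing memory at this coarser granularity is what lets a runtime overlap transfers and keep only the chunks needed for the current computation on the GPU.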
Hivemind enables collaborative training: it addresses the need for top-tier multi-GPU servers by allowing you to train across unreliable machines, such as local desktops or even preemptible cloud instances, over the internet.
Bagua is a deep learning training acceleration framework that supports multiple advanced distributed training algorithms.
Horovod allows the same training script to be used for single-GPU, multi-GPU, and multi-node training.
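As an illustration, a hypothetical `train.py` written against Horovod's API can be scaled purely by changing the launcher invocation; `horovodrun` with its documented `-np` (number of processes) and `-H` (host list) flags handles the rest. The script name and server hostnames below are placeholders.

```shell
# Single process: run the script directly, no launcher needed
python train.py

# 4 GPUs on one machine: horovodrun spawns 4 worker processes
horovodrun -np 4 python train.py

# 2 machines with 4 GPUs each (8 workers total); hostnames are placeholders
horovodrun -np 8 -H server1:4,server2:4 python train.py
```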