Flexible, easy-to-use, opinionated
dmlcloud is a library for distributed training of deep learning models with torch. Unlike similar frameworks, dmlcloud adds as little additional complexity and abstraction as possible. It is tailored towards a carefully selected set of libraries and workflows.
pip install dmlcloud
- Easy initialization of `torch.distributed` (supports Slurm and MPI).
- Simple, yet powerful, API. No unnecessary abstractions and complications.
- Checkpointing and metric tracking (distributed)
- Extensive logging and diagnostics out of the box, which greatly improve reproducibility and traceability.
- A wealth of useful utility functions required for distributed training (e.g. for data set sharding)
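
To make the ideas behind initialization and data set sharding concrete, here is a minimal sketch in plain Python. It is not dmlcloud's API: `detect_rank_and_world_size` and `shard_indices` are hypothetical helper names chosen for illustration. The sketch assumes the process rank and world size can be read from the standard Slurm variables (`SLURM_PROCID`, `SLURM_NTASKS`) or the Open MPI variables (`OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_SIZE`), and that a simple strided split of sample indices is acceptable.

```python
import os


def detect_rank_and_world_size(env=None):
    """Hypothetical helper (not part of dmlcloud): read rank and world size
    from Slurm or Open MPI environment variables, falling back to a single
    process when neither launcher is detected."""
    env = os.environ if env is None else env
    if "SLURM_PROCID" in env and "SLURM_NTASKS" in env:
        return int(env["SLURM_PROCID"]), int(env["SLURM_NTASKS"])
    if "OMPI_COMM_WORLD_RANK" in env and "OMPI_COMM_WORLD_SIZE" in env:
        return int(env["OMPI_COMM_WORLD_RANK"]), int(env["OMPI_COMM_WORLD_SIZE"])
    return 0, 1


def shard_indices(num_samples, rank, world_size):
    """Hypothetical helper (not part of dmlcloud): return the sample indices
    assigned to `rank` using a strided split, so shards are disjoint and
    together cover every sample."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return list(range(rank, num_samples, world_size))


if __name__ == "__main__":
    # Simulate a Slurm job with 4 tasks; this process is task 1.
    fake_env = {"SLURM_PROCID": "1", "SLURM_NTASKS": "4"}
    rank, world_size = detect_rank_and_world_size(fake_env)
    print(rank, world_size, shard_indices(10, rank, world_size))
```

With a strided split, shard sizes differ by at most one sample. Real training code often pads or drops samples so that every rank processes the same number of batches.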