This example adds an integration of DeepSpeed, a distributed training library, with Kubeflow to the main mpi-operator examples. The goal is to improve the efficiency and performance of distributed training jobs by combining the capabilities of DeepSpeed and MPI.
Comments in the configuration explain the use of taints and tolerations in the Kubernetes manifest to ensure that DeepSpeed worker pods are scheduled onto nodes with specific resources, such as GPUs; see the sketch below.
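For illustration, the following sketch shows how such a toleration might appear on the Worker replica spec of an MPIJob. The taint key (`gpu-node-taint`), image name, replica count, and GPU limit are placeholders for this sketch, not values taken from the actual manifest in this example.

```yaml
# Illustrative sketch only: a Worker replica spec that tolerates a
# hypothetical GPU taint so DeepSpeed workers land on GPU nodes.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: deepspeed-example
spec:
  mpiReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          tolerations:
            # Allow scheduling onto nodes tainted for GPU workloads.
            - key: "gpu-node-taint"   # hypothetical taint key
              operator: "Equal"
              value: "true"
              effect: "NoSchedule"
          containers:
            - name: deepspeed-worker
              image: deepspeed-example-image   # placeholder image name
              resources:
                limits:
                  nvidia.com/gpu: 1   # request one GPU per worker
```

Pairing the toleration with a GPU resource limit lets the scheduler place workers only on nodes that both carry the taint and expose the requested GPU capacity.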