KubeCon Proposal
ElasticDL: A Kubernetes-native Deep Learning Framework
Kubernetes and Machine Learning
In this talk, we present ElasticDL, a Kubernetes-native deep learning framework built on top of TensorFlow 2.0. By making use of the priority-based preemption feature of Kubernetes, ElasticDL implements fault-tolerance and elastic scheduling of deep learning jobs.
Currently, users who start TensorFlow jobs on Kubernetes use Kubeflow, which provides Kubernetes operators to tell each pod in a job the IP addresses of its peers. This approach offers neither fault-tolerance nor high utilization. Suppose that on a cluster of N nodes, a running job occupies N/2+1 nodes; a new job requiring N/2 nodes has to wait due to the lack of one node, and the overall utilization stays at about 50%. With elastic scheduling, the new job can start with N/2-1 nodes first and grow later, boosting utilization to 100%.
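To make the arithmetic concrete, here is a toy sketch assuming a hypothetical cluster of N = 10 nodes; the numbers mirror the scenario above and are not taken from a real scheduler.

```python
# Toy utilization arithmetic for the N = 10 example above.
N = 10
running_job = N // 2 + 1  # 6 nodes held by the running job

# Gang scheduling: the new job needs all N/2 = 5 nodes or none at all,
# so it waits because only 4 nodes are free.
gang_new_job = N // 2 if N - running_job >= N // 2 else 0
print("gang scheduling utilization:",
      (running_job + gang_new_job) / N)      # 0.6

# Elastic scheduling: the new job starts with whatever is free
# (N/2 - 1 = 4 nodes) and grows when nodes are released.
elastic_new_job = min(N // 2, N - running_job)
print("elastic scheduling utilization:",
      (running_job + elastic_new_job) / N)   # 1.0
```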
ElasticDL builds elastic scheduling on top of priority-based preemption, so that production jobs do not suffer from preemption while users can run lower-priority jobs to use the cluster fully.
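As an illustration, the snippet below sketches how a pod of a low-priority job might be declared with the official Kubernetes Python client. The priority class name and image are hypothetical, not ElasticDL's actual configuration; the cluster operator would create the corresponding PriorityClass object beforehand.

```python
from kubernetes import client

# Hypothetical priority class name; pods of production jobs would carry a
# higher-priority class and may preempt pods like this one.
LOW_PRIORITY = "low-priority-batch"

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="elasticdl-worker-0"),
    spec=client.V1PodSpec(
        priority_class_name=LOW_PRIORITY,
        restart_policy="Never",
        containers=[
            client.V1Container(name="worker", image="elasticdl:latest"),
        ],
    ),
)
# Submitting it would look like:
#   client.CoreV1Api().create_namespaced_pod("default", pod)
```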
The key to elastic scheduling is fault-tolerance: it ensures that a job can recover from the preemption or failure of some of its pods without relying on checkpoints. ElasticDL exploits the properties of deep learning to realize fault-tolerance. For a large model that has big embedding tables and needs many parameter server pods to host, ElasticDL uses the asynchronous SGD algorithm, which doesn't rely on a fixed number of worker pods. For models like image and speech recognition, ElasticDL reimplements AllReduce in a Kubernetes-native and fault-tolerant way.
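The following minimal sketch shows why asynchronous SGD tolerates a varying number of workers: the parameter server applies each gradient as it arrives and never waits for a fixed set of peers. All names here are illustrative, not ElasticDL's actual API.

```python
import numpy as np

class ParameterServer:
    """Applies gradients as they arrive; it never waits for a fixed
    set of workers, so workers may join or leave at any time."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()

    def push(self, grad):
        self.w -= self.lr * grad

def worker_step(ps, x, y):
    # Each worker pulls the latest parameters, computes a gradient on
    # its own minibatch, and pushes it back independently.
    w = ps.pull()
    grad = 2 * x * (np.dot(w, x) - y)  # gradient of squared error
    ps.push(grad)

# Toy run on a linear model: the loop stands in for concurrent workers.
# Losing one worker merely slows training down instead of aborting the job.
ps = ParameterServer(dim=2)
data = [(np.array([1.0, 2.0]), 3.0), (np.array([2.0, 1.0]), 3.0)]
for step in range(100):
    for x, y in data:
        worker_step(ps, x, y)
print(ps.w)  # approaches [1.0, 1.0]
```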
ElasticDL works with TensorFlow Eager Mode. It can be extended to work with other deep learning frameworks, including PyTorch and MXNet.
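For reference, a minimal TensorFlow 2.0 eager-mode training step looks like the sketch below. Because the forward and backward pass is an ordinary Python function call rather than a static graph, a scheduler can invoke it on whatever workers happen to be alive; the model and data here are placeholders.

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

def train_step(features, labels):
    # Eager mode: gradients are computed with GradientTape inside a plain
    # function, so the step can run on any live worker with any minibatch.
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(features) - labels))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

x = tf.random.normal([8, 4])
y = tf.random.normal([8, 1])
print(float(train_step(x, y)))
```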
Dual Presentation: 35 minutes, 2 speakers presenting on a topic