ElasticDL code overall review discussion

Bright edited this page Sep 26, 2019 · 11 revisions

20190924

Need further discussion

A common scenario is a task list that contains several successive training tasks followed by some evaluation tasks. The dataset uses the prefetch function, so while a worker is handling the last training task in its sublist, prefetching pulls all of the following evaluation tasks into that same worker.
Proposal: Use two separate datasets for training and evaluation, to keep the whole process simple and clean. Discuss it in the next meeting.
How to do early stopping? Early stopping needs both the training and the evaluation metrics to make the decision.
Proposal: Consider letting the master control early stopping. Open questions: how to write a hook or callback to get these metric values, and how to insert the hook into the training loop.
Refactor the evaluation process.
Each time an ElasticDL command is executed, the client builds a new Docker image. Support reusing an existing image.
Proposal: Separate the Docker image from the user-defined model code.
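On the early stopping question above, a minimal sketch of what a master-side check could look like, assuming the master receives one aggregated evaluation loss per evaluation round. The class `EarlyStopMonitor`, its `report` method, and the `patience` and `min_delta` parameters are hypothetical names for illustration, not existing ElasticDL APIs.

```python
class EarlyStopMonitor:
    """Hypothetical master-side monitor; not part of ElasticDL."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # evaluation rounds tolerated without improvement
        self.min_delta = min_delta  # minimum decrease that counts as improvement
        self.best = None            # best evaluation loss seen so far
        self.bad_rounds = 0         # consecutive rounds without improvement

    def report(self, eval_loss):
        """Record one evaluation result; return True when training should stop."""
        if self.best is None or eval_loss < self.best - self.min_delta:
            self.best = eval_loss
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience


# Usage: the master would call report() after each evaluation job finishes.
monitor = EarlyStopMonitor(patience=2)
stop_at = None
for round_id, loss in enumerate([0.9, 0.7, 0.71, 0.72, 0.5]):
    if monitor.report(loss):
        stop_at = round_id  # the master would stop dispatching training tasks here
        break
```

Keeping the decision in the master means workers only need to report metrics, and the hook into the training loop reduces to "stop handing out tasks". How the metrics reach the master is still the open question noted above.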

Include in next plan

Support failover of the EmbeddingService Redis cluster. At present, it is a single point of failure.
Proposal: Redis is a temporary solution; the next plan is to use multiple parameter servers (PS).

Performance issue

During training and evaluation, we call GetModel to fetch the whole model whenever the model_version has been updated, so the RPC payload is large.
In the sync SGD scenario, the N workers do not all run at exactly the same speed. The larger N is, the more workers end up holding an expired model version, and the more computation is wasted on out-of-date models.
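The stale-version effect can be illustrated with a small simulation. This is a simplified model, not ElasticDL's actual master code: the class `ToySyncMaster`, the method `report_gradient`, and the parameter `grads_to_wait` are names assumed for this sketch. The toy master accepts a gradient only if it was computed against the current model version, and bumps the version once enough gradients have arrived.

```python
class ToySyncMaster:
    """Simplified sync SGD master, for illustration only."""

    def __init__(self, grads_to_wait):
        self.grads_to_wait = grads_to_wait  # gradients needed per model update
        self.version = 0                    # current model version
        self.pending = []                   # accepted gradients for this version
        self.rejected = 0                   # gradients computed on stale versions

    def report_gradient(self, worker_version, grad):
        """Return True if the gradient was accepted, False if it was stale."""
        if worker_version != self.version:
            self.rejected += 1  # the worker's computation is wasted
            return False
        self.pending.append(grad)
        if len(self.pending) == self.grads_to_wait:
            # A real master would average and apply the gradients here.
            self.pending.clear()
            self.version += 1
        return True


# Four workers all pulled version 0, but the model updates after two gradients.
master = ToySyncMaster(grads_to_wait=2)
results = [master.report_gradient(worker_version=0, grad=1.0) for _ in range(4)]
```

In this run the two faster workers trigger the update to version 1, and the two slower workers have their gradients rejected. With more workers per update step, the share of rejected work grows, which is the out-of-date computation described above.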