Suggested Config.py settings for a DGX-1 #5

Open

ProgramItUp opened this issue Jan 26, 2017 · 3 comments

Comments

ProgramItUp commented Jan 26, 2017

After running _train.sh with the default Config.py on a DGX-1 for about an hour, I see that CPU usage stays fairly constant at about 15%, and one GPU is being used at about 40%.

The settings in Config.py are unchanged (DYNAMIC_SETTINGS = True). The number of trainers varies between 2 and 6, the number of predictors between 1 and 2, and the number of agents between 34 and 39. I would have expected these to grow until they use the available CPU resources.

  1. Are there settings that will better leverage the cores on a DGX-1? (See the Config.py sketch after this list.)
  2. It looks like the code in NetworkVP.py is written for a single GPU. With TensorFlow's support for multiple GPUs, do you have plans to add multi-GPU support? On the surface it seems pretty easy to add:
for d in ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']:
    with tf.device(d):
        # ... per-GPU calculations here ...
        pass
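For question 1, a minimal sketch of the kind of static override one could experiment with. The knob names (DYNAMIC_SETTINGS, AGENTS, PREDICTORS, TRAINERS) are assumed to match those in this repository's Config.py, and the values are illustrative starting points for a sweep, not recommendations.

```python
# Hypothetical Config.py overrides for a many-core machine such as a DGX-1.
# Verify that these names exist in the shipped Config.py before editing.

# Turn off dynamic adjustment so the process counts stay fixed for the run.
DYNAMIC_SETTINGS = False

# Candidate static counts to sweep; compare the reported training speed
# between runs to find a good combination for this machine.
AGENTS = 64
PREDICTORS = 4
TRAINERS = 4
```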

ifrosio (Collaborator) commented Jan 27, 2017

  1. We cannot answer this without experimenting. The best approach may be to do a grid search and check that dynamic scheduling is close to optimal, as we expect.
  2. We are working on a multi-GPU implementation. When using a small DNN (e.g. the default A3C network), the bottleneck is the GPU-CPU communication, so naively adding more GPUs does not help in this case; a more sophisticated method is required to leverage the computational power of multiple GPUs (the sketch below shows what the naive approach looks like).
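For context, here is a minimal sketch (not the GA3C code) of the naive data-parallel "tower" pattern under discussion, written against the TF 1.x graph API with a hypothetical tiny network standing in for the real model. Each GPU computes gradients on a slice of the batch and the averaged gradients are applied once; note that it does nothing about the per-step CPU-GPU transfer of states and predictions, which is the bottleneck described above.

```python
import tensorflow as tf

NUM_GPUS = 4
STATE_DIM, NUM_ACTIONS = 128, 6   # illustrative sizes only

def tower_loss(states, targets):
    # Tiny stand-in network; the real A3C model is larger.
    logits = tf.layers.dense(states, NUM_ACTIONS, name='policy')
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=targets, logits=logits))

states = tf.placeholder(tf.float32, [None, STATE_DIM])
targets = tf.placeholder(tf.float32, [None, NUM_ACTIONS])
opt = tf.train.RMSPropOptimizer(1e-3)

# Build one copy ("tower") of the model per GPU, sharing the variables.
tower_grads = []
state_splits = tf.split(states, NUM_GPUS)
target_splits = tf.split(targets, NUM_GPUS)
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i):
            loss = tower_loss(state_splits[i], target_splits[i])
            tf.get_variable_scope().reuse_variables()
            tower_grads.append(opt.compute_gradients(loss))

# Average the per-GPU gradients and apply them in a single update.
avg_grads = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars if g is not None]
    avg_grads.append((tf.reduce_mean(tf.stack(grads), axis=0),
                      grads_and_vars[0][1]))
train_op = opt.apply_gradients(avg_grads)
```

Even with four towers, every batch of states still has to be gathered on the CPU and copied to the GPUs each step, so this by itself mainly shrinks the per-GPU batch rather than removing the communication cost.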

developeralgo8888 commented Feb 5, 2017

When do you expect the multi-GPU implementation to be ready? 99% of researchers and AI users run multiple NVIDIA GPUs in a single system for research, tests, and quick training before they pull in the big guns -- grid supercomputers. I am not sure why your team did not consider a multi-GPU implementation first; it would have made your code very efficient at using multiple GPUs by simply selecting the number of GPUs to use: 1, 2, 3, 4, or 8.

mbz (Contributor) commented Feb 6, 2017

@developeralgo8888 we don't have an ETA yet, but we are working on it. As @ifrosio mentioned, a naive multi-GPU implementation does not improve the convergence rate and may cause instabilities. A naive data-parallelism implementation (which I believe is what you are suggesting 99% of researchers use) puts more pressure on GA3C's bottleneck (i.e. CPU-GPU communication), so there is no return on it. Feel free to implement the code you are suggesting (it shouldn't be more than two lines of code, as you said), but it is very unlikely to improve performance.
