Suggested Config.py settings for a DGX-1 #5

Open

ProgramItUp opened this issue Jan 26, 2017 · 3 comments

Comments

ProgramItUp commented Jan 26, 2017

After running _train.sh with the default Config.py on a DGX-1 for about an hour, I see that CPU usage stays fairly constant at about 15%, and one GPU is being used at about 40%.

The settings in Config.py are unchanged (DYNAMIC_SETTINGS = True). The number of trainers varies between 2 and 6, the number of predictors between 1 and 2, and the number of agents between 34 and 39. I would have expected these to grow until they use the available CPU resources.

  1. Are there settings that will better leverage the cores on a DGX-1? (See the Config.py sketch after this list.)
  2. It looks like the code in NetworkVP.py is written for a single GPU. With TensorFlow's support for multiple GPUs, do you have plans to add multi-GPU support? On the surface it seems pretty easy to add:
for d in ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']:
    with tf.device(d):
        # ... per-GPU calculations here ...
        pass
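For question 1, a minimal sketch of the kind of static override one could experiment with. The knob names (DYNAMIC_SETTINGS, AGENTS, PREDICTORS, TRAINERS) are assumed to match those in this repository's Config.py, and the values are illustrative starting points for a sweep, not recommendations.

```python
# Hypothetical Config.py overrides for a many-core machine such as a DGX-1.
# Verify that these names exist in the shipped Config.py before editing.

# Turn off dynamic adjustment so the process counts stay fixed for the run.
DYNAMIC_SETTINGS = False

# Candidate static counts to sweep; compare the reported training speed
# between runs to find a good combination for this machine.
AGENTS = 64
PREDICTORS = 4
TRAINERS = 4
```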

ifrosio (Collaborator) commented Jan 27, 2017

  1. We cannot answer this without experimenting. The best approach may be to do a grid search and check that dynamic scheduling is close to optimal, as we expect.
  2. We are working on a multi-GPU implementation. When using a small DNN (e.g. the default A3C network), the bottleneck is the GPU-CPU communication, so naively adding more GPUs does not help in this case; a more sophisticated method is required to leverage the computational power of multiple GPUs (the sketch below shows what the naive approach looks like).
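For context, here is a minimal sketch (not the GA3C code) of the naive data-parallel "tower" pattern under discussion, written against the TF 1.x graph API with a hypothetical tiny network standing in for the real model. Each GPU computes gradients on a slice of the batch and the averaged gradients are applied once; note that it does nothing about the per-step CPU-GPU transfer of states and predictions, which is the bottleneck described above.

```python
import tensorflow as tf

NUM_GPUS = 4
STATE_DIM, NUM_ACTIONS = 128, 6   # illustrative sizes only

def tower_loss(states, targets):
    # Tiny stand-in network; the real A3C model is larger.
    logits = tf.layers.dense(states, NUM_ACTIONS, name='policy')
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=targets, logits=logits))

states = tf.placeholder(tf.float32, [None, STATE_DIM])
targets = tf.placeholder(tf.float32, [None, NUM_ACTIONS])
opt = tf.train.RMSPropOptimizer(1e-3)

# Build one copy ("tower") of the model per GPU, sharing the variables.
tower_grads = []
state_splits = tf.split(states, NUM_GPUS)
target_splits = tf.split(targets, NUM_GPUS)
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPUS):
        with tf.device('/gpu:%d' % i):
            loss = tower_loss(state_splits[i], target_splits[i])
            tf.get_variable_scope().reuse_variables()
            tower_grads.append(opt.compute_gradients(loss))

# Average the per-GPU gradients and apply them in a single update.
avg_grads = []
for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars if g is not None]
    avg_grads.append((tf.reduce_mean(tf.stack(grads), axis=0),
                      grads_and_vars[0][1]))
train_op = opt.apply_gradients(avg_grads)
```

Even with four towers, every batch of states still has to be gathered on the CPU and copied to the GPUs each step, so this by itself mainly shrinks the per-GPU batch rather than removing the communication cost.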

developeralgo8888 commented Feb 5, 2017

When do you expect the multi-GPU implementation to be ready? 99% of researchers and AI users run multiple NVIDIA GPUs in a single system for research, tests, and quick training before they pull in the big guns -- grid supercomputers. I am not sure why your team did not consider a multi-GPU implementation first; it would have made your code very efficient at using multiple GPUs by simply selecting the number of GPUs to use: 1, 2, 3, 4, or 8.

mbz (Contributor) commented Feb 6, 2017

@developeralgo8888 we don't have an ETA yet, but we are working on it. As @ifrosio mentioned, a naive multi-GPU implementation does not improve the convergence rate and may cause instabilities. A naive data-parallelism implementation (which I believe is what you are suggesting 99% of researchers use) puts more pressure on GA3C's bottleneck (i.e. CPU-GPU communication), so there is no return on it. Feel free to implement the code you are suggesting (it shouldn't be more than two lines of code, as you said), but it is very unlikely to improve performance.
