Running on specific gpu(s) #46
That's actually an interesting use case we haven't addressed. Normally, when we only want to use part of a shared machine, we use GPUs 0,1 or 0,1,2, etc. This is because the nccl communication library doesn't work properly if you reassign the GPU mappings. I'll try to push a small update by tomorrow that should allow you to use any contiguous span of GPUs (i.e. 6,7 or 4,5,6). Sorry if this still wouldn't suit your needs. The last option I would recommend is to run nvidia-docker on your DGX-1 and map only the GPUs you need; that should work out of the box.
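For context, the usual way to hide GPUs from a process is the `CUDA_VISIBLE_DEVICES` environment variable (the comment above lost its inline reference, so treating it as the mechanism in question is an assumption). Whatever devices you expose are renumbered starting at 0 inside the process, which is exactly the kind of remapping described as problematic. A minimal sketch:

```python
import os

# Expose only physical GPUs 6 and 7 to this process; this must be set before
# CUDA (i.e. torch.cuda) is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"

import torch

# Inside the process the two GPUs are renumbered as cuda:0 and cuda:1,
# the remapping that the comment above says nccl has trouble with.
print(torch.cuda.device_count())      # 2
print(torch.cuda.get_device_name(0))  # physical GPU 6
```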
You can now train on GPUs 6-7 by passing the new world-size/base-gpu arguments when running. You can also now run training on a particular GPU by setting the base GPU accordingly.
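The exact launch commands and argument names were not preserved in this thread, so the following is only a sketch of how a base-GPU offset is commonly wired into PyTorch distributed training, with hypothetical flag names rather than the repository's actual ones:

```python
import argparse

import torch
import torch.distributed as dist

# Hypothetical flag names for illustration only; the repo's argparse setup may differ.
parser = argparse.ArgumentParser()
parser.add_argument('--base-gpu', type=int, default=0)
parser.add_argument('--world-size', type=int, default=1)
parser.add_argument('--rank', type=int, default=0)
args = parser.parse_args()

# Each worker binds to base_gpu + rank, so --base-gpu 6 --world-size 2
# occupies the contiguous span GPU 6 and GPU 7.
torch.cuda.set_device(args.base_gpu + args.rank)

if args.world_size > 1:
    # Any rendezvous method works; TCP on localhost keeps the sketch self-contained.
    dist.init_process_group(backend='nccl',
                            init_method='tcp://127.0.0.1:23456',
                            world_size=args.world_size,
                            rank=args.rank)
```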
I am now running the following command. Is this the correct command? Additionally, I receive the error below:
Oh sorry, you're right. My argparsing could've been a bit better. Try it again. Also, if you only intend to use one GPU, using multiproc is unnecessary.
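For readers unfamiliar with the launcher, a multiproc-style wrapper roughly does the following; this is a generic sketch rather than the repository's actual multiproc script, and `main.py` is only a placeholder script name. It spawns one copy of the training script per rank, which is pure overhead when only one process is needed:

```python
import subprocess
import sys

def launch(script, world_size):
    """Spawn one training process per rank and wait for all of them."""
    procs = []
    for rank in range(world_size):
        cmd = [sys.executable, script,
               '--rank', str(rank), '--world-size', str(world_size)]
        procs.append(subprocess.Popen(cmd))
    for p in procs:
        p.wait()

if __name__ == '__main__':
    # With world_size == 1 this wrapper adds nothing over running the script directly.
    launch('main.py', world_size=2)
```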
Thanks! Silly of me not to figure that out myself. If I run the command below, the process is running on GPU 0 and 7, where PID 30004 is another task running on the machine. The main idea is to prevent the current task from interfering with process 30004.
You're right. Just fixed an edge case, so pull and try again.
Should I pull master? It says it is already up to date, and I also see no new commit.
Hmm, you're right, it doesn't seem to have pushed properly.
OK, it should be up properly now.
I tested the code again. World-size one works perfectly, but world-size two gives an error:
Would it also be possible to add world_size and base-gpu support to transfer.py (and classifier.py)?
Yeah, for some reason torch's serialization library doesn't work properly when reloading the best model checkpoint in the distributed setting. If it's not too much trouble, I would just try running it again but skip the training/validation loops.
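As a general workaround (not necessarily the fix applied in this repo), device-mismatch problems when reloading a checkpoint saved by a distributed run can often be avoided by remapping the saved tensors onto the current device with `torch.load`'s `map_location`; `best_model.pt` is just a placeholder path:

```python
import torch

# Remap every saved tensor onto the GPU this process is currently using,
# regardless of which device rank 0 saved it from.
checkpoint = torch.load(
    'best_model.pt',  # placeholder path, not the repo's actual checkpoint name
    map_location=lambda storage, loc: storage.cuda(torch.cuda.current_device()),
)
```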
See #48 for a discussion on multi-GPU support in transfer.py. Any chance you have that file uploaded somewhere?
I want to train sentiment discovery on a new dataset using a shared DGX-1 machine. Because it is shared, I can only use a limited set of GPUs. How can I specify that the process should run only on GPUs 7 and 8 (indices 6-7)?