Running on specific gpu(s) #46

Open
SG87 opened this issue Oct 3, 2018 · 13 comments

SG87 commented Oct 3, 2018

I want to train sentiment discovery on a new dataset on a shared DGX-1 machine. For this I can only use a limited set of GPUs. How can I specify that the process should run only on GPUs 7 and 8 (device indices 6 and 7)?

raulpuric (Contributor) commented:

That's actually an interesting use case we haven't addressed.

Normally, when we only want to use part of a shared machine, we just use GPUs 0,1 or 0,1,2, etc.

This is because the NCCL communication library doesn't work properly if you remap the GPU ordering with CUDA_VISIBLE_DEVICES=6,7.

I'll try to push a small update by tomorrow that should allow you to use any contiguous span of GPUs (e.g. 6,7 or 4,5,6).

Sorry if this still wouldn't suit your needs.

The last option I would recommend is to run nvidia-docker on your DGX-1 and map only the GPUs you need into the container. That should work out of the box with our multiproc.py script.
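
For example, something along these lines should expose only two GPUs inside the container (a rough sketch assuming nvidia-docker2 and its NVIDIA_VISIBLE_DEVICES mechanism; the image name and mount path are placeholders):

    docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=6,7 \
        -v /path/to/sentiment-discovery:/workspace -it <your-pytorch-image> \
        python3 -m multiproc main.py --data data.csv

Inside the container those two GPUs should show up as devices 0 and 1, which is the layout multiproc.py already handles.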

raulpuric (Contributor) commented Oct 8, 2018

You can now train on GPUs 6-7 by running multiproc training with --world_size=2 and --base-gpu=6.

You can also now run training on a particular GPU by setting --base-gpu to that device number (0-indexed).

SG87 (Author) commented Oct 9, 2018

I am now running:
python3 -m multiproc main.py --data data.csv --world_size=1 --base-gpu=7

Is this the correct command?
When I inspect nvidia-smi, I see that all GPUs are being used.

Additionally, I receive the error below:

Traceback (most recent call last):
  File "main.py", line 415, in <module>
    model.load_state_dict(torch.load(args.save, 'cpu'))
  File "/home/cmarrecau/.local/lib/python3.5/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/home/cmarrecau/.local/lib/python3.5/site-packages/torch/serialization.py", line 544, in _load
    deserialized_storage_keys = pickle_module.load(f)
_pickle.UnpicklingError
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what(): [enforce fail at /pytorch/third_party/gloo/gloo/cuda_private.h:40] error == cudaSuccess. 29 vs 0. Error at: /pytorch/third_party/gloo/gloo/cuda_private.h:40: driver shutting down

(the same _pickle.UnpicklingError and gloo::EnforceNotMet messages are repeated once per worker process)

raulpuric (Contributor) commented:

Oh sorry you're right.

My argparsing could've been a bit better.

Try python3 -m multiproc main.py --data data.csv --world_size 1 --base-gpu 7 instead.

Also, if you only intend to use one GPU, using multiproc is unnecessary.
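
For one GPU, something like this should be enough (the data path is just a placeholder):

    python3 main.py --data data.csv --base-gpu 7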

SG87 (Author) commented Oct 11, 2018

Thanks! Silly that I could not figure that out myself.

If I run the command below, the process is running on GPUs 0 and 7:
python3 -m multiproc main.py --data data.csv --world_size 2 --base-gpu 6

[screenshot of nvidia-smi output, 2018-10-11 11:48]

Here PID 30004 is another task running on the machine. The main point is to prevent the current runs from interfering with process 30004.

raulpuric (Contributor) commented:

You're right. Just fixed an edge case, so pull and try again.
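
For reference, the intended per-worker mapping is roughly the sketch below; this is my own illustration of the idea, not the literal multiproc.py/main.py code, and the function name is made up:

    import torch

    def assign_device(local_rank, base_gpu):
        # Each spawned worker should use the GPU at base_gpu + its local rank,
        # so --base-gpu 6 with --world_size 2 lands on devices 6 and 7 only.
        device_id = base_gpu + local_rank
        assert device_id < torch.cuda.device_count(), "base_gpu + rank exceeds available GPUs"
        torch.cuda.set_device(device_id)
        return device_id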

SG87 (Author) commented Oct 12, 2018

> You're right. Just fixed an edge case, so pull and try again.

Should I pull master? It says it is up to date, and I don't see a new commit either.

raulpuric (Contributor) commented Oct 12, 2018

Hmm, you're right, it doesn't seem to have pushed properly.

raulpuric (Contributor) commented:

OK, it should be up properly now.

SG87 (Author) commented Oct 24, 2018

I tested the code again.

World size one works perfectly:
python3 main.py --data data/data.csv --world_size 1 --base-gpu 7

World size two gives an error:
python3 -m multiproc main.py --data data/data.csv --world_size 2 --base-gpu 6
Error:

Traceback (most recent call last):
  File "main.py", line 402, in <module>
    model.load_state_dict(torch.load(args.save, 'cpu'))
  File "/home/cmarrecau/.local/lib/python3.5/site-packages/torch/serialization.py", line 358, in load
    return _load(f, map_location, pickle_module)
  File "/home/cmarrecau/.local/lib/python3.5/site-packages/torch/serialization.py", line 549, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 4358612012146265478 got 256
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what(): [enforce fail at /pytorch/third_party/gloo/gloo/cuda_private.h:40] error == cudaSuccess. 29 vs 0. Error at: /pytorch/third_party/gloo/gloo/cuda_private.h:40: driver shutting down

(the same RuntimeError and gloo::EnforceNotMet messages appear once per worker process)

SG87 (Author) commented Oct 24, 2018

Is it also possible to add support for world_size and base-gpu to transfer.py (and classifier.py)?

raulpuric (Contributor) commented:

Yeah, for some reason torch's serialization doesn't work properly when reloading the best model checkpoint in the distributed setting. If it's not too much trouble, I would just try running it again but skipping the training/validation loops (with a continue or something).
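
To make the workaround concrete, here is a stripped-down sketch of what I mean; the skip_training switch and the stubbed loop body are just illustrative, not the actual main.py code:

    import torch

    skip_training = True            # hypothetical switch for the rerun
    model = torch.nn.Linear(4, 4)   # stand-in for the real model
    epochs = 10

    for epoch in range(epochs):
        if skip_training:
            continue                # skip the train/validation work entirely
        # train(...) and validate(...) would normally run here

The run then falls straight through to the checkpoint reload that was failing, i.e. model.load_state_dict(torch.load(args.save, 'cpu')).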

raulpuric (Contributor) commented:

See #48 for a discussion of multi-GPU support in transfer.py.

Any chance you have that file uploaded somewhere?
