Running on specific gpu(s) #46
That's actually an interesting use case we haven't addressed. Normally, when we only want to use part of a shared machine, we use GPUs 0,1 or 0,1,2, etc. This is because the nccl communication library doesn't work properly if you reassign the GPU mappings. I'll try to push a small update by tomorrow that should allow you to use any contiguous span of GPUs (i.e. 6,7 or 4,5,6). Sorry if this still wouldn't suit your needs. The last option I would recommend is to run nvidia-docker on your DGX-1 and map only the GPUs you need; that should work out of the box.
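For context, the usual way to hide GPUs from a process is the `CUDA_VISIBLE_DEVICES` environment variable (the comment above lost its inline reference, so treating it as the mechanism in question is an assumption). Whatever devices you expose are renumbered starting at 0 inside the process, which is exactly the kind of remapping described as problematic. A minimal sketch:

```python
import os

# Expose only physical GPUs 6 and 7 to this process; this must be set before
# CUDA (i.e. torch.cuda) is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "6,7"

import torch

# Inside the process the two GPUs are renumbered as cuda:0 and cuda:1,
# the remapping that the comment above says nccl has trouble with.
print(torch.cuda.device_count())      # 2
print(torch.cuda.get_device_name(0))  # physical GPU 6
```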
You can now train on GPUs 6-7 by passing the new world-size/base-gpu arguments when running. You can also now run training on a particular GPU by setting the base GPU accordingly.
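The exact launch commands and argument names were not preserved in this thread, so the following is only a sketch of how a base-GPU offset is commonly wired into PyTorch distributed training, with hypothetical flag names rather than the repository's actual ones:

```python
import argparse

import torch
import torch.distributed as dist

# Hypothetical flag names for illustration only; the repo's argparse setup may differ.
parser = argparse.ArgumentParser()
parser.add_argument('--base-gpu', type=int, default=0)
parser.add_argument('--world-size', type=int, default=1)
parser.add_argument('--rank', type=int, default=0)
args = parser.parse_args()

# Each worker binds to base_gpu + rank, so --base-gpu 6 --world-size 2
# occupies the contiguous span GPU 6 and GPU 7.
torch.cuda.set_device(args.base_gpu + args.rank)

if args.world_size > 1:
    # Any rendezvous method works; TCP on localhost keeps the sketch self-contained.
    dist.init_process_group(backend='nccl',
                            init_method='tcp://127.0.0.1:23456',
                            world_size=args.world_size,
                            rank=args.rank)
```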
I am now running the following command. Is this the correct command? Additionally, I receive the error below:
Oh sorry, you're right. My argparsing could've been a bit better. Try it again. Also, if you only intend to use one GPU, using multiproc is unnecessary.
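For readers unfamiliar with the launcher, a multiproc-style wrapper roughly does the following; this is a generic sketch rather than the repository's actual multiproc script, and `main.py` is only a placeholder script name. It spawns one copy of the training script per rank, which is pure overhead when only one process is needed:

```python
import subprocess
import sys

def launch(script, world_size):
    """Spawn one training process per rank and wait for all of them."""
    procs = []
    for rank in range(world_size):
        cmd = [sys.executable, script,
               '--rank', str(rank), '--world-size', str(world_size)]
        procs.append(subprocess.Popen(cmd))
    for p in procs:
        p.wait()

if __name__ == '__main__':
    # With world_size == 1 this wrapper adds nothing over running the script directly.
    launch('main.py', world_size=2)
```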
Thanks! Silly of me not to figure that out myself. If I run the command below, the process is running on GPU 0 and 7, where PID 30004 is another task running on the machine. The main idea is to prevent the current task from interfering with process 30004.
You're right. Just fixed an edge case, so pull and try again.
Should I pull master? It says it is already up to date, and I also see no new commit.
Hmm, you're right, it doesn't seem to have pushed properly.
OK, it should be up properly now.
I tested the code again. World-size one works perfectly, but world-size two gives an error:
Would it also be possible to add world_size and base-gpu support to transfer.py (and classifier.py)?
Yeah, for some reason torch's serialization library doesn't work properly when reloading the best model checkpoint in the distributed setting. If it's not too much trouble, I would just try running it again but skip the training/validation loops.
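As a general workaround (not necessarily the fix applied in this repo), device-mismatch problems when reloading a checkpoint saved by a distributed run can often be avoided by remapping the saved tensors onto the current device with `torch.load`'s `map_location`; `best_model.pt` is just a placeholder path:

```python
import torch

# Remap every saved tensor onto the GPU this process is currently using,
# regardless of which device rank 0 saved it from.
checkpoint = torch.load(
    'best_model.pt',  # placeholder path, not the repo's actual checkpoint name
    map_location=lambda storage, loc: storage.cuda(torch.cuda.current_device()),
)
```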
See #48 for a discussion on multi-GPU support in transfer.py. Any chance you have that file uploaded somewhere?
I want to train sentiment discovery on a new dataset using a shared DGX-1 machine. Because it is shared, I can only use a limited set of GPUs. How can I specify that the process should run only on GPUs 7 and 8 (indices 6-7)?