
Error with running code on multiple GPUs #7

Open
samyakag opened this issue Jan 17, 2022 · 3 comments

@samyakag

samyakag commented Jan 17, 2022

While trying to train with 4 GPUs and args.multiGPU = True, the following error occurs: torch.nn.modules.module.ModuleAttributeError: 'ModelU' object has no attribute 'lxrt_encoder'
Traceback (most recent call last):
File "hm.py", line 392, in
main()
File "hm.py", line 346, in main
hm = HM()
File "hm.py", line 91, in init
self.model.lxrt_encoder.multi_gpu()
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 772, in getattr
type(self).name, name))
torch.nn.modules.module.ModuleAttributeError: 'ModelU' object has no attribute 'lxrt_encoder'

Even after commenting out the above line in hm.py, the model still seems to train on only a single GPU according to nvidia-smi.

@Muennighoff
Owner

If you want to run the models via data parallelism, you will need to wrap them in torch.nn.DataParallel. In its current state, the args.multiGPU flag does not work.
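Roughly something like this (untested sketch; maybe_wrap_data_parallel is just an illustrative helper name, args.multiGPU is the repo's existing flag):

```python
import torch
import torch.nn as nn

def maybe_wrap_data_parallel(model: nn.Module, multi_gpu: bool) -> nn.Module:
    """Move the model to the GPU and, if requested and more than one device
    is visible, wrap it in nn.DataParallel so each forward pass splits the
    batch along dim 0 across the available GPUs."""
    model = model.cuda()
    if multi_gpu and torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model

# In hm.py this would replace `self.model = self.model.cuda()`, roughly:
# self.model = maybe_wrap_data_parallel(self.model, args.multiGPU)
```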

@samyakag
Author

Yeah, I tried doing that by changing:
self.model = self.model.cuda() to
self.model = nn.DataParallel(self.model.cuda())
but encountered this error:
Traceback (most recent call last):
File "hm.py", line 390, in
main()
File "hm.py", line 357, in main
hm.train(hm.train_tuple, hm.valid_tuple)
File "hm.py", line 184, in train
logit = self.model(sent, (feats, boxes))
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
AssertionError: Caught AssertionError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/samyakxd/miniconda3/envs/vilio_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/scratch/samyakxd/vilio/entryU.py", line 201, in forward
input_ids, img_feats, img_pos_feats, attn_masks, gather_index = self.preprocess_bert(sents, visual_feats, self.num_features, self.tokenizer)
File "/scratch/samyakxd/vilio/entryU.py", line 192, in preprocess_bert
gather_index = self.get_gather_index(txt_lens, num_bbs, bs, max_tl, out_size)
File "/scratch/samyakxd/vilio/entryU.py", line 211, in get_gather_index
assert len(txt_lens) == len(num_bbs) == batch_size
AssertionError

@Muennighoff
Owner

Yeah, there is still some preprocessing happening in entryU. Maybe try instead wrapping the self.model inside entryU, i.e. the one built via
self.model, loading_info = BertU.from_pretrained(self.tr_name, img_dim=2048, output_loading_info=True)
in torch.nn.DataParallel?
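A rough sketch of that change (untested; wrap_inner_encoder is just an illustrative helper, and the surrounding code in entryU.py is assumed to look like the line quoted above):

```python
import torch
import torch.nn as nn

def wrap_inner_encoder(encoder: nn.Module) -> nn.Module:
    """Wrap only the inner transformer (the BertU instance built in entryU.py)
    in DataParallel. The outer module's preprocessing (tokenization,
    get_gather_index) then still sees the full batch on a single process,
    and only the resulting tensors get scattered across GPUs."""
    if torch.cuda.device_count() > 1:
        encoder = nn.DataParallel(encoder)
    return encoder

# In entryU.py, right after the line quoted above (names as in that snippet):
# self.model, loading_info = BertU.from_pretrained(
#     self.tr_name, img_dim=2048, output_loading_info=True)
# self.model = wrap_inner_encoder(self.model)
```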
