Revert "Set device for torch tensors with gpu > 1 (#132)" #134

Merged
edknv merged 1 commit into NVIDIA-Merlin:main on Apr 14, 2023

Conversation

edknv
Contributor

edknv commented Apr 13, 2023

This reverts commit 8782c9d (which fixed #131).

Setting the device via the CuPy API causes the Horovod (2-GPU) tests to hang with the following error:

[1,1]<stdout>:merlin/models/tf/models/base.py:1387: in fit                                                                                                                          
[1,1]<stdout>:    out = super().fit(**fit_kwargs)                                                                                                                                   
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py:70: in error_handler                                                                            
[1,1]<stdout>:    raise e.with_traceback(filtered_tb) from None                                                                                                                     
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py:78: in __getitem__                                                                             
[1,1]<stdout>:    return self.__next__()                                                                                                                                            
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py:82: in __next__                                                                                
[1,1]<stdout>:    converted_batch = self.convert_batch(super().__next__())                                                                                                          
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:261: in __next__                                                                              
[1,1]<stdout>:    return self._get_next_batch()                                                                                                                                     
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:332: in _get_next_batch                                                                       
[1,1]<stdout>:    batch = next(self._batch_itr)                                                                                                                                     
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:369: in make_tensors                                                                          
[1,1]<stdout>:    tensors_by_name = self._convert_df_to_tensors(gdf)                                                                                                                
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner                                                                                                     
[1,1]<stdout>:    result = func(*args, **kwargs)                                                                                                                                    
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:524: in _convert_df_to_tensors                                                                
[1,1]<stdout>:    tensors_by_name[column_name] = self._to_tensor(gdf_i[[column_name]])                                                                                              
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:453: in _to_tensor                                                                            
[1,1]<stdout>:    with cupy.cuda.Device(self.device):                                                                                                                               
[1,1]<stdout>:cupy/cuda/device.pyx:184: in cupy.cuda.device.Device.__enter__                                                                                                        
[1,1]<stdout>:    ???                                                                                                                                                               
[1,1]<stdout>:cupy_backends/cuda/api/runtime.pyx:365: in cupy_backends.cuda.api.runtime.setDevice                                                                                   
[1,1]<stdout>:    ???                                                                                                                                                               
[1,1]<stdout>:_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _                                                                                                                                                                       
[1,1]<stdout>:                                                                                                                                                                      
[1,1]<stdout>:>   ???                                                                                                                                                               
[1,1]<stdout>:E   cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal                                                                   
[1,1]<stdout>:                                                                                                                                                                      
[1,1]<stdout>:cupy_backends/cuda/api/runtime.pyx:142: CUDARuntimeError                                                                                                              
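
The traceback ends with `setDevice` rejecting the ordinal passed to `cupy.cuda.Device`. One possible explanation, which this PR does not confirm, is that each Horovod worker sees only a subset of the GPUs (for example through `CUDA_VISIBLE_DEVICES`), so an ordinal that is valid on the host is out of range inside the worker. The sketch below is a hypothetical pre-check, not code from the dataloader: `visible_gpu_count` and `is_valid_ordinal` are illustrative names, and the parsing is simplified (it counts every non-empty entry in the mask).

```python
import os
from typing import Optional


def visible_gpu_count(env: Optional[dict] = None) -> Optional[int]:
    """Count the entries in CUDA_VISIBLE_DEVICES; return None if it is unset."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        # No mask: every physical GPU is visible, so the count is unknown here.
        return None
    return len([entry for entry in value.split(",") if entry.strip()])


def is_valid_ordinal(device: int, env: Optional[dict] = None) -> Optional[bool]:
    """Check a device ordinal against the mask; None means the mask cannot decide."""
    count = visible_gpu_count(env)
    if count is None:
        return None
    # Visible devices are renumbered from 0 inside the process.
    return 0 <= device < count
```

With `CUDA_VISIBLE_DEVICES=1`, the worker sees a single GPU as ordinal 0, so `is_valid_ordinal(1)` returns `False` even though GPU 1 exists on the host.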

edknv requested a review from jperez999 Apr 13, 2023 17:42
edknv self-assigned this Apr 13, 2023
edknv added the bug and chore labels Apr 13, 2023
edknv added this to the Merlin 23.04 milestone Apr 13, 2023
edknv merged commit 014b658 into NVIDIA-Merlin:main Apr 14, 2023
Development

Successfully merging this pull request may close these issues.

Device assignment does not work in PyTorch
2 participants