Revert "Set device for torch tensors with gpu > 1 (#132)" #134

edknv · 2023-04-13T17:39:34Z

This reverts commit 8782c9d (which fixed #131).

Setting the device via the CuPy API causes the Horovod (2-GPU) tests to hang with:

[1,1]<stdout>:merlin/models/tf/models/base.py:1387: in fit                                                                                                                          
[1,1]<stdout>:    out = super().fit(**fit_kwargs)                                                                                                                                   
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py:70: in error_handler                                                                            
[1,1]<stdout>:    raise e.with_traceback(filtered_tb) from None                                                                                                                     
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py:78: in __getitem__                                                                             
[1,1]<stdout>:    return self.__next__()                                                                                                                                            
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/tensorflow.py:82: in __next__                                                                                
[1,1]<stdout>:    converted_batch = self.convert_batch(super().__next__())                                                                                                          
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:261: in __next__                                                                              
[1,1]<stdout>:    return self._get_next_batch()                                                                                                                                     
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:332: in _get_next_batch                                                                       
[1,1]<stdout>:    batch = next(self._batch_itr)                                                                                                                                     
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:369: in make_tensors                                                                          
[1,1]<stdout>:    tensors_by_name = self._convert_df_to_tensors(gdf)                                                                                                                
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/nvtx/nvtx.py:101: in inner                                                                                                     
[1,1]<stdout>:    result = func(*args, **kwargs)                                                                                                                                    
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:524: in _convert_df_to_tensors                                                                
[1,1]<stdout>:    tensors_by_name[column_name] = self._to_tensor(gdf_i[[column_name]])                                                                                              
[1,1]<stdout>:/usr/local/lib/python3.8/dist-packages/merlin/dataloader/loader_base.py:453: in _to_tensor                                                                            
[1,1]<stdout>:    with cupy.cuda.Device(self.device):                                                                                                                               
[1,1]<stdout>:cupy/cuda/device.pyx:184: in cupy.cuda.device.Device.__enter__                                                                                                        
[1,1]<stdout>:    ???                                                                                                                                                               
[1,1]<stdout>:cupy_backends/cuda/api/runtime.pyx:365: in cupy_backends.cuda.api.runtime.setDevice                                                                                   
[1,1]<stdout>:    ???                                                                                                                                                               
[1,1]<stdout>:_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _                                                                                                                                                                       
[1,1]<stdout>:                                                                                                                                                                      
[1,1]<stdout>:>   ???                                                                                                                                                               
[1,1]<stdout>:E   cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal                                                                   
[1,1]<stdout>:                                                                                                                                                                      
[1,1]<stdout>:cupy_backends/cuda/api/runtime.pyx:142: CUDARuntimeError
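One possible reading of the `invalid device ordinal` error, offered only as an assumption since the root cause was not investigated here: CUDA device ordinals are relative to the devices visible to the process, so if each Horovod worker is restricted to a single GPU (for example through `CUDA_VISIBLE_DEVICES`), ordinal 1 does not exist for that worker. The sketch below uses two hypothetical helpers, `visible_device_count` and `is_valid_ordinal`, which are not part of merlin-dataloader or CuPy, to illustrate the check with a simplified parse of the environment variable.

```python
import os


def visible_device_count(env=None):
    """Return how many GPUs CUDA_VISIBLE_DEVICES lists, or None if it is unset.

    Hypothetical helper for illustration only. The parsing is simplified:
    it counts comma-separated entries and does not validate them.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([entry for entry in value.split(",") if entry.strip()])


def is_valid_ordinal(device, env=None):
    """Check whether a device ordinal can exist under the visible-device mask.

    When the mask is unset we cannot tell from the environment alone,
    so the ordinal is assumed to be valid.
    """
    count = visible_device_count(env)
    return count is None or 0 <= device < count


# A worker that only sees physical GPU 1 addresses it as ordinal 0.
worker_env = {"CUDA_VISIBLE_DEVICES": "1"}
print(is_valid_ordinal(0, worker_env))
print(is_valid_ordinal(1, worker_env))
```

Under that assumption, a guard of this kind before entering `cupy.cuda.Device(self.device)` would tell us whether the dataloader's device index and the worker's visible devices disagree.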

Revert "Set device for torch tensors with gpu > 1 (NVIDIA-Merlin#132)"

525f009

This reverts commit 8782c9d.

edknv requested a review from jperez999 April 13, 2023 17:42

edknv self-assigned this Apr 13, 2023

edknv added the bug and chore labels Apr 13, 2023

edknv added this to the Merlin 23.04 milestone Apr 13, 2023

karlhigley approved these changes Apr 14, 2023

edknv merged commit 014b658 into NVIDIA-Merlin:main Apr 14, 2023
