
Error related to magma during training #101

Open · Cdk29 opened this issue Dec 25, 2020 · 15 comments
Labels: bug (Something isn't working)

Comments

Cdk29 (Contributor) commented Dec 25, 2020

@henry090: I am trying to train an xse_resnet50.

During training I got the following error:

R: /opt/conda/conda-bld/magma-cuda101_1583546950098/work/interface_cuda/interface.cpp:901: void magma_queue_create_from_cuda_internal(magma_device_t, cudaStream_t, cublasHandle_t, cusparseHandle_t, magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null' failed.

It is not a simple out-of-memory error; it seems to be some kind of memory leak related to magma, similar to the one reported here (/pytorch/issues/26120).

But I did not find any mention of this bug occurring with fastai, which I would have expected if it happened regularly, except for this message on this thread: https://forums.fast.ai/t/a-walk-with-fastai2-vision-study-group-and-online-lectures-megathread/59929/1293 :

The only “new” thing I am doing is that I am encapsulating most of my code for training the model in a try/except block in a while loop.

I wonder if the memory leak is somehow due to using a function as a wrapper, or to reticulate.
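In R, that kind of wrapping would look roughly like the sketch below. It is an illustration only, not the actual code from the kernel or from the forum post; it assumes a learner learn already built, with fit_one_cycle() standing in for the real training call.

library(fastai)
library(magrittr)

# Sketch of the wrapper pattern: retry the training step inside a while loop
# and catch any error with tryCatch(). `learn` is assumed to exist already.
keep_going <- TRUE
while (keep_going) {
  tryCatch({
    learn %>% fit_one_cycle(1)   # stands in for the real training call
    keep_going <- FALSE          # stop looping once training succeeds
  }, error = function(e) {
    message("Training failed, retrying: ", conditionMessage(e))
  })
}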
Link to the code and the error: https://www.kaggle.com/cdk292/magma-error-xse-resnext50-with-r?scriptVersionId=50229515

The latest version is still running, but you can see the error in the execution log of version 4, and it will probably show up again in V6.

PS: Merry Christmas.

Cdk29 (Author) commented Dec 25, 2020

I am trying without $to_fp16() to see if the error comes from there.

turgut090 (Member):

I get the following with to_fp16:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  AssertionError: Mixed-precision training requires a GPU, remove the call `to_fp16`

Detailed traceback: 
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/fastcore/logargs.py", line 56, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/fastai/callback/schedule.py", line 113, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/fastcore/logargs.py", line 56, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/fastai/learner.py", line 207, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.

Cdk29 (Author) commented Dec 25, 2020

Weird. You don't have a GPU?

I tried without $to_fp16(), but I just got a classic out-of-memory error. At first glance, it does seem to come from that.

With $to_fp16(), the full error message looks something like this:

/usr/local/share/.virtualenvs/r-reticulate/lib/python3.7/site-packages/fastai/learner.py:53: UserWarning: Could not load the optimizer state. 
 if with_opt: warn("Could not load the optimizer state.")
epoch   train_loss   valid_loss   accuracy   time
0       1.302142     1.065968     0.646413   12:58
1       1.160781     1.065510     0.620005   12:47
2       1.134573     1.001534     0.633092   12:51
R: /opt/conda/conda-bld/magma-cuda101_1583546950098/work/interface_cuda/interface.cpp:901: void 
magma_queue_create_from_cuda_internal(magma_device_t, cudaStream_t, cublasHandle_t, cusparseHandle_t, 
magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null' failed.

Any idea what could be happening?

Cdk29 (Author) commented Dec 25, 2020

@henry090: I wonder whether it is a more general memory-leak issue, as the GitHub issue I linked in the first post suggests (/pytorch/issues/26120). My technical background is light on that matter.

If so, I am not sure which models are affected. My guess is that it happens with every model but only becomes visible on the ones that require a lot of GPU computation, such as the xse_resnext. Or maybe it is a specific unit such as the SE block, but I am just guessing.

turgut090 (Member) commented Dec 25, 2020 via email

Cdk29 (Author) commented Dec 25, 2020

Same without fp16: a memory error, but only after a few epochs (!), and I get the same traceback/error if I look in the log.

Looks like a memory leak. I will try with a smaller batch size, or do a really quick and dirty training run of 3 epochs, save the model, and restart in another kernel.
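Roughly what I have in mind is the sketch below. It is only a sketch: it assumes the Python Learner methods save()/load() are reachable through reticulate's $ (as with $to_fp16() above), and the checkpoint names are arbitrary.

# --- Kernel 1: build dls/learn as usual, train a few epochs, then checkpoint ---
learn %>% fit_one_cycle(3)
learn$save("xse_resnext50_part1")    # by default written under models/ as a .pth file

# --- Kernel 2 (fresh GPU memory): rebuild the exact same dls/learn, then resume ---
learn$load("xse_resnext50_part1")    # reload the weights (and optimizer state)
learn %>% fit_one_cycle(3)
learn$save("xse_resnext50_part2")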

turgut090 (Member):

I reinstalled fastai and now it works fine with batch size == 10. This is the result for 1 epoch:

epoch   train_loss   valid_loss   accuracy   time  
------  -----------  -----------  ---------  ------
0       3.050841     1.718752     0.580000   00:07 
system('nvidia-smi',intern = T)
 [1] "Fri Dec 25 21:46:02 2020       "                                                
 [2] "+-----------------------------------------------------------------------------+"
 [3] "| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |"
 [4] "|-------------------------------+----------------------+----------------------+"
 [5] "| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |"
 [6] "| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |"
 [7] "|                               |                      |               MIG M. |"
 [8] "|===============================+======================+======================|"
 [9] "|   0  GeForce RTX 2060    Off  | 00000000:01:00.0  On |                  N/A |"
[10] "|  0%   42C    P8    10W / 170W |   3226MiB /  5931MiB |      4%      Default |"
[11] "|                               |                      |                  N/A |"
[12] "+-------------------------------+----------------------+----------------------+"
[13] "                                                                               "
[14] "+-----------------------------------------------------------------------------+"
[15] "| Processes:                                                                  |"
[16] "|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |"
[17] "|        ID   ID                                                   Usage      |"
[18] "|=============================================================================|"
[19] "|    0   N/A  N/A      1298      G   /usr/lib/xorg/Xorg                 85MiB |"
[20] "|    0   N/A  N/A      2205      G   compiz                             31MiB |"
[21] "|    0   N/A  N/A      4977      G   ...AAAAAAAAA= --shared-files       63MiB |"
[22] "|    0   N/A  N/A      6185      G   /usr/lib/rstudio/bin/rstudio       40MiB |"
[23] "|    0   N/A  N/A      6239      C   .../lib/rstudio/bin/rsession     2999MiB |"
[24] "+-----------------------------------------------------------------------------+"

turgut090 (Member):

This is my full code:

library(fastai)
library(magrittr)

# Read the competition csv files; keep the labels as character
df = data.table::fread("~/Downloads/plant/train.csv")
df[['label']] = as.character(df[['label']])
smp = data.table::fread("~/Downloads/plant/sample_submission.csv")

# Small random subset, just to iterate quickly
df = dplyr::sample_n(df,250)

# Batch-level augmentations
batch_tfms = list(RandomResizedCrop(256),
                  #DeterministicDihedral(),
                  #Warp(), Hue(), Saturation(),
                  Hue(0.2, 0.5),
                  aug_transforms(size = 300, max_rotate = 180.,
                                 max_lighting = 0.2, max_warp = 0.4,
                                 flip_vert = TRUE, max_zoom = 2.),
                  Normalize_from_stats( imagenet_stats() ))

# DataBlock: image paths resolved from the csv rows, labels from the 'label' column
plant = DataBlock(blocks = list(ImageBlock(), CategoryBlock()),
                         get_x = function(x) {paste('plant/train_images', x[[0]], sep = '/')},
                         get_y = ColReader('label'),
                         item_tfms = Resize(300),
                         splitter = RandomSplitter(),
                         batch_tfms = batch_tfms) 

dls = plant %>% dataloaders(df, bs = 10)

dls %>% show_batch(max_n = 10)

# Mixed-precision learner, trained for one cycle
learn <- dls %>% cnn_learner(xse_resnext50(), metrics = accuracy) #prettier
learn$to_fp16()
cyc = learn %>% fit_one_cycle(1)

Cdk29 (Author) commented Dec 25, 2020

Yes. For me it happens only after some epochs, whether it is on my local computer or on Kaggle.

turgut090 (Member):

I think it is a CUDA error, so it is not related to the wrapper, IMO. The Kaggle output also indicates that:

Error in py_call_impl(callable, dots$args, dots$keywords): RuntimeError: CUDA out of memory. 
Tried to allocate 20.00 MiB (GPU 0; 15.90 GiB total capacity; 13.80 GiB already allocated; 
9.75 MiB free; 15.08 GiB reserved in total by PyTorch)

You could try to skip lr_find and see if it might work without that.

With sample data it works fine:

cyc = learn %>% fit_one_cycle(10)
epoch   train_loss   valid_loss   accuracy   time 
------  -----------  -----------  ---------  -----
0       2.654885     1.186562     0.580000   00:07 
1       2.604904     1.621742     0.600000   00:06 
2       2.524296     2.486575     0.260000   00:07 
3       2.336369     1.181870     0.500000   00:07 
4       2.164644     1.734055     0.480000   00:07 
5       2.027067     1.226135     0.580000   00:07 
6       1.933184     1.205350     0.620000   00:07 
7       1.812553     1.124996     0.580000   00:07 
8       1.736114     1.102648     0.600000   00:07 
9       1.677608     1.133445     0.620000   00:07 

Cdk29 (Author) commented Dec 25, 2020

Yes. I just got suspicious because the only report I found said it happened because the person was putting everything into a function. Actually, that might simply have triggered the leak faster, since they report having to reboot their computer to clear the memory.

I'll let you close the issue; maybe people who run into this error will come across it and see that it is probably a memory leak.

turgut090 (Member):

Hi. Did you manage to solve this issue?

Cdk29 (Author) commented Jan 7, 2021

@henry090 Hi! Yes and no. The issue cannot really be solved directly, but I managed to train such a big network by splitting the training across several notebooks on Kaggle (it probably works the same on a local computer). It is a variant of starting and stopping a computer to clear the memory: restarting the kernel clears the leaked GPU memory, and you just need to save the model every time. It also reminded me that I have not had time to open an issue related to export(); I will do it now.
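For anyone landing on this issue, the split looked roughly like the sketch below. It is only an illustration: it assumes learn$export() and a load_learner() function are usable from the R wrapper through reticulate (including the cpu argument), and the file names are made up. export() pickles the whole Learner (without the data), whereas save()/load() only store the weights and optimizer state, so with export() the learner does not have to be rebuilt from scratch in the next notebook.

# --- Notebook N: train a few epochs, then serialise the whole Learner ---
learn %>% fit_one_cycle(3)
learn$export("xse_resnext50_partN.pkl")     # pickled Learner, data excluded

# --- Notebook N+1: clean GPU; reload the Learner and reattach dataloaders ---
learn <- load_learner("xse_resnext50_partN.pkl", cpu = FALSE)
learn$dls <- dls                            # dls rebuilt as in the original notebook
learn %>% fit_one_cycle(3)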

turgut090 added the bug label on Jan 8, 2021
turgut090 (Member):

Were you able to train the model with TPUs on Colab?

Cdk29 (Author) commented Mar 2, 2021

Sorry, I did not have the opportunity to train a big model for the last competition on Kaggle. Next time, if I am not struggling too much with other aspects of the competition.
