
Error related to magma during training #101

Open · Cdk29 opened this issue Dec 25, 2020 · 15 comments
Labels: bug (Something isn't working)

Comments

Cdk29 (Contributor) commented Dec 25, 2020

@henry090: I am trying to train an xse_resnet50.

During training I got the following error:

R: /opt/conda/conda-bld/magma-cuda101_1583546950098/work/interface_cuda/interface.cpp:901: void magma_queue_create_from_cuda_internal(magma_device_t, cudaStream_t, cublasHandle_t, cusparseHandle_t, magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null' failed.

It is not a simple out-of-memory error; it seems to be some kind of memory leak related to magma, similar to the one reported here (/pytorch/issues/26120).

But I did not find any mention of this bug occurring with fastai, which I would have expected if it happened regularly, except for this message on this thread: https://forums.fast.ai/t/a-walk-with-fastai2-vision-study-group-and-online-lectures-megathread/59929/1293 :

The only “new” thing I am doing is that I am encapsulating most of my code for training the model in a try/except block in a while loop.

I wonder if the memory leak is somehow due to using a function as a wrapper, or to reticulate.
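In R, that kind of wrapping would look roughly like the sketch below. It is an illustration only, not the actual code from the kernel or from the forum post; it assumes a learner learn already built, with fit_one_cycle() standing in for the real training call.

library(fastai)
library(magrittr)

# Sketch of the wrapper pattern: retry the training step inside a while loop
# and catch any error with tryCatch(). `learn` is assumed to exist already.
keep_going <- TRUE
while (keep_going) {
  tryCatch({
    learn %>% fit_one_cycle(1)   # stands in for the real training call
    keep_going <- FALSE          # stop looping once training succeeds
  }, error = function(e) {
    message("Training failed, retrying: ", conditionMessage(e))
  })
}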
Link to the code and the error: https://www.kaggle.com/cdk292/magma-error-xse-resnext50-with-r?scriptVersionId=50229515

The latest version is still running, but you can see the error in the execution log of version 4, and it will probably show up again in V6.

PS: Merry Christmas.

Cdk29 (Author) commented Dec 25, 2020

I am trying without $to_fp16() to see if the error comes from there.

turgut090 (Member):

I get the following with to_fp16:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
  AssertionError: Mixed-precision training requires a GPU, remove the call `to_fp16`

Detailed traceback: 
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/fastcore/logargs.py", line 56, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/fastai/callback/schedule.py", line 113, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd)
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/fastcore/logargs.py", line 56, in _f
    return inst if to_return else f(*args, **kwargs)
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/fastai/learner.py", line 207, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "/home/turgut/.local/share/r-miniconda/envs/r-reticulate/lib/python3.

Cdk29 (Author) commented Dec 25, 2020

Weird. You don't have a GPU?

I tried without $to_fp16(), but I just got a classic out-of-memory error. At first glance, it does seem to come from that.

With $to_fp16(), the full error message looks something like this:

/usr/local/share/.virtualenvs/r-reticulate/lib/python3.7/site-packages/fastai/learner.py:53: UserWarning: Could not load the optimizer state. 
 if with_opt: warn("Could not load the optimizer state.")
epoch   train_loss   valid_loss   accuracy   time
0       1.302142     1.065968     0.646413   12:58
1       1.160781     1.065510     0.620005   12:47
2       1.134573     1.001534     0.633092   12:51
R: /opt/conda/conda-bld/magma-cuda101_1583546950098/work/interface_cuda/interface.cpp:901: void 
magma_queue_create_from_cuda_internal(magma_device_t, cudaStream_t, cublasHandle_t, cusparseHandle_t, 
magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null' failed.

Any idea what could be happening?

Cdk29 (Author) commented Dec 25, 2020

@henry090: I wonder whether it is a more general memory-leak issue, as the GitHub issue I linked in the first post suggests (/pytorch/issues/26120). My technical background is light on that matter.

If so, I am not sure which models are affected. My guess is that it happens with every model but only becomes visible on the ones that require a lot of GPU computation, such as the xse_resnext. Or maybe it is a specific unit such as the SE block, but I am just guessing.

turgut090 (Member) commented Dec 25, 2020 via email

Cdk29 (Author) commented Dec 25, 2020

Same without fp16: a memory error, but only after a few epochs (!), and I get the same traceback/error if I look in the log.

Looks like a memory leak. I will try with a smaller batch size, or do a really quick and dirty training run of 3 epochs, save the model, and restart in another kernel.
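Roughly what I have in mind is the sketch below. It is only a sketch: it assumes the Python Learner methods save()/load() are reachable through reticulate's $ (as with $to_fp16() above), and the checkpoint names are arbitrary.

# --- Kernel 1: build dls/learn as usual, train a few epochs, then checkpoint ---
learn %>% fit_one_cycle(3)
learn$save("xse_resnext50_part1")    # by default written under models/ as a .pth file

# --- Kernel 2 (fresh GPU memory): rebuild the exact same dls/learn, then resume ---
learn$load("xse_resnext50_part1")    # reload the weights (and optimizer state)
learn %>% fit_one_cycle(3)
learn$save("xse_resnext50_part2")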

turgut090 (Member):

I reinstalled fastai and now it works fine with batch size == 10. This is the result for 1 epoch:

epoch   train_loss   valid_loss   accuracy   time  
------  -----------  -----------  ---------  ------
0       3.050841     1.718752     0.580000   00:07 
system('nvidia-smi',intern = T)
 [1] "Fri Dec 25 21:46:02 2020       "                                                
 [2] "+-----------------------------------------------------------------------------+"
 [3] "| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |"
 [4] "|-------------------------------+----------------------+----------------------+"
 [5] "| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |"
 [6] "| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |"
 [7] "|                               |                      |               MIG M. |"
 [8] "|===============================+======================+======================|"
 [9] "|   0  GeForce RTX 2060    Off  | 00000000:01:00.0  On |                  N/A |"
[10] "|  0%   42C    P8    10W / 170W |   3226MiB /  5931MiB |      4%      Default |"
[11] "|                               |                      |                  N/A |"
[12] "+-------------------------------+----------------------+----------------------+"
[13] "                                                                               "
[14] "+-----------------------------------------------------------------------------+"
[15] "| Processes:                                                                  |"
[16] "|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |"
[17] "|        ID   ID                                                   Usage      |"
[18] "|=============================================================================|"
[19] "|    0   N/A  N/A      1298      G   /usr/lib/xorg/Xorg                 85MiB |"
[20] "|    0   N/A  N/A      2205      G   compiz                             31MiB |"
[21] "|    0   N/A  N/A      4977      G   ...AAAAAAAAA= --shared-files       63MiB |"
[22] "|    0   N/A  N/A      6185      G   /usr/lib/rstudio/bin/rstudio       40MiB |"
[23] "|    0   N/A  N/A      6239      C   .../lib/rstudio/bin/rsession     2999MiB |"
[24] "+-----------------------------------------------------------------------------+"

turgut090 (Member):

This is my full code:

library(fastai)
library(magrittr)

# Read the competition csv files; keep the labels as character
df = data.table::fread("~/Downloads/plant/train.csv")
df[['label']] = as.character(df[['label']])
smp = data.table::fread("~/Downloads/plant/sample_submission.csv")

# Small random subset, just to iterate quickly
df = dplyr::sample_n(df,250)

# Batch-level augmentations
batch_tfms = list(RandomResizedCrop(256),
                  #DeterministicDihedral(),
                  #Warp(), Hue(), Saturation(),
                  Hue(0.2, 0.5),
                  aug_transforms(size = 300, max_rotate = 180.,
                                 max_lighting = 0.2, max_warp = 0.4,
                                 flip_vert = TRUE, max_zoom = 2.),
                  Normalize_from_stats( imagenet_stats() ))

# DataBlock: image paths resolved from the csv rows, labels from the 'label' column
plant = DataBlock(blocks = list(ImageBlock(), CategoryBlock()),
                         get_x = function(x) {paste('plant/train_images', x[[0]], sep = '/')},
                         get_y = ColReader('label'),
                         item_tfms = Resize(300),
                         splitter = RandomSplitter(),
                         batch_tfms = batch_tfms) 

dls = plant %>% dataloaders(df, bs = 10)

dls %>% show_batch(max_n = 10)

# Mixed-precision learner, trained for one cycle
learn <- dls %>% cnn_learner(xse_resnext50(), metrics = accuracy) #prettier
learn$to_fp16()
cyc = learn %>% fit_one_cycle(1)

Cdk29 (Author) commented Dec 25, 2020

Yes. For me it happens only after some epochs, whether it is on my local computer or on Kaggle.

turgut090 (Member):

I think it is a CUDA error, so it is not related to the wrapper, IMO. The Kaggle output also indicates that:

Error in py_call_impl(callable, dots$args, dots$keywords): RuntimeError: CUDA out of memory. 
Tried to allocate 20.00 MiB (GPU 0; 15.90 GiB total capacity; 13.80 GiB already allocated; 
9.75 MiB free; 15.08 GiB reserved in total by PyTorch)

You could try to skip lr_find and see if it might work without that.

With sample data it works fine:

cyc = learn %>% fit_one_cycle(10)
epoch   train_loss   valid_loss   accuracy   time 
------  -----------  -----------  ---------  -----
0       2.654885     1.186562     0.580000   00:07 
1       2.604904     1.621742     0.600000   00:06 
2       2.524296     2.486575     0.260000   00:07 
3       2.336369     1.181870     0.500000   00:07 
4       2.164644     1.734055     0.480000   00:07 
5       2.027067     1.226135     0.580000   00:07 
6       1.933184     1.205350     0.620000   00:07 
7       1.812553     1.124996     0.580000   00:07 
8       1.736114     1.102648     0.600000   00:07 
9       1.677608     1.133445     0.620000   00:07 

Cdk29 (Author) commented Dec 25, 2020

Yes. I just got suspicious because the only report I found said it happened because the person was putting everything into a function. Actually, that might simply have triggered the leak faster, since they report having to reboot their computer to clear the memory.

I'll let you close the issue; maybe people who run into this error will come across it and see that it is probably a memory leak.

turgut090 (Member):

Hi. Did you manage to solve this issue?

Cdk29 (Author) commented Jan 7, 2021

@henry090 Hi! Yes and no. The issue cannot really be solved directly, but I managed to train such a big network by splitting the training across several notebooks on Kaggle (it probably works the same on a local computer). It is a variant of starting and stopping a computer to clear the memory: restarting the kernel clears the leaked GPU memory, and you just need to save the model every time. It also reminded me that I have not had time to open an issue related to export(); I will do it now.
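For anyone landing on this issue, the split looked roughly like the sketch below. It is only an illustration: it assumes learn$export() and a load_learner() function are usable from the R wrapper through reticulate (including the cpu argument), and the file names are made up. export() pickles the whole Learner (without the data), whereas save()/load() only store the weights and optimizer state, so with export() the learner does not have to be rebuilt from scratch in the next notebook.

# --- Notebook N: train a few epochs, then serialise the whole Learner ---
learn %>% fit_one_cycle(3)
learn$export("xse_resnext50_partN.pkl")     # pickled Learner, data excluded

# --- Notebook N+1: clean GPU; reload the Learner and reattach dataloaders ---
learn <- load_learner("xse_resnext50_partN.pkl", cpu = FALSE)
learn$dls <- dls                            # dls rebuilt as in the original notebook
learn %>% fit_one_cycle(3)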

turgut090 added the bug label on Jan 8, 2021
turgut090 (Member):

Were you able to train the model with TPUs on Colab?

Cdk29 (Author) commented Mar 2, 2021

Sorry, I did not have the opportunity to train a big model for the last competition on Kaggle. Next time, if I am not struggling too much with other aspects of the competition.
