Error related to magma during training #101
Comments
I am trying without the $to_fp16() to see if the error comes from there.
I get the following with
Weird. You don't have a GPU? I tried without the $to_fp16() but I just got a classic out-of-memory error. At first glance it does seem to come from that. With $to_fp16() the full error message looks something like this:
Any idea on what could be happening?
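For context, the comparison being discussed here amounts to toggling mixed precision on an existing learner. A minimal sketch in R, assuming a learner object `learn` has already been built with the wrapper; the single epoch is only illustrative:

```r
library(fastai)

# Assumes `learn` is a fastai Learner built elsewhere (e.g. in the Kaggle notebook).
# Baseline: one epoch in full precision.
learn$fit_one_cycle(1L)

# The same epoch with mixed precision; $to_fp16() forwards to the Python
# Learner.to_fp16() method through reticulate.
learn_fp16 <- learn$to_fp16()
learn_fp16$fit_one_cycle(1L)
```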
@henry090: I wonder if it is not a more general memory leak issue, as the GitHub issue I linked in the first post suggests. My technical background is light on that matter (/pytorch/issues/26120). If so, I am not sure which models are affected. My guess is that it happens on every model but only becomes visible on the ones that require a lot of GPU computation, such as the xse_resnext ones. Or maybe a specific unit such as the SE block, but I am just guessing.
I probably installed the CPU version. Regarding your question, I advise running similar code from the Python side and seeing if it works. Btw, did it work without fp16?
…On Fri, Dec 25, 2020, 9:19 PM Cdk29 ***@***.***> wrote:
Weird. You don't have a GPU? I tried without the $to_fp16() but I just got a classic out-of-memory error. At first glance it does seem to come from that.
With $to_fp16() the full error message looks something like this:
/usr/local/share/.virtualenvs/r-reticulate/lib/python3.7/site-packages/fastai/learner.py:53: UserWarning: Could not load the optimizer state.
if with_opt: warn("Could not load the optimizer state.")
0 1.302142 1.065968 0.646413 12:58
1 1.160781 1.065510 0.620005 12:47
2 1.134573 1.001534 0.633092 12:51
R: /opt/conda/conda-bld/magma-cuda101_1583546950098/work/interface_cuda/interface.cpp:901: void
magma_queue_create_from_cuda_internal(magma_device_t, cudaStream_t, cublasHandle_t, cusparseHandle_t,
magma_queue**, const char*, const char*, int): Assertion `queue->dCarray__ != __null' failed.
Any idea on what could be happening?
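As an aside, the suggestion above to try the same thing from the Python side can be checked without leaving R. A minimal sketch, assuming reticulate and a PyTorch installation in the active Python environment; if CUDA is not available, the CPU-only build is installed and GPU training (fp16 or not) cannot work:

```r
library(reticulate)

# Import the Python torch module through reticulate and ask whether CUDA is usable.
torch <- import("torch")
print(torch$cuda$is_available())

# If a GPU is visible, print its name as well as the PyTorch and CUDA versions in use.
if (torch$cuda$is_available()) {
  print(torch$cuda$get_device_name(0L))
}
print(torch$`__version__`)
print(torch$version$cuda)
```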
Same without fp16: a memory error, but only after a few epochs (!), and I get the same traceback/error if I look into the log. Looks like a memory leak. I will try with a smaller batch size, or do a really quick and dirty training on 3 epochs before saving the model and restarting in another kernel.
I reinstalled fastai and now it works fine with batch size == 10. This is the result for 1 epoch:
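The thread does not show exactly how the reinstall was done; one possible way to refresh the Python-side fastai used by the wrapper from R is sketched below (reticulate::py_install() is a real helper, but whether it matches what was actually run here is an assumption):

```r
library(reticulate)

# Reinstall the Python fastai package that the R wrapper binds to.
py_install("fastai", pip = TRUE)

# The batch size itself (bs = 10 as reported above) is set wherever the
# DataLoaders are built in the notebook, by passing bs = 10L to the
# ImageDataLoaders constructor used there.
```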
This is my full code:
|
Yes. For me it happens only after some epochs, whether it is on my local computer or on Kaggle.
I think it is a CUDA error, so it is not related to the wrapper, IMO. The Kaggle output also indicates that:
You could try skipping lr_find and see if it works without it. With sample data it works fine:
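Skipping lr_find just means passing an explicit learning rate instead of a suggested one. A minimal sketch, again assuming `learn` exists; the 1e-3 value is only a placeholder:

```r
# Instead of:
#   learn$lr_find()
# train directly with a hand-picked learning rate.
learn$fit_one_cycle(3L, lr_max = 1e-3)
```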
Yes. I just got suspicious because the only report I found said it happened because the person had wrapped everything into a function. Actually, that might just have triggered the leak quicker, since they report having to reboot their computer to clear the memory. I will let you close the issue; maybe people who run into this error will come across it and see that it is probably a memory leak.
Hi. Did you manage to solve this issue?
@henry090 Hi! Yes and no. The issue cannot really be solved directly, but I managed to train such a big network by splitting the training across a lot of notebooks on Kaggle (it probably works the same on a local computer). It is a version of starting and stopping a computer to clear the memory: it works around the leak on the GPU, and you just need to save the model every time. Actually, it reminded me that I did not have time to open an issue related to export(); I will do it now.
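In code, this workaround boils down to checkpointing after a few epochs and resuming in a fresh kernel. A rough sketch, assuming the dataloaders and learner are rebuilt the same way in each notebook; the checkpoint name "stage-1" is arbitrary:

```r
# Notebook / kernel 1: train a few epochs, then save the weights.
learn$fit_one_cycle(3L)
learn$save("stage-1")

# Notebook / kernel 2 (fresh GPU memory): rebuild the same dls and learner,
# load the checkpoint and continue training.
learn$load("stage-1")
learn$fit_one_cycle(3L)
```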
Were you able to train the model with TPUs on Colab? |
Sorry, I did not have the opportunity to train a big model for the last competition on Kaggle. Next time, if I am not struggling too much with other aspects of the competition.
@henry090: I am trying to train an xse_resnet50.
During training I got the following error:
It is not a simple out-of-memory error; it seems to be some kind of memory leak related to magma, similar to the one reported here.
But: I did not find any mention of this bug occurring with fastai, which I would have expected if it happened regularly, except for this message on this thread: https://forums.fast.ai/t/a-walk-with-fastai2-vision-study-group-and-online-lectures-megathread/59929/1293 :
I wonder if the memory leak is not somehow due to using a function as a wrapper, or to reticulate.
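For readers landing here, the setup that triggers the error is essentially a large x-SE architecture trained in mixed precision through the R wrapper. A very rough sketch of that shape; the cnn_learner(), xse_resnext50() and accuracy() calls are assumptions about the wrapper's API, and the real code is in the Kaggle notebook linked below:

```r
library(fastai)

# dls: an ImageDataLoaders object built from the competition images (not shown).
# The architecture and metric below are assumptions; see the linked notebook
# for the actual calls.
learn <- cnn_learner(dls, xse_resnext50(), metrics = accuracy())
learn <- learn$to_fp16()

# Training for several epochs is where the magma assertion eventually fires.
learn$fit_one_cycle(5L)
```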
Link to the code and error: https://www.kaggle.com/cdk292/magma-error-xse-resnext50-with-r?scriptVersionId=50229515
The last version is still running, but you can see the error in the execution log of version 4, and it will probably show up again in V6.
PS: Merry Christmas.