Cuda OOM / GPU memory leak due to transforms on the GPU (DeepEdit) #6626
Comments
Hi @matt3o, thanks for your detailed investigation here. Hi @ericspod, could you please also share some comments here?
Hi @matt3o, thanks for the work, but yes, this is a known issue that usually doesn't show up in most usage of MONAI, which is relatively straightforward training scripts. I'm pretty sure the issue is what you're describing: Python objects retain a hold of main and GPU memory, but are scheduled for deletion by the garbage collector based on main memory usage only, so objects taking up little main memory but large GPU chunks need to be explicitly cleaned up. I don't see an easy fix for this. Ideally the garbage collector would be modified to take multiple sorts of memory usage into account, but this is way beyond the scope of anything we can do and of what Python is intended for. I don't think this is a MONAI-specific problem or even related to PyTorch; it is a property of how garbage-collected languages work, which we need to work around. I have had to use the GarbageCollector handler myself for this reason.

We can ameliorate the problem somewhat by being more careful about tensor handling: don't create large numbers of temporary tensors during calculation, reuse tensors and use in-place ops when possible, don't repeatedly recreate Python objects which could retain references to tensors and cause more things not to get cleaned up, and otherwise keep in mind that the collector isn't magical and could use some help.

Is your solution working for you completely then? If not, we need to look into the transforms and other code involved to see where references are being retained. Either way we should have a notebook in the tutorials repo on optimisation and pitfalls of this sort.
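As a concrete illustration of that advice (my own sketch, not code from the comment or from MONAI), the two variants below compute the same normalisation, but the in-place one avoids the chain of temporary GPU tensors:

```python
# A minimal illustration of the tensor-handling advice above: prefer in-place
# ops and buffer reuse over chains of temporary GPU tensors, so there is less
# for the garbage collector to track.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
signal = torch.rand(1, 128, 128, 128, device=device)

# Temporary-heavy version: every step allocates a fresh GPU tensor.
# normalized = (signal - signal.min()) / (signal.max() - signal.min())

# In-place version: reuses the existing buffer.
lo, hi = signal.min(), signal.max()
signal.sub_(lo).div_(hi - lo)
```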
Hey @KumoLiu and @ericspod! Thanks for your quick and extensive responses. What would be good, at the very least, is some documentation on how to handle OOMs that suggests trying out the GarbageCollector to see if it fixes the issue. So I really like your proposal of a notebook on those pitfalls. As of right now I did not find much information when googling the issues I ran into, which was really frustrating.
Hi all, I've added an issue on the tutorials repo about adding a tutorial on memory management. Let's consider adding this to the features we'd like to target for 1.3.
Pretty sure by now the problem is linked to this bug report from PyTorch: pytorch/pytorch#50185. In another report the PyTorch people describe one other solution: explicitly deleting everything after usage, pytorch/pytorch#20199 (comment). Maybe that helps someone in the future.
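A rough sketch of that explicit-deletion pattern (the training-loop names here are placeholders, not the DeepEdit code):

```python
# Sketch of the "explicitly delete everything after usage" workaround from the
# linked PyTorch issue; `model`, `loader`, `optimizer` and `loss_fn` are
# hypothetical placeholders.
import gc
import torch

def train_one_epoch(model, loader, optimizer, loss_fn, device):
    for batch in loader:
        image = batch["image"].to(device)
        label = batch["label"].to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(image), label)
        loss.backward()
        optimizer.step()
        # Drop the GPU references explicitly so the collector and the CUDA
        # caching allocator can actually reclaim the memory.
        del image, label, loss
    gc.collect()
    torch.cuda.empty_cache()
```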
Describe the bug
I have converted the DeepEdit transforms (https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/deepedit/transforms.py) to run on the GPU instead of the CPU. I just spent over a month debugging the code, since modifying the transforms to run on the GPU completely prevented me from running the code without OOM errors. I will paste some funny images below.
This means I could no longer run the code on an 11 GB GPU (smaller crop size) or a 24 GB GPU. Having gotten access to a big cluster, I tried a 50 GB and even an 80 GB GPU and the code still crashed.
Most confusing of all, the crashes were apparently random, always at different epochs even when the same code was run twice. The memory usage appeared to conform to no pattern.
After debugging my own code for weeks, I realized by inspecting the garbage collection statistics that some references are never cleared and the GC object count always increases. This insight helped me find this issue: #3423, which describes the problem pretty well.
The problematic and nondeterministic behavior is linked to the garbage collector, which only cleans up references if they use a lot of memory. That was fine for the previous transforms, since they ran in RAM, where the orphaned memory areas are rather big and get cleaned up very soon.
This is not true, however, for GPU pointers in torch, which are then cleared at random times, but apparently not often enough for the code to work. This also explains why calling torch.cuda.empty_cache() would not bring any relief: the references to the memory still existed even though they were out of scope, so torch does not know that it can release the GPU memory.
The fix for this random behavior is to add a GarbageCollector(trigger_event="iteration") handler to the training and validation handlers.
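A minimal sketch of what that looks like, assuming a standard SupervisedTrainer/SupervisedEvaluator setup (all other arguments omitted):

```python
from monai.handlers import GarbageCollector

# Runs gc.collect() after every iteration so that out-of-scope Python objects
# holding GPU tensors are released promptly instead of at the collector's whim.
train_handlers = [GarbageCollector(trigger_event="iteration")]
val_handlers = [GarbageCollector(trigger_event="iteration")]

# trainer = SupervisedTrainer(..., train_handlers=train_handlers)
# evaluator = SupervisedEvaluator(..., val_handlers=val_handlers)
```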
I did not find any MONAI docs which mention this behaviour, specifically when it comes to debugging OOM or cuDNN errors. However, since the GarbageCollector handler already exists, I guess other people must have run into this issue as well, which makes it even more frustrating to me.
--> Conclusion: I am not sure if there is an easy solution to this problem. Seeing that other people are running into this issue, and since these are hard, nondeterministic bugs, it is very important to fix it imo. What I do not know is how complex a fix would be; maybe someone here knows more. I also don't know if this behavior sometimes occurs when using pure PyTorch code. However, if this is MONAI specific, it is framework breaking.
As a temporary fix I can add: the overhead of calling the GarbageCollector in my case appears to be negligible. Maybe this should be a default handler for SupervisedTrainer and SupervisedEvaluator, only to be turned off with a performance flag if needed.
To Reproduce
Run the DeepEdit code and follow the speedup guide, more specifically move the transforms to the GPU.
In my experience, adding ToTensord(keys=("image", "label"), device=device, track_meta=False) at the end of the transform chain is already enough to let the GPU memory run out, or at least increase it extremely and, most importantly, non-deterministically.
I did, however, rework all of the transforms, moving everything, including FindDiscrepancyRegionsDeepEditd, AddRandomGuidanceDeepEditd and AddGuidanceSignalDeepEditd, to the GPU. (Also see #1332 about that.)
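A cut-down sketch of that device placement (the keys and the transforms before ToTensord are stand-ins for the full DeepEdit pre-processing pipeline):

```python
import torch
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged, ToTensord

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_transforms = Compose([
    LoadImaged(keys=("image", "label")),
    EnsureChannelFirstd(keys=("image", "label")),
    # ... remaining (DeepEdit) transforms ...
    # Placing the tensors on the GPU here is already enough to make GPU memory
    # usage grow non-deterministically without the GarbageCollector handler.
    ToTensord(keys=("image", "label"), device=device, track_meta=False),
])
```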
Expected behavior
No memory leak.
Screenshots
Important info before the images: training and validation were cropped to a fixed size. So in theory the GPU memory usage should remain constant over the epochs, but differ between training and validation. The spikes seen in the later images are due to the validation, which only ran every 10 epochs. The important thing here is that these spikes do not increase over time.
x axis: iterations, y axis: amount of GPU memory used as returned by nvmlDeviceGetMemoryInfo()
one epoch is about 400 samples for training and 100 for validation
Initial runs of the code
After a few weeks I got it to a point where it ran much more consistently. Interestingly, some operations introduce more non-determinism in the GPU memory usage; I developed a feeling for which ones those were and removed or replaced them with different operations. The result is the image below. However, there is clearly still something fishy going on.
For comparison, how it looks after adding the GarbageCollector (iteration-level cleanup)
And using the GarbageCollector but with epoch-level cleanup (this only works on the 80 GB GPU and crashes on the 50 GB one; as we can see in the image above, the "actually needed memory" is 33 GB for this setting, while with garbage collection per iteration we need at least 71 GB and it might still crash on some bad GC day). What can clearly be seen is a lot more jitter.
Environment
This bug exists independently of the environment. I started with the officially recommended one, tried out different CUDA versions, and in the end upgraded to the most recent torch version to see if it would maybe be fixed there. I will paste the output from the last environment, even though I know it will not be supported. You can verify this bug on the default MONAI pip installation as well, however.
Additional context
I will publish the code with my master's thesis at the end of September, so if it should be necessary, I might be able to share it beforehand.