Out of memory halfway through #137
Unanswered
YuanbinLiu
asked this question in Q&A
Replies: 3 comments
-
This is a little weird. Is this happening on the main branch?
-
It happened during training.
-
Could you share the log file so I can have a look? Are you using the main branch or develop? You can reduce the batch size to avoid that.
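The two mitigations in play here, reducing the batch size and the `max_split_size_mb` setting suggested by the error message itself, can be combined. A hypothetical sketch follows; the flag name `--batch_size` and the `run_train.py` entry point are assumptions, so verify them against `--help` for your MACE version:

```shell
# Mitigate allocator fragmentation, as the OOM message itself suggests.
# 128 MB is a common starting value; tune it for your workload.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Hypothetical training invocation: a smaller batch size lowers peak GPU memory.
# Flag names are assumptions; check your MACE version's --help.
python run_train.py --batch_size=8 ...
```

Note that `PYTORCH_CUDA_ALLOC_CONF` must be set before the Python process starts, since the CUDA caching allocator reads it at initialization.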
-
I am using an A100 (40 GB) to fit MACE. The program didn't report any errors at the beginning, but ran out of memory at the 170th epoch. The error message is as follows:

```
Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.75 GiB (GPU 0; 39.41 GiB total capacity; 29.35 GiB already allocated; 2.65 GiB free; 35.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
```
Why would it run out of memory halfway through? Is there any solution to this problem?
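One generic cause of memory that only runs out after many epochs (not confirmed to be the issue in MACE's training loop, just an illustration) is accumulating loss tensors instead of Python floats, which keeps every autograd graph alive. The sketch below uses a toy model; all names are hypothetical:

```python
import torch

# Toy model standing in for any training loop; names are hypothetical.
model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

running_loss = 0.0
for step in range(5):
    x = torch.randn(4, 8)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # BAD:  running_loss += loss      # keeps the autograd graph -> memory creeps up
    # GOOD: .item() detaches to a Python float, so each graph can be freed
    running_loss += loss.item()

print(f"mean loss over 5 steps: {running_loss / 5:.4f}")
```

When reserved memory greatly exceeds allocated memory, as in the traceback above, fragmentation is the more likely culprit, and `max_split_size_mb` is the knob the error message points at.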