Out of memory halfway through #137
Unanswered
YuanbinLiu
asked this question in Q&A
Replies: 3 comments
-
This is a little weird. Is this happening on the main branch?
-
It happened during training.
-
Could you share the log file so I can have a look? Are you using the main branch or develop? You can reduce the batch size to avoid that.
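The two mitigations in play here, reducing the batch size and the `max_split_size_mb` setting suggested by the error message itself, can be combined. A hypothetical sketch follows; the flag name `--batch_size` and the `run_train.py` entry point are assumptions, so verify them against `--help` for your MACE version:

```shell
# Mitigate allocator fragmentation, as the OOM message itself suggests.
# 128 MB is a common starting value; tune it for your workload.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Hypothetical training invocation: a smaller batch size lowers peak GPU memory.
# Flag names are assumptions; check your MACE version's --help.
python run_train.py --batch_size=8 ...
```

Note that `PYTORCH_CUDA_ALLOC_CONF` must be set before the Python process starts, since the CUDA caching allocator reads it at initialization.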
-
I am using an A100 (40 GB) to fit MACE. The program didn't report any errors at the beginning, but ran out of memory at the 170th epoch. The error message is as follows:

```
Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.75 GiB (GPU 0; 39.41 GiB total capacity; 29.35 GiB already allocated; 2.65 GiB free; 35.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
```
Why would it run out of memory halfway through? Is there any solution to this problem?
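One generic cause of memory that only runs out after many epochs (not confirmed to be the issue in MACE's training loop, just an illustration) is accumulating loss tensors instead of Python floats, which keeps every autograd graph alive. The sketch below uses a toy model; all names are hypothetical:

```python
import torch

# Toy model standing in for any training loop; names are hypothetical.
model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

running_loss = 0.0
for step in range(5):
    x = torch.randn(4, 8)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # BAD:  running_loss += loss      # keeps the autograd graph -> memory creeps up
    # GOOD: .item() detaches to a Python float, so each graph can be freed
    running_loss += loss.item()

print(f"mean loss over 5 steps: {running_loss / 5:.4f}")
```

When reserved memory greatly exceeds allocated memory, as in the traceback above, fragmentation is the more likely culprit, and `max_split_size_mb` is the knob the error message points at.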