
CUDA Out Of Memory issue #348

Open
emil-peters opened this issue Oct 28, 2024 · 3 comments

Comments

@emil-peters
Contributor

During testing I kept getting CUDA OOM errors while running code under pyinstrument in which multiple models were run one after another. Even after making sure no reference to the tensors was kept in the Python code, the CUDA OOM errors persisted with pyinstrument enabled. Once it was disabled, the errors disappeared and my VRAM was released as expected after each reference was deleted.

Is there an option to ensure pyinstrument clears its references to ONNX and torch tensors, especially after calling del tensor? I'd like to keep using pyinstrument, but at the moment it isn't feasible.

  • Emil
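
For reference, here is a minimal sketch of the pattern I mean (the model is just a placeholder, not my actual code, and it assumes a CUDA-capable machine): each model is built, run, and fully dereferenced before the next one starts, yet the memory only comes back reliably when pyinstrument is not in the picture.

    import gc
    import torch
    from pyinstrument import Profiler

    def make_model():
        # placeholder for the real models; each one allocates a noticeable chunk of VRAM
        return torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda()

    profiler = Profiler()
    profiler.start()
    for _ in range(4):                    # several models run one after another
        model = make_model()
        with torch.no_grad():
            out = model(torch.randn(64, 4096, device="cuda"))
        del out, model                    # no Python reference to the tensors remains
        gc.collect()
        torch.cuda.empty_cache()          # without pyinstrument, VRAM drops back here
    profiler.stop()
    print(profiler.output_text())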
@Aedial

Aedial commented Nov 6, 2024

I have a similar problem where a relatively heavy object is not garbage collected when I leave the context, even with del (Python 3.12, interval = 0.1). The growth shows up rather starkly in tracemalloc, with the number of objects growing by exactly the number of instantiations (or a multiple of it). This results in an OOM of the whole process after a few minutes.
This only happens when using pyinstrument; RAM usage stays stable with any other profiler. I have been using pyinstrument for years and I don't recall such a problem before (perhaps it appeared with the move from 3.7 to 3.12?). Might be related to #296.
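
Roughly how I'm observing it (the Heavy class here is only a stand-in for my actual object, not the real code):

    import gc
    import tracemalloc
    from pyinstrument import Profiler

    class Heavy:
        # stand-in for the relatively heavy object that never gets collected
        def __init__(self):
            self.payload = bytearray(50 * 1024 * 1024)

    tracemalloc.start()
    for i in range(20):
        profiler = Profiler(interval=0.1)
        profiler.start()
        obj = Heavy()
        del obj                    # reference dropped before the profiler stops
        profiler.stop()
        gc.collect()
        current, peak = tracemalloc.get_traced_memory()
        print(f"iteration {i}: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")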

@davidemassarenti-optio3

I'm encountering a similar problem. I tracked it down to calls to output_html.

        profiler.stop()
        profiler.output_html()
        profiler.reset()

Using 4.6.2, memory usage (max RSS) climbs ~2MB over 100 profiling sessions.
Using 5.0.0, memory usage climbs ~40MB for the same number of sessions.

If I comment out the call to output_html, the memory stays steady.
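
Something along these lines reproduces the numbers (a sketch rather than my exact harness; it assumes ru_maxrss is reported in kilobytes, as on Linux, and uses a sleep as a stand-in for the real workload):

    import resource
    import time
    from pyinstrument import Profiler

    def max_rss_mb():
        # ru_maxrss is in kilobytes on Linux (bytes on macOS)
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

    baseline = max_rss_mb()
    profiler = Profiler()
    for _ in range(100):
        profiler.start()
        time.sleep(0.01)                 # stand-in for the real workload
        profiler.stop()
        html = profiler.output_html()    # commenting this out keeps RSS steady
        profiler.reset()
    print(f"max RSS grew by {max_rss_mb() - baseline:.1f} MB over 100 sessions")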

@xiaobanni

xiaobanni commented Nov 18, 2024

> During testing I kept getting CUDA OOM errors while running code under pyinstrument in which multiple models were run one after another. Even after making sure no reference to the tensors was kept in the Python code, the CUDA OOM errors persisted with pyinstrument enabled. Once it was disabled, the errors disappeared and my VRAM was released as expected after each reference was deleted.
>
> Is there an option to ensure pyinstrument clears its references to ONNX and torch tensors, especially after calling del tensor? I'd like to keep using pyinstrument, but at the moment it isn't feasible.
>
> • Emil

I am facing the same problem. My code, which uses torch on the GPU, runs fine under plain python, but raises torch.OutOfMemoryError: CUDA out of memory when launched with pyinstrument.
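
A minimal script along the lines of what I'm running (illustrative only, not the actual project code; adjust the tensor size to your card):

    # repro.py -- illustrative reproduction, not the real workload
    import torch

    for step in range(100):
        x = torch.empty(1024, 1024, 1024, device="cuda")  # ~4 GB at float32; adjust to your GPU
        del x
        torch.cuda.empty_cache()

    # python repro.py        -> completes, VRAM is released each iteration
    # pyinstrument repro.py  -> hits torch.OutOfMemoryError: CUDA out of memory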
