
GPU memory allocated by make_context cannot be released when an exception occurs #335

Open
reallijie opened this issue Dec 20, 2021 · 11 comments

@reallijie

reallijie commented Dec 20, 2021

Describe the bug
I want to initialize as many CUDA contexts as possible in a multi-threaded environment, but when cuda.Device(0).make_context() throws an exception, the GPU memory already allocated by make_context is not released.

To Reproduce

import time
import logging

import pycuda.driver as cuda
# from PyQt5.QtCore import QThread


class GpuContext(object):
    def __init__(self, name=None):
        self.logging = logging.getLogger('GpuContext - ' + name)
        self.cuda_context = None
        self.cuda_context = cuda.Device(0).make_context()

    def __del__(self):
        if self.cuda_context:
            try:
                self.cuda_context.pop()
            except Exception:
                self.logging.exception('self.cuda_context.pop() error')

            try:
                self.cuda_context.detach()
            except Exception:
                self.logging.exception('self.cuda_context.detach() error')

            # Drop the reference so the context can be garbage-collected.
            self.cuda_context = None


# class GpuThread(QThread):
#     def __init__(self, gpuContext, parent=None):
#         super().__init__(parent)
#         self.gpuContext = gpuContext
#
#     def run(self):
#         pass


if __name__ == '__main__':
    cuda.init()
    gpuContexts = []
    for i in range(100):
        try:
            gpuContext = GpuContext(str(i))
            gpuContexts.append(gpuContext)
        except Exception:
            logging.exception('init error')
            break

    # Release every context, then idle so GPU memory usage can be inspected.
    while gpuContexts:
        gpuContext = gpuContexts.pop()
        del gpuContext

    while True:
        print('main')
        time.sleep(1)


Environment (please complete the following information):

  • OS: ubuntu18
  • CUDA version: 10.2, V10.2.89
  • CUDA driver version: 460.91.03
  • PyCUDA version: pycuda-2021.1
  • Python version: 3.6.9
@reallijie reallijie added the bug label Dec 20, 2021
@inducer
Owner

inducer commented Dec 20, 2021

What result are you observing? What result are you expecting? The error handling in that code path looks RAII-safe, so it should do the right thing:

pycuda/src/cpp/cuda.hpp

Lines 854 to 863 in 9f3b898

boost::shared_ptr<context> device::make_context(unsigned int flags)
{
  context::prepare_context_switch();
  CUcontext ctx;
  CUDAPP_CALL_GUARDED_THREADED(cuCtxCreate, (&ctx, flags, m_device));
  boost::shared_ptr<context> result(new context(ctx));
  context_stack::get().push(result);
  return result;
}

@reallijie
Author

ERROR:root:init error
Traceback (most recent call last):
  File "test_r.py", line 47, in <module>
    gpuContext = GpuContext(str(i))
  File "test_r.py", line 12, in __init__
    self.cuda_context = cuda.Device(0).make_context()
pycuda._driver.MemoryError: cuCtxCreate failed: out of memory
ERROR:GpuContext - 36:self.cuda_context.pop()
Traceback (most recent call last):
  File "test_r.py", line 17, in __del__
    self.cuda_context.pop()
pycuda._driver.LogicError: cuCtxPopCurrent failed: invalid device context
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuCtxDetach failed: invalid device context

@reallijie
Author

[screenshot of GPU memory usage]
When an exception occurs, how can the GPU memory be released? As shown in the figure, 39 MB of GPU memory remains occupied.

@reallijie
Author

It's not easy to explain.
Scenario 1: Under normal circumstances on a laptop (with a different graphics card), once all CUDA contexts are released, no GPU memory remains occupied. On my desktop, all CUDA contexts are released normally, but some GPU memory is still occupied.

Scenario 2: Deliberately trigger multiple exceptions; the GPU memory usage then keeps increasing.

@reallijie
Author

import time
import logging

import pycuda.driver as cuda
# from PyQt5.QtCore import QThread


class GpuContext(object):
    def __init__(self, name=None):
        self.logging = logging.getLogger('GpuContext - ' + name)
        self.cuda_context = None
        self.cuda_context = cuda.Device(0).make_context()

    def __del__(self):
        if self.cuda_context:
            try:
                self.cuda_context.pop()
            except Exception:
                self.logging.exception('self.cuda_context.pop() error')

            try:
                self.cuda_context.detach()
            except Exception:
                self.logging.exception('self.cuda_context.detach() error')

            # Drop the reference so the context can be garbage-collected.
            self.cuda_context = None


# class GpuThread(QThread):
#     def __init__(self, gpuContext, parent=None):
#         super().__init__(parent)
#         self.gpuContext = gpuContext
#
#     def run(self):
#         pass


if __name__ == '__main__':
    cuda.init()
    for j in range(10):
        gpuContexts = []
        for i in range(100):
            try:
                gpuContext = GpuContext(str(i))
                gpuContexts.append(gpuContext)
            except Exception:
                logging.exception('init error')
                break

        while gpuContexts:
            gpuContext = gpuContexts.pop()
            del gpuContext

    while True:
        print('main')
        time.sleep(1)

@reallijie
Author

[screenshot of GPU memory usage]

@reallijie
Author

After triggering 10 exceptions, 390 MB of GPU memory remains occupied.

@reallijie
Author

There may be a problem in the make_context method: prepare_context_switch switches away from (pops) the current context, but then the GPU memory allocation for the new context fails. The previously popped context is never restored, which later results in the cuCtxPopCurrent failed: invalid device context exception.
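This hypothesized sequence can be sketched as a toy model in pure Python. Everything below is a stand-in for illustration, not real pycuda or CUDA driver API:

```python
# Toy model of the hypothesized failure mode. All names here are
# illustrative stand-ins; none of this is real pycuda/CUDA API.

class OutOfMemory(Exception):
    pass

context_stack = []  # models the per-thread context stack

def prepare_context_switch():
    # Models popping the currently active context before switching.
    if context_stack:
        context_stack.pop()

def make_context(fail=False):
    prepare_context_switch()   # previous context is popped here ...
    if fail:                   # ... but the new context's allocation fails
        raise OutOfMemory("cuCtxCreate failed: out of memory")
    ctx = object()
    context_stack.append(ctx)
    return ctx

make_context()                 # first context: created and pushed
try:
    make_context(fail=True)    # simulated out-of-memory
except OutOfMemory:
    pass

# The first context was popped by prepare_context_switch and never
# restored, so a later pop() on it fails, matching the
# "cuCtxPopCurrent failed: invalid device context" error above.
print(len(context_stack))      # 0: the stack was left inconsistent
```

If this model matches the real behavior, the leaked memory would belong to contexts that were popped but never restored or destroyed.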

@reallijie
Author

Is the cuCtxPopCurrent failed: invalid device context exception caused by the unsuccessful pop of the previous context? And is that why the GPU memory is never released?

@inducer
Owner

inducer commented Dec 20, 2021

I suspect that prepare_context_switch leaves the context stack in an inconsistent state in case of an error. It should be replaced with a RAII construct that restores the previous state if the switch did not succeed.
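A minimal pure-Python sketch of that RAII idea follows. The real fix would live in pycuda's C++ layer; all names below are hypothetical stand-ins:

```python
# Hypothetical sketch: a guard that pops the current context for a
# switch, but re-pushes it if creating the new context fails.

class ContextSwitchGuard:
    def __init__(self, stack):
        self.stack = stack
        self.previous = None
        self.committed = False

    def __enter__(self):
        if self.stack:
            self.previous = self.stack.pop()
        return self

    def commit(self):
        # Call once the new context has been created successfully.
        self.committed = True

    def __exit__(self, exc_type, exc, tb):
        # On failure, restore the previous context instead of leaving
        # the stack in an inconsistent state.
        if not self.committed and self.previous is not None:
            self.stack.append(self.previous)
        return False  # never swallow the exception

def make_context(stack, create):
    with ContextSwitchGuard(stack) as guard:
        ctx = create()     # may raise, like cuCtxCreate on out-of-memory
        guard.commit()
    stack.append(ctx)
    return ctx

stack = ["ctx0"]

def failing_create():
    raise MemoryError("cuCtxCreate failed: out of memory")

try:
    make_context(stack, failing_create)
except MemoryError:
    pass
print(stack)  # ['ctx0']: the previous context was restored
```

The key design point is that the guard separates "switch away" from "switch succeeded": the exception still propagates to the caller, but the stack is back in a consistent state, so later pop/detach calls on the previous context would not hit an invalid device context.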

@inducer
Owner

inducer commented Dec 20, 2021

It'll be a while before I have time to look into this. PRs welcome in the meantime!
