
GPU memory allocated by make_context cannot be released when an exception occurs #335

Open
reallijie opened this issue Dec 20, 2021 · 11 comments

@reallijie

reallijie commented Dec 20, 2021

Describe the bug
I want to initialize as many CUDA contexts as possible in a multi-threaded environment, but when cuda.Device(0).make_context() throws an exception, the GPU memory already allocated by make_context is not released.

To Reproduce

import time
import logging

import pycuda.driver as cuda
# from PyQt5.QtCore import QThread


class GpuContext(object):
    def __init__(self, name=None):
        self.logging = logging.getLogger('GpuContext - ' + name)
        self.cuda_context = None
        self.cuda_context = cuda.Device(0).make_context()

    def __del__(self):
        if self.cuda_context:
            try:
                self.cuda_context.pop()
            except Exception:
                self.logging.exception('self.cuda_context.pop() error')

            try:
                self.cuda_context.detach()
            except Exception:
                self.logging.exception('self.cuda_context.detach() error')

            # Drop the reference so the context can be garbage-collected.
            self.cuda_context = None


# class GpuThread(QThread):
#     def __init__(self, gpuContext, parent=None):
#         super().__init__(parent)
#         self.gpuContext = gpuContext
#
#     def run(self):
#         pass


if __name__ == '__main__':
    cuda.init()
    gpuContexts = []
    for i in range(100):
        try:
            gpuContext = GpuContext(str(i))
            gpuContexts.append(gpuContext)
        except Exception:
            logging.exception('init error')
            break

    # Release every context, then idle so GPU memory usage can be inspected.
    while gpuContexts:
        gpuContext = gpuContexts.pop()
        del gpuContext

    while True:
        print('main')
        time.sleep(1)


Environment (please complete the following information):

  • OS: ubuntu18
  • CUDA version: 10.2, V10.2.89
  • CUDA driver version: 460.91.03
  • PyCUDA version: pycuda-2021.1
  • Python version: 3.6.9
@reallijie reallijie added the bug label Dec 20, 2021
@inducer
Owner

inducer commented Dec 20, 2021

What result are you observing? What result are you expecting? The error handling in that code path looks RAII-safe, so it should do the right thing:

pycuda/src/cpp/cuda.hpp

Lines 854 to 863 in 9f3b898

boost::shared_ptr<context> device::make_context(unsigned int flags)
{
  context::prepare_context_switch();
  CUcontext ctx;
  CUDAPP_CALL_GUARDED_THREADED(cuCtxCreate, (&ctx, flags, m_device));
  boost::shared_ptr<context> result(new context(ctx));
  context_stack::get().push(result);
  return result;
}

@reallijie
Author

ERROR:root:init error
Traceback (most recent call last):
  File "test_r.py", line 47, in <module>
    gpuContext = GpuContext(str(i))
  File "test_r.py", line 12, in __init__
    self.cuda_context = cuda.Device(0).make_context()
pycuda._driver.MemoryError: cuCtxCreate failed: out of memory
ERROR:GpuContext - 36:self.cuda_context.pop()
Traceback (most recent call last):
  File "test_r.py", line 17, in __del__
    self.cuda_context.pop()
pycuda._driver.LogicError: cuCtxPopCurrent failed: invalid device context
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuCtxDetach failed: invalid device context

@reallijie
Author

[screenshot of GPU memory usage]
When an exception occurs, how can the GPU memory be released? As shown in the figure, 39 MB of GPU memory remains occupied.

@reallijie
Author

It's not easy to explain.
Scenario 1: Under normal circumstances on a laptop (with a different graphics card), once all CUDA contexts are released, no GPU memory remains occupied. On my desktop, all CUDA contexts are released normally, but some GPU memory is still occupied.

Scenario 2: Deliberately trigger multiple exceptions; the GPU memory usage then keeps increasing.

@reallijie
Author

import time
import logging

import pycuda.driver as cuda
# from PyQt5.QtCore import QThread


class GpuContext(object):
    def __init__(self, name=None):
        self.logging = logging.getLogger('GpuContext - ' + name)
        self.cuda_context = None
        self.cuda_context = cuda.Device(0).make_context()

    def __del__(self):
        if self.cuda_context:
            try:
                self.cuda_context.pop()
            except Exception:
                self.logging.exception('self.cuda_context.pop() error')

            try:
                self.cuda_context.detach()
            except Exception:
                self.logging.exception('self.cuda_context.detach() error')

            # Drop the reference so the context can be garbage-collected.
            self.cuda_context = None


# class GpuThread(QThread):
#     def __init__(self, gpuContext, parent=None):
#         super().__init__(parent)
#         self.gpuContext = gpuContext
#
#     def run(self):
#         pass


if __name__ == '__main__':
    cuda.init()
    for j in range(10):
        gpuContexts = []
        for i in range(100):
            try:
                gpuContext = GpuContext(str(i))
                gpuContexts.append(gpuContext)
            except Exception:
                logging.exception('init error')
                break

        while gpuContexts:
            gpuContext = gpuContexts.pop()
            del gpuContext

    while True:
        print('main')
        time.sleep(1)

@reallijie
Author

[screenshot of GPU memory usage]

@reallijie
Author

After triggering 10 exceptions, 390 MB of GPU memory remains occupied.

@reallijie
Author

There may be a problem in the make_context method: prepare_context_switch switches away from (pops) the current context, but then the GPU memory allocation for the new context fails. The previously popped context is never restored, which later results in the cuCtxPopCurrent failed: invalid device context exception.
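This hypothesized sequence can be sketched as a toy model in pure Python. Everything below is a stand-in for illustration, not real pycuda or CUDA driver API:

```python
# Toy model of the hypothesized failure mode. All names here are
# illustrative stand-ins; none of this is real pycuda/CUDA API.

class OutOfMemory(Exception):
    pass

context_stack = []  # models the per-thread context stack

def prepare_context_switch():
    # Models popping the currently active context before switching.
    if context_stack:
        context_stack.pop()

def make_context(fail=False):
    prepare_context_switch()   # previous context is popped here ...
    if fail:                   # ... but the new context's allocation fails
        raise OutOfMemory("cuCtxCreate failed: out of memory")
    ctx = object()
    context_stack.append(ctx)
    return ctx

make_context()                 # first context: created and pushed
try:
    make_context(fail=True)    # simulated out-of-memory
except OutOfMemory:
    pass

# The first context was popped by prepare_context_switch and never
# restored, so a later pop() on it fails, matching the
# "cuCtxPopCurrent failed: invalid device context" error above.
print(len(context_stack))      # 0: the stack was left inconsistent
```

If this model matches the real behavior, the leaked memory would belong to contexts that were popped but never restored or destroyed.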

@reallijie
Author

Is the cuCtxPopCurrent failed: invalid device context exception caused by the unsuccessful pop of the previous context? And is that why the GPU memory is never released?

@inducer
Owner

inducer commented Dec 20, 2021

I suspect that prepare_context_switch leaves the context stack in an inconsistent state in case of an error. It should be replaced with a RAII construct that restores the previous state if the switch did not succeed.
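A minimal pure-Python sketch of that RAII idea follows. The real fix would live in pycuda's C++ layer; all names below are hypothetical stand-ins:

```python
# Hypothetical sketch: a guard that pops the current context for a
# switch, but re-pushes it if creating the new context fails.

class ContextSwitchGuard:
    def __init__(self, stack):
        self.stack = stack
        self.previous = None
        self.committed = False

    def __enter__(self):
        if self.stack:
            self.previous = self.stack.pop()
        return self

    def commit(self):
        # Call once the new context has been created successfully.
        self.committed = True

    def __exit__(self, exc_type, exc, tb):
        # On failure, restore the previous context instead of leaving
        # the stack in an inconsistent state.
        if not self.committed and self.previous is not None:
            self.stack.append(self.previous)
        return False  # never swallow the exception

def make_context(stack, create):
    with ContextSwitchGuard(stack) as guard:
        ctx = create()     # may raise, like cuCtxCreate on out-of-memory
        guard.commit()
    stack.append(ctx)
    return ctx

stack = ["ctx0"]

def failing_create():
    raise MemoryError("cuCtxCreate failed: out of memory")

try:
    make_context(stack, failing_create)
except MemoryError:
    pass
print(stack)  # ['ctx0']: the previous context was restored
```

The key design point is that the guard separates "switch away" from "switch succeeded": the exception still propagates to the caller, but the stack is back in a consistent state, so later pop/detach calls on the previous context would not hit an invalid device context.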

@inducer
Owner

inducer commented Dec 20, 2021

It'll be a while before I have time to look into this. PRs welcome in the meantime!
