
BUG: Output of nvml.nvmlDeviceGetComputeRunningProcesses reports pid as usedGpuMemory and usedGpuMemory as pid #50

Open
erikhuck opened this issue Oct 31, 2023 · 1 comment

Comments

erikhuck commented Oct 31, 2023

Description

When using nvmlDeviceGetComputeRunningProcesses to get the usedGpuMemory of every process using a particular GPU (in this case, GPU 0), the results appeared to be erroneous. Compared with the output of nvidia-smi in the terminal, the usedGpuMemory field contained the process ID, while the pid field contained the used GPU memory, so the two values were swapped. Sometimes other fields of the process objects contained the process ID or GPU memory values instead, so the fields returned by nvmlDeviceGetComputeRunningProcesses appeared shuffled overall. Investigation is warranted to ensure nvmlDeviceGetComputeRunningProcesses consistently provides correct output.
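
One way to see the swap on an affected system is to check whether the reported pid actually corresponds to a live process. A minimal sketch along those lines (Linux-only, and assuming the standard pid / usedGpuMemory fields on the returned process objects):

import os

import pynvml.nvml as nvml

nvml.nvmlInit()
handle = nvml.nvmlDeviceGetHandleByIndex(0)
for proc in nvml.nvmlDeviceGetComputeRunningProcesses(handle):
    # On a healthy driver, proc.pid is a real process and /proc/<pid> exists.
    # When the fields are swapped, the "pid" is really a byte count and usually
    # does not correspond to any running process.
    exists = os.path.exists(f'/proc/{proc.pid}')
    print(f'pid={proc.pid} (exists={exists}), usedGpuMemory={proc.usedGpuMemory}')
nvml.nvmlShutdown()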

Code for reproducing the bug

import time

import pynvml.nvml as nvml
import multiprocess as mp
import torch

def main():
    event = mp.Event()
    profiling_process = mp.Process(target=_profile_resources, kwargs={'event': event})
    profiling_process.start()
    with mp.Pool(8) as pool:
        for res in [pool.apply_async(_multiprocess_task, (i,)) for i in range(12)]:
            res.get()
    event.set()
    profiling_process.join()
    profiling_process.close()

def _profile_resources(event):
    nvml.nvmlInit()
    while True:
        # Query the processes currently using GPU 0 and print their pid / usedGpuMemory fields.
        handle = nvml.nvmlDeviceGetHandleByIndex(0)
        gpu_processes = nvml.nvmlDeviceGetComputeRunningProcesses(handle)
        print(gpu_processes)
        time.sleep(.1)
        if event.is_set():
            break

def _multiprocess_task(num: int):
    # Allocate two CUDA tensors whose size grows with num so each task uses a different amount of GPU memory.
    t1 = torch.tensor([1.1] * int(5**num)).to(torch.device('cuda:0'))
    t2 = torch.tensor([2.2] * int(5**num)).to(torch.device('cuda:0'))
    time.sleep(1)
    return (t1 * t2).shape

if __name__ == '__main__':
    main()

Environment

torch==2.0.1
pynvml==11.5.0
CUDA version: 12.2
GPU Model: NVIDIA GeForce RTX 4080
Driver Version: 535.54.03

wookayin commented Nov 1, 2023

See this: wookayin/gpustat#161 (comment). NVIDIA drivers 535.43~86 are broken and will report wrong process information.
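
If it helps, the installed driver version can be read through NVML itself, so it can be checked against the affected range. A minimal sketch using nvmlSystemGetDriverVersion (available in both pynvml and nvidia-ml-py):

import pynvml.nvml as nvml

nvml.nvmlInit()
# Driver version string, e.g. "535.54.03"; older bindings may return bytes instead of str.
print(nvml.nvmlSystemGetDriverVersion())
nvml.nvmlShutdown()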

Also, this repository is not the right place to report this for the pynvml package. I recommend you use the official bindings, nvidia-ml-py.
