BUG: Output of nvml.nvmlDeviceGetComputeRunningProcesses reports pid as usedGpuMemory and usedGpuMemory as pid

Description
When using nvmlDeviceGetComputeRunningProcesses to get the usedGpuMemory of all processes using a particular GPU (in this case, GPU 0), I saw erroneous results being reported. Compared with nvidia-smi in the terminal, the usedGpuMemory field contained the process ID, while the pid field, rather than containing the process ID, contained the used GPU memory; the two values were swapped. Sometimes other fields of the process object contained the process ID or the GPU memory value, so the fields of the process objects returned by nvmlDeviceGetComputeRunningProcesses appear to be shuffled overall. Investigation is warranted to ensure nvmlDeviceGetComputeRunningProcesses consistently provides correct output.
Code for reproducing the bug
import time

import pynvml.nvml as nvml
import multiprocess as mp
import torch


def main():
    event = mp.Event()
    # Poll NVML in a separate process while the pool does GPU work.
    profiling_process = mp.Process(target=_profile_resources, kwargs={'event': event})
    profiling_process.start()
    with mp.Pool(8) as pool:
        for res in [pool.apply_async(_multiprocess_task, (i,)) for i in range(12)]:
            res.get()
    event.set()
    profiling_process.join()
    profiling_process.close()


def _profile_resources(event):
    nvml.nvmlInit()
    while True:
        handle = nvml.nvmlDeviceGetHandleByIndex(0)
        gpu_processes = nvml.nvmlDeviceGetComputeRunningProcesses(handle)
        print(gpu_processes)
        time.sleep(.1)
        if event.is_set():
            break


def _multiprocess_task(num: int):
    # Allocate two tensors on GPU 0 so the worker shows up in the NVML process list.
    t1 = torch.tensor([1.1] * int(5**num)).to(torch.device('cuda:0'))
    t2 = torch.tensor([2.2] * int(5**num)).to(torch.device('cuda:0'))
    time.sleep(1)
    return (t1 * t2).shape


if __name__ == '__main__':
    main()
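To make the comparison with nvidia-smi easier, the sanity check below (a sketch, not part of the original report) prints each reported pid and usedGpuMemory and verifies that the pid at least corresponds to a live process; it assumes Linux, where /proc/<pid> exists for running processes. If the fields are swapped, the reported "pid" is a byte count and the check will typically fail.

import os
import pynvml.nvml as nvml

nvml.nvmlInit()
try:
    handle = nvml.nvmlDeviceGetHandleByIndex(0)
    for proc in nvml.nvmlDeviceGetComputeRunningProcesses(handle):
        # On Linux, a real PID should have a /proc entry while it is running.
        looks_like_pid = os.path.exists(f'/proc/{proc.pid}')
        print(f'pid={proc.pid} usedGpuMemory={proc.usedGpuMemory} '
              f'pid exists in /proc: {looks_like_pid}')
finally:
    nvml.nvmlShutdown()

Run this in a separate terminal while the reproduction script above is executing, so that GPU 0 has active compute processes to report.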
Environment
torch==2.0.1
pynvml==11.5.0
CUDA version: 12.2
GPU Model: NVIDIA GeForce RTX 4080
Driver Version: 535.54.03
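For completeness, a version listing like the one above can be gathered with the short sketch below (not part of the original report). It assumes pynvml was installed with pip so that importlib.metadata can report its version, and it reads the CUDA version supported by the installed driver via nvmlSystemGetCudaDriverVersion.

from importlib.metadata import version
import pynvml.nvml as nvml
import torch


def _to_str(value):
    # Older pynvml releases return bytes from string-valued NVML queries.
    return value.decode() if isinstance(value, bytes) else value


nvml.nvmlInit()
try:
    handle = nvml.nvmlDeviceGetHandleByIndex(0)
    cuda = nvml.nvmlSystemGetCudaDriverVersion()  # e.g. 12020 for CUDA 12.2
    print(f'torch=={torch.__version__}')
    print(f'pynvml=={version("pynvml")}')
    print(f'CUDA version: {cuda // 1000}.{(cuda % 1000) // 10}')
    print(f'GPU Model: {_to_str(nvml.nvmlDeviceGetName(handle))}')
    print(f'Driver Version: {_to_str(nvml.nvmlSystemGetDriverVersion())}')
finally:
    nvml.nvmlShutdown()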