
Train proc manager restarts after Bus error crash, still consumes GPU memory, get OutOfMemoryError #1649

Open
albertz opened this issue Nov 15, 2024 · 0 comments

albertz (Member) commented Nov 15, 2024

...
ep 42 train, step 2112, ctc_4 1.526, ctc_8 1.055, ctc 0.874, consistency 0.461, aed_ce 0.307, aed_fer 0.050, grad_norm:p2 2.938, num_seqs 45, max_size:time 263912, max_size:out-spatial 105, mem_usage:cuda 64.7GB, 0.676 sec/step, elapsed 0:23:21, exp. remaining 0:30:44, complete 43.17%
ep 42 train, step 2113, ctc_4 1.497, ctc_8 0.986, ctc 0.890, consistency 0.558, aed_ce 0.310, aed_fer 0.049, grad_norm:p2 1.657, num_seqs 56, max_size:time 211896, max_size:out-spatial 107, mem_usage:cuda 64.7GB, 0.650 sec/step, elapsed 0:23:21, exp. remaining 0:30:43, complete 43.19%
ep 42 train, step 2114, ctc_4 1.520, ctc_8 1.022, ctc 0.878, consistency 0.475, aed_ce 0.300, aed_fer 0.043, grad_norm:p2 2.095, num_seqs 51, max_size:time 231520, max_size:out-spatial 101, mem_usage:cuda 64.7GB, 0.657 sec/step, elapsed 0:23:22, exp. remaining 0:30:43, complete 43.21%
Fatal Python error: Bus error

Thread 0x00007f0a47ffe700 (most recent call first):
  File "/usr/lib64/python3.12/threading.py", line 355 in wait
  File "/usr/lib64/python3.12/multiprocessing/queues.py", line 251 in _feed
  File "/usr/lib64/python3.12/threading.py", line 1012 in run
  File "/usr/lib64/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib64/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007f0a67fff700 (most recent call first):
  File "/usr/lib64/python3.12/threading.py", line 355 in wait
  File "/usr/lib64/python3.12/multiprocessing/queues.py", line 251 in _feed
  File "/usr/lib64/python3.12/threading.py", line 1012 in run
  File "/usr/lib64/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib64/python3.12/threading.py", line 1032 in _bootstrap

Thread 0x00007f12c9fff700 (most recent call first):
  <no Python frame>

Thread 0x00007f2136e6a700 (most recent call first):
  File "/usr/lib64/python3.12/threading.py", line 355 in wait
  File "/usr/lib64/python3.12/multiprocessing/queues.py", line 251 in _feed
  File "/usr/lib64/python3.12/threading.py", line 1012 in run
  File "/usr/lib64/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/usr/lib64/python3.12/threading.py", line 1032 in _bootstrap

Current thread 0x00007f282dcc3240 (most recent call first):
  File "/home/az668407/work/py-envs/py3.12-torch2.5/lib64/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 86 in clip_grad_norm_
  File "/home/az668407/work/py-envs/py3.12-torch2.5/lib64/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 30 in _no_grad_wrapper
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/updater.py", line 227 in step
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 456 in train_epoch
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 255 in train
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 543 in execute_main_task
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 741 in main
  File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/rnn.py", line 11 in <module>

Extension modules: h5py._errors, h5py.defs, h5py._objects, h5py.h5, numpy._core._multiarray_umath, numpy._core._multiarray_tests, numpy.linalg._umath_linalg, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5o, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5l, h5py._selector, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, markupsafe._speedups, _cffi_backend, psutil._psutil_linux, psutil._psutil_posix, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, PIL._imaging, kiwisolver._cext, sentencepiece._sentencepiece (total: 54)

Run ['/home/az668407/work/py-envs/py3.12-torch2.5/bin/python', '/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/rnn.py', '/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.apqrPjHDRe5M/output/returnn.config']
RETURNN runtime: 37:44:17
RETURNN return code: -7
Most recent trained model epoch: 41 file: /rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.apqrPjHDRe5M/output/models/epoch.041
Most recent trained model epoch before RETURNN run: 0
-> trained successfully 41 epoch(s)
Try again, restart RETURNN...
...
Running in managed mode.
RETURNN starting up, version 1.20241114.200746+git.cb794702, date/time 2024-11-15-03-56-02 (UTC+0100), pid 153085, cwd /rwthfs/rz/cluster/hpcwork/az668407/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.apqrPjHDRe5M/work, Python /home/az668407/work/py-envs/py3.12-torch2.5/bin/python
RETURNN command line options: ['/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.apqrPjHDRe5M/output/returnn.config']
Hostname: w23g0009.hpc.itc.rwth-aachen.de
...
Using device: cuda ('gpu' in config)
Using gpu device 0: NVIDIA H100
Total GPU 0 memory 93.0GB, free 20.9GB
...
Starting training at epoch 42, global train step 200611
start epoch 42 global train step 200611 with effective learning rate 0.00018600000000000008 ...
...
OutOfMemoryError: CUDA out of memory. Tried to allocate 290.00 MiB. GPU 0 has a total capacity of 93.00 GiB of which 134.94 MiB is free. Process 118636 has 71.62 GiB memory in use. Including non-PyTorch memory, this process has 21.24 GiB memory in use. Of the allocated memory 19.85 GiB is allocated by PyTorch, and 731.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
...

For better readability: OutOfMemoryError: CUDA out of memory. Tried to allocate 290.00 MiB. GPU 0 has a total capacity of 93.00 GiB of which 134.94 MiB is free. Process 118636 has 71.62 GiB memory in use. Including non-PyTorch memory, this process has 21.24 GiB memory in use. Of the allocated memory 19.85 GiB is allocated by PyTorch, and 731.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Note that the training crashed here with a Bus error (maybe related to #1648, i.e. some other kind of hardware issue?). RETURNN was running in managed mode (use_train_proc_manager = True) and was then restarted. Proc 118636 was the crashed run, as you can also see here:

MEMORY: proc python(118636) exited, old: rss=3.1GB pss=2.8GB uss=2.7GB shared=322.6MB
MEMORY: total (main 118636, 2024-11-15, 03:56:02, 0 procs): pss=0B uss=0B
...
MEMORY: main proc python(153085) initial: rss=108.0MB pss=81.0MB uss=74.8MB shared=33.2MB
MEMORY: sub proc python(153114) initial: rss=13.3MB pss=7.9MB uss=6.2MB shared=7.1MB
MEMORY: sub proc watch memory(153115) initial: rss=50.6MB pss=30.9MB uss=26.4MB shared=24.3MB
MEMORY: total (main 153085, 2024-11-15, 03:56:03, 3 procs): pss=119.8MB uss=107.4MB

As you can see in the OutOfMemoryError message, the crashed proc still seems to hold GPU memory: process 118636 is reported with 71.62 GiB in use, and together with the 21.24 GiB of the restarted process this accounts for essentially all of the 93 GiB total capacity, leaving only 134.94 MiB free.
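
For reference, a minimal sketch of how one could list which PIDs the driver still reports as holding GPU memory after such a restart. This uses the pynvml / nvidia-ml-py package, which is my own assumption here and not something RETURNN itself does:

    # Hypothetical check, not RETURNN code: list processes the driver reports
    # as holding GPU memory on device 0 (assumes pynvml / nvidia-ml-py is installed).
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        # usedGpuMemory can be None if the driver does not report it
        used_gib = (proc.usedGpuMemory or 0) / 1024 ** 3
        print(f"pid {proc.pid}: {used_gib:.2f} GiB GPU memory in use")
    pynvml.nvmlShutdown()

Given that the OutOfMemoryError above still attributes 71.62 GiB to PID 118636, I would expect that PID to still show up in such a listing.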

I think there is not really anything we can do about this on the RETURNN side. RETURNN only restarts once the previous proc has finished:

        print(f"Run {args}")
        proc = subprocess.Popen(args)
        proc.wait()
        print("RETURNN runtime:", util.hms(time.time() - start_time))
        print("RETURNN return code:", proc.returncode)
        return_code = proc.returncode

So, by the time we get there, the proc really should have exited. I assume the GPU memory not being released afterwards is some hardware or driver issue.
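
Just to make the point concrete (a hypothetical sketch, not existing RETURNN code, and not necessarily something we should add): after proc.wait() one could in principle poll nvidia-smi and only relaunch once the old PID no longer shows up as a compute process. But since here the proc had already exited while the memory was apparently still attributed to it, it is not clear this would help at all.

    # Hypothetical sketch, not RETURNN code: after proc.wait(), poll nvidia-smi
    # until the given PID is no longer listed as a compute process on the GPU.
    import subprocess
    import time

    def wait_until_pid_released_gpu(pid: int, timeout: float = 60.0, poll_interval: float = 2.0) -> bool:
        deadline = time.time() + timeout
        while time.time() < deadline:
            out = subprocess.run(
                ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
                capture_output=True, text=True, check=True,
            ).stdout
            pids = {int(tok) for tok in out.split() if tok.isdigit()}
            if pid not in pids:
                return True  # driver no longer lists the old proc
            time.sleep(poll_interval)
        return False  # timed out; the old proc apparently still holds GPU memory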

But anyway, just wanted to report this here.
