...
ep 42 train, step 2112, ctc_4 1.526, ctc_8 1.055, ctc 0.874, consistency 0.461, aed_ce 0.307, aed_fer 0.050, grad_norm:p2 2.938, num_seqs 45, max_size:time 263912, max_size:out-spatial 105, mem_usage:cuda 64.7GB, 0.676 sec/step, elapsed 0:23:21, exp. remaining 0:30:44, complete 43.17%
ep 42 train, step 2113, ctc_4 1.497, ctc_8 0.986, ctc 0.890, consistency 0.558, aed_ce 0.310, aed_fer 0.049, grad_norm:p2 1.657, num_seqs 56, max_size:time 211896, max_size:out-spatial 107, mem_usage:cuda 64.7GB, 0.650 sec/step, elapsed 0:23:21, exp. remaining 0:30:43, complete 43.19%
ep 42 train, step 2114, ctc_4 1.520, ctc_8 1.022, ctc 0.878, consistency 0.475, aed_ce 0.300, aed_fer 0.043, grad_norm:p2 2.095, num_seqs 51, max_size:time 231520, max_size:out-spatial 101, mem_usage:cuda 64.7GB, 0.657 sec/step, elapsed 0:23:22, exp. remaining 0:30:43, complete 43.21%
Fatal Python error: Bus error
Thread 0x00007f0a47ffe700 (most recent call first):
File "/usr/lib64/python3.12/threading.py", line 355 in wait
File "/usr/lib64/python3.12/multiprocessing/queues.py", line 251 in _feed
File "/usr/lib64/python3.12/threading.py", line 1012 in run
File "/usr/lib64/python3.12/threading.py", line 1075 in _bootstrap_inner
File "/usr/lib64/python3.12/threading.py", line 1032 in _bootstrap
Thread 0x00007f0a67fff700 (most recent call first):
File "/usr/lib64/python3.12/threading.py", line 355 in wait
File "/usr/lib64/python3.12/multiprocessing/queues.py", line 251 in _feed
File "/usr/lib64/python3.12/threading.py", line 1012 in run
File "/usr/lib64/python3.12/threading.py", line 1075 in _bootstrap_inner
File "/usr/lib64/python3.12/threading.py", line 1032 in _bootstrap
Thread 0x00007f12c9fff700 (most recent call first):
<no Python frame>
Thread 0x00007f2136e6a700 (most recent call first):
File "/usr/lib64/python3.12/threading.py", line 355 in wait
File "/usr/lib64/python3.12/multiprocessing/queues.py", line 251 in _feed
File "/usr/lib64/python3.12/threading.py", line 1012 in run
File "/usr/lib64/python3.12/threading.py", line 1075 in _bootstrap_inner
File "/usr/lib64/python3.12/threading.py", line 1032 in _bootstrap
Current thread 0x00007f282dcc3240 (most recent call first):
File "/home/az668407/work/py-envs/py3.12-torch2.5/lib64/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 86 in clip_grad_norm_
File "/home/az668407/work/py-envs/py3.12-torch2.5/lib64/python3.12/site-packages/torch/nn/utils/clip_grad.py", line 30 in _no_grad_wrapper
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/updater.py", line 227 in step
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 456 in train_epoch
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 255 in train
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 543 in execute_main_task
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 741 in main
File "/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/rnn.py", line 11 in <module>
Extension modules: h5py._errors, h5py.defs, h5py._objects, h5py.h5, numpy._core._multiarray_umath, numpy._core._multiarray_tests, numpy.linalg._umath_linalg, h5py.utils, h5py.h5t, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5r, h5py._proxy, h5py._conv, h5py.h5z, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5o, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5l, h5py._selector, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, markupsafe._speedups, _cffi_backend, psutil._psutil_linux, psutil._psutil_posix, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, PIL._imaging, kiwisolver._cext, sentencepiece._sentencepiece (total: 54)
Run ['/home/az668407/work/py-envs/py3.12-torch2.5/bin/python', '/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/tools/returnn/rnn.py', '/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.apqrPjHDRe5M/output/returnn.config']
RETURNN runtime: 37:44:17
RETURNN return code: -7
Most recent trained model epoch: 41 file: /rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.apqrPjHDRe5M/output/models/epoch.041
Most recent trained model epoch before RETURNN run: 0
-> trained successfully 41 epoch(s)
Try again, restart RETURNN...
...
Running in managed mode.
RETURNN starting up, version 1.20241114.200746+git.cb794702, date/time 2024-11-15-03-56-02 (UTC+0100), pid 153085, cwd /rwthfs/rz/cluster/hpcwork/az668407/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.apqrPjHDRe5M/work, Python /home/az668407/work/py-envs/py3.12-torch2.5/bin/python
RETURNN command line options: ['/rwthfs/rz/cluster/home/az668407/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.apqrPjHDRe5M/output/returnn.config']
Hostname: w23g0009.hpc.itc.rwth-aachen.de
...
Using device: cuda ('gpu' in config)
Using gpu device 0: NVIDIA H100
Total GPU 0 memory 93.0GB, free 20.9GB
...
Starting training at epoch 42, global train step 200611
start epoch 42 global train step 200611 with effective learning rate 0.00018600000000000008 ...
...
OutOfMemoryError: CUDA out of memory. Tried to allocate 290.00 MiB. GPU 0 has a total capacity of 93.00 GiB of which 134.94 MiB is free. Process 118636 has 71.62 GiB memory in use. Including non-PyTorch memory, this process has 21.24 GiB memory in use. Of the allocated memory 19.85 GiB is allocated by PyTorch, and 731.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
...
For better readability:
OutOfMemoryError: CUDA out of memory. Tried to allocate 290.00 MiB. GPU 0 has a total capacity of 93.00 GiB of which 134.94 MiB is free. Process 118636 has 71.62 GiB memory in use. Including non-PyTorch memory, this process has 21.24 GiB memory in use. Of the allocated memory 19.85 GiB is allocated by PyTorch, and 731.17 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
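As an aside on the PYTORCH_CUDA_ALLOC_CONF hint in that message: it would probably not help here, since the memory is held by the stale crashed process, but if one wanted to try it, the env var just needs to be set before the first CUDA allocation (e.g. at the top of the RETURNN config or in the job environment). A minimal sketch:

```python
# Sketch: how the allocator hint from the OOM message above could be tried.
# PYTORCH_CUDA_ALLOC_CONF must be set before the CUDA caching allocator is
# initialized, i.e. before the first CUDA tensor is allocated.
import os

# expandable_segments:True lets the caching allocator grow segments instead of
# reserving fixed-size blocks, which can reduce fragmentation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
```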
Note, the training crashed here with a Bus error (#1648 maybe related? some other sort of hardware issue). RETURNN was in managed mode (use_train_proc_manager = True) and was then restarted. The 118636 proc was the crashed run: as you can see in the OutOfMemoryError message above, that crashed proc still seems to consume some memory. I think there is not really anything we can do here. RETURNN only restarts once the other proc has finished.
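To illustrate that point (this is only a minimal sketch of the described behavior, not the actual RETURNN train proc manager code; the function name and restart limit are made up), the manager waits on the training subprocess before it restarts anything:

```python
# Minimal sketch (hypothetical, not the actual RETURNN code) of managed mode:
# the manager waits on the training subprocess and only then restarts it.
import subprocess
import sys


def run_managed_training(cmd: list[str], max_restarts: int = 10) -> int:
    returncode = -1
    for _ in range(max_restarts + 1):
        proc = subprocess.Popen(cmd)
        returncode = proc.wait()  # blocks until the child proc has really exited
        if returncode == 0:
            return 0  # training finished normally
        # e.g. -7 means the child was killed by signal 7 (SIGBUS, bus error),
        # matching "RETURNN return code: -7" in the log above
        print(f"Training crashed with return code {returncode}, restarting...", file=sys.stderr)
    return returncode
```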
So, when we get there, the proc really should have finished. I assume this is some hardware issue.
But anyway, just wanted to report this here.