dorado 0.8.1 crashes when calling modifications #1070
Comments
The error message is not too different from #1069. |
It seems to run when basecalling individual files, so this could be an issue where dorado tries to load too much data at once. |
Celebrated too early. It also crashed on individual files. Error below: Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first): |
Tried to cut it down to cuda:0 to see whether that will run to completion. |
Thanks for the info @Kirk3gaard, we're looking into this. |
Even on one file it also occasionally fails. Tried to go back and use dorado v. 0.7.3 with 2x GPUs. This also failed. So might not be a new issue. Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first): Would you recommend lowering chunk size and batch size manually? Or how do I get some basecalling to actually finish? |
It looks like the auto batchsize is not quite right when using mods - please try reducing it manually from 128 to 96. |
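A minimal sketch of that manual override, assuming the run goes through `dorado basecaller` and its `--batchsize` option. The model (`sup`), modification name (`5mCG_5hmCG`), input directory, and output file are placeholders, not values taken from this report. The command is only printed unless `RUN=1` is set.

```shell
#!/bin/sh
set -eu

# Placeholders (assumptions): substitute your own model, mods, and paths.
MODEL="sup"
MODS="5mCG_5hmCG"
POD5_DIR="pod5/"
BATCH_SIZE=96   # reduced manually from the auto-selected 128

# Build the command as positional parameters so quoting is preserved.
set -- dorado basecaller "$MODEL" "$POD5_DIR" \
    --modified-bases "$MODS" \
    --batchsize "$BATCH_SIZE"

# Dry run by default: print the command, execute only when RUN=1.
echo "$*"
if [ "${RUN:-0}" = "1" ]; then
    "$@" > calls.bam
fi
```

If the crash persists at 96, the same script can be rerun with a smaller `BATCH_SIZE` to see whether the failure depends on batch size at all.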
Tried that with 2xRTX4090s. Still crashed. Do you think we have two different issues here? 1. MultiGPU and 2. auto batch size not working for mods. I will try to run batch size 96 with one GPU. Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first): |
Thanks, CUDA, great error message. @Kirk3gaard, can you help pinpoint this by confirming the following? Do you see this issue with... |
Will check that. Single device failing on a single file: Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first): |
Single mod model single device single file: Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first): |
Issue Report
Please describe the issue:
dorado 0.8.1 crashes when basecalling with modifications; the error message is below.
Steps to reproduce the issue:
Please list any steps to reproduce the issue.
Run environment:
Logs
[2024-10-07 12:44:50.602] [info] cuda:0 using chunk size 12288, batch size 128
[2024-10-07 12:44:50.602] [info] cuda:1 using chunk size 12288, batch size 224
[2024-10-07 12:44:50.968] [info] cuda:0 using chunk size 6144, batch size 128
[2024-10-07 12:44:51.260] [info] cuda:1 using chunk size 6144, batch size 224
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7bfd194389b7 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7bfd129bd115 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7bfd19402958 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x897b516 (0x7bfd1737b516 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x7bfd19415de2 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: /data/software/dorado-0.8.1-linux-x64/bin/dorado() [0xabc69e]
frame #6: <unknown function> + 0x1196e380 (0x7bfd2036e380 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0x9ca94 (0x7bfd0cc9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x129c3c (0x7bfd0cd29c3c in /lib/x86_64-linux-gnu/libc.so.6)
dorado-WS5_mods.sh: line 22: 11040 Aborted
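The log suggests passing `CUDA_LAUNCH_BLOCKING=1`. That setting makes kernel launches synchronous, so the reported stack trace should point closer to the failing call. A sketch of such a debug rerun, assuming a single GPU via `--device cuda:0`. The model, modification name, paths, and log file name are placeholders and do not come from this report. The command is only printed unless `RUN=1` is set.

```shell
#!/bin/sh
set -eu

# Synchronous kernel launches: slower, but errors surface at the failing call.
export CUDA_LAUNCH_BLOCKING=1

# Placeholders (assumptions): model, mods, and paths are illustrative only.
set -- dorado basecaller sup pod5/ \
    --modified-bases 5mCG_5hmCG \
    --device cuda:0 \
    --batchsize 96

# Dry run by default: print the command, execute only when RUN=1.
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING $*"
if [ "${RUN:-0}" = "1" ]; then
    # Keep stderr in a separate file so the stack trace is easy to attach.
    "$@" > calls.bam 2> dorado_debug.log
fi
```

Restricting the run to one device also helps separate a possible multi-GPU problem from a batch-size problem.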