
dorado 0.8.1 crashes when calling modifications #1070

Closed
Kirk3gaard opened this issue Oct 8, 2024 · 11 comments

Comments

@Kirk3gaard

Kirk3gaard commented Oct 8, 2024

Issue Report

Please describe the issue:

dorado 0.8.1 crashes when basecalling with modifications; error message below.

Steps to reproduce the issue:

Please list any steps to reproduce the issue.

Run environment:

  • Dorado version: 0.8.1
  • Dorado command: dorado basecaller --device "cuda:all" sup /data/zymo_fecal/pod5/ --modified-bases "4mC_5mC 6mA" > PAW77640.dorado0.8.1.bm5.0.0.sup.mod4mC_5mC_6mA.bam
  • Operating system: Ubuntu 24.04.1 LTS - CUDA 12.4
  • Hardware (CPUs, Memory, GPUs): 24 CPUs, 64 GB RAM, 2x RTX4090
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
  • Source data location (on device or networked drive - NFS, etc.): on device
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): FLO-PRO114M, LSK114, N50 ~8 kbp, "a lot" of reads, 170 Gbp

Logs

[2024-10-07 12:44:50.602] [info] cuda:0 using chunk size 12288, batch size 128
[2024-10-07 12:44:50.602] [info] cuda:1 using chunk size 12288, batch size 224
[2024-10-07 12:44:50.968] [info] cuda:0 using chunk size 6144, batch size 128
[2024-10-07 12:44:51.260] [info] cuda:1 using chunk size 6144, batch size 224
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7bfd194389b7 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7bfd129bd115 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7bfd19402958 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0x897b516 (0x7bfd1737b516 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x7bfd19415de2 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: /data/software/dorado-0.8.1-linux-x64/bin/dorado() [0xabc69e]
frame #6: + 0x1196e380 (0x7bfd2036e380 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: + 0x9ca94 (0x7bfd0cc9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c3c (0x7bfd0cd29c3c in /lib/x86_64-linux-gnu/libc.so.6)

dorado-WS5_mods.sh: line 22: 11040 Aborted

@Kirk3gaard
Author

The error message is not too different from #1069.

@Kirk3gaard
Author

It seems to run when basecalling individual files, so it could be an issue where dorado tries to load too much data at once.
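
For reference, this is roughly the loop I'm using to test individual files (a simplified sketch, not the exact script; paths and output names here are placeholders):

```bash
#!/usr/bin/env bash
# Simplified sketch: basecall each pod5 file on its own instead of the whole directory.
# The staging directory and output naming are placeholders, not the exact script.
set -euo pipefail

for f in /data/zymo_fecal/pod5/*.pod5; do
    rm -rf current_file && mkdir current_file
    cp "$f" current_file/
    dorado basecaller --device "cuda:all" sup current_file/ \
        --modified-bases 4mC_5mC 6mA \
        > "$(basename "$f" .pod5).sup.mods.bam"
done
```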

@Kirk3gaard
Author

Celebrated too early. It also crashed on individual files. Error below:
[2024-10-08 12:43:48.956] [info] Running: "basecaller" "--device" "cuda:all" "sup" "current_file/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-08 12:43:48.981] [info] - downloading [email protected] with httplib
[2024-10-08 12:43:51.228] [info] - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-08 12:43:51.494] [info] - downloading [email protected]_6mA@v2 with httplib
[2024-10-08 12:43:51.746] [info] > Creating basecall pipeline
[2024-10-08 12:43:53.011] [warning] Unable to find chunk benchmarks for GPU "NVIDIA GeForce RTX 4090", model /data/zymo_fecal/.temp_dorado_model-19e8e39af2364333/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-08 12:43:53.058] [warning] Unable to find chunk benchmarks for GPU "NVIDIA GeForce RTX 4090", model /data/zymo_fecal/.temp_dorado_model-19e8e39af2364333/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-08 12:43:57.779] [info] cuda:0 using chunk size 12288, batch size 128
[2024-10-08 12:43:57.779] [info] cuda:1 using chunk size 12288, batch size 128
[2024-10-08 12:43:58.147] [info] cuda:0 using chunk size 6144, batch size 128
[2024-10-08 12:43:58.158] [info] cuda:1 using chunk size 6144, batch size 128
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7aaf4e0389b7 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7aaf475bd115 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7aaf4e002958 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0x897b516 (0x7aaf4bf7b516 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x7aaf4e015de2 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: dorado() [0xabc69e]
frame #6: + 0x1196e380 (0x7aaf54f6e380 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: + 0x9ca94 (0x7aaf4189ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c3c (0x7aaf41929c3c in /lib/x86_64-linux-gnu/libc.so.6)

@Kirk3gaard
Author

Tried to cut it down to cuda:0 to see whether that will run to completion.

@HalfPhoton
Collaborator

Thanks for the info @Kirk3gaard, we're looking into this.

@Kirk3gaard
Author

Even on a single file it occasionally fails.
[2024-10-09 20:42:22.044] [info] Running: "basecaller" "--device" "cuda:0" "sup" "current_file/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-09 20:42:22.084] [info] - downloading [email protected] with httplib
[2024-10-09 20:42:24.269] [info] - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-09 20:42:24.605] [info] - downloading [email protected]_6mA@v2 with httplib
[2024-10-09 20:42:24.980] [info] > Creating basecall pipeline
[2024-10-09 20:42:25.782] [warning] Unable to find chunk benchmarks for GPU "NVIDIA GeForce RTX 4090", model /data/zymo_fecal/.temp_dorado_model-589d77c250c4abb2/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-09 20:42:30.319] [info] cuda:0 using chunk size 12288, batch size 128
[2024-10-09 20:42:30.689] [info] cuda:0 using chunk size 6144, batch size 128
loop_dorado-WS5_mods.sh: line 30: 57480 Segmentation fault

Tried going back to dorado v0.7.3 with 2x GPUs. This also failed, so it might not be a new issue.
[2024-10-10 11:30:54.591] [info] Running: "basecaller" "--device" "cuda:all" "sup" "/data/zymo_fecal/pod5" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-10 11:30:54.716] [info] - downloading [email protected] with httplib
[2024-10-10 11:30:56.858] [info] - downloading [email protected]_4mC_5mC@v1 with httplib
[2024-10-10 11:30:57.360] [info] - downloading [email protected]_6mA@v1 with httplib
[2024-10-10 11:30:57.903] [info] > Creating basecall pipeline
[2024-10-10 11:31:03.726] [info] cuda:0 using chunk size 12288, batch size 128
[2024-10-10 11:31:03.726] [info] cuda:1 using chunk size 12288, batch size 224
[2024-10-10 11:31:04.093] [info] cuda:0 using chunk size 6144, batch size 128
[2024-10-10 11:31:04.387] [info] cuda:1 using chunk size 6144, batch size 224
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7215aec389b7 in /data/software/dorado-0.7.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7215a81bd115 in /data/software/dorado-0.7.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7215aec02958 in /data/software/dorado-0.7.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0x897b516 (0x7215acb7b516 in /data/software/dorado-0.7.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x7215aec15de2 in /data/software/dorado-0.7.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: /data/software/dorado-0.7.3-linux-x64/bin/dorado() [0xa66f24]
frame #6: + 0x1196e380 (0x7215b5b6e380 in /data/software/dorado-0.7.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: + 0x9ca94 (0x7215a249ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c3c (0x7215a2529c3c in /lib/x86_64-linux-gnu/libc.so.6)

Would you recommend lowering the chunk size and batch size manually? Or how else can I get basecalling to actually finish?

@HalfPhoton
Collaborator

It looks like the automatic batch size selection is not quite right when using mods - please try reducing it manually from 128 to 96.
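
For example, reusing the paths from your original command (the output filename is just a placeholder):

```bash
dorado basecaller --device "cuda:all" --batchsize 96 sup /data/zymo_fecal/pod5/ \
    --modified-bases 4mC_5mC 6mA > PAW77640.sup.mods.batch96.bam
```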

@Kirk3gaard
Author

Tried that with 2x RTX 4090s; it still crashed. Do you think we have two different issues here? 1. multi-GPU, and 2. auto batch size not working for mods? I will try to run batch size 96 with one GPU.
Error report:
[2024-10-10 15:26:21.739] [info] Running: "basecaller" "--batchsize" "96" "--device" "cuda:all" "sup" "current_file/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-10 15:26:21.767] [info] - downloading [email protected] with httplib
[2024-10-10 15:26:24.163] [info] - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-10 15:26:24.419] [info] - downloading [email protected]_6mA@v2 with httplib
[2024-10-10 15:26:24.676] [info] > Creating basecall pipeline
[2024-10-10 15:26:25.996] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-10 15:26:26.038] [info] cuda:1 using chunk size 12288, batch size 96
[2024-10-10 15:26:26.323] [info] cuda:0 using chunk size 6144, batch size 96
[2024-10-10 15:26:26.344] [info] cuda:1 using chunk size 6144, batch size 96
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x761b4c2389b7 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x761b457bd115 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x761b4c202958 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0x897b516 (0x761b4a17b516 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x761b4c215de2 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: /data/software/dorado-0.8.1-linux-x64/bin/dorado() [0xabc69e]
frame #6: + 0x1196e380 (0x761b5316e380 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: + 0x9ca94 (0x761b3fa9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c3c (0x761b3fb29c3c in /lib/x86_64-linux-gnu/libc.so.6)

@iiSeymour
Member

Thanks, CUDA, great error message: CUDA error: unspecified launch failure 😁

@Kirk3gaard can you help pinpoint this by confirming the following. Do you see this issue with... (rough example commands are sketched after this list)

  • just canonical calling?
  • using a single device (-x cuda:0)?
  • using a single mod model at a time (either 4mC_5mC or 6mA)?
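
Something along these lines would cover each case separately (input path taken from your earlier command; output names are placeholders):

```bash
# 1. Canonical calling only (no mod models), both GPUs
dorado basecaller --device "cuda:all" sup /data/zymo_fecal/pod5/ > canonical.bam

# 2. Mods on a single device
dorado basecaller -x cuda:0 sup /data/zymo_fecal/pod5/ --modified-bases 4mC_5mC 6mA > mods.cuda0.bam

# 3. One mod model at a time
dorado basecaller --device "cuda:all" sup /data/zymo_fecal/pod5/ --modified-bases 4mC_5mC > mods.4mC_5mC.bam
dorado basecaller --device "cuda:all" sup /data/zymo_fecal/pod5/ --modified-bases 6mA > mods.6mA.bam
```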

@Kirk3gaard
Author

Will check that.

Single device failing on a single file:
[2024-10-11 14:15:48.060] [info] Running: "basecaller" "--batchsize" "96" "--device" "cuda:0" "sup" "current_file/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-11 14:15:48.089] [info] - downloading [email protected] with httplib
[2024-10-11 14:15:51.624] [info] - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-11 14:15:52.239] [info] - downloading [email protected]_6mA@v2 with httplib
[2024-10-11 14:15:53.657] [info] > Creating basecall pipeline
[2024-10-11 14:15:54.559] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-11 14:15:54.872] [info] cuda:0 using chunk size 6144, batch size 96
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5fea8389b7 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f5fe3dbd115 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5fea802958 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0x897b516 (0x7f5fe877b516 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x7f5fea815de2 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: /data/software/dorado-0.8.1-linux-x64/bin/dorado() [0xabc69e]
frame #6: + 0x1196e380 (0x7f5ff176e380 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: + 0x9ca94 (0x7f5fde09ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c3c (0x7f5fde129c3c in /lib/x86_64-linux-gnu/libc.so.6)

@Kirk3gaard
Author

Single mod model, single device, single file:
[2024-10-21 07:56:35.049] [info] Running: "basecaller" "--batchsize" "96" "--device" "cuda:0" "sup" "current_file/" "--modified-bases" "4mC_5mC"
[2024-10-21 07:56:35.081] [info] - downloading [email protected] with httplib
[2024-10-21 07:56:38.745] [info] - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-21 07:56:39.341] [info] > Creating basecall pipeline
[2024-10-21 07:56:40.081] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-21 07:56:40.392] [info] cuda:0 using chunk size 6144, batch size 96
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x75c86b4389b7 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x75c8649bd115 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x75c86b402958 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0x897b516 (0x75c86937b516 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x75c86b415de2 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: /data/software/dorado-0.8.1-linux-x64/bin/dorado() [0xabc69e]
frame #6: + 0x1196e380 (0x75c87236e380 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: + 0x9ca94 (0x75c85ec9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c3c (0x75c85ed29c3c in /lib/x86_64-linux-gnu/libc.so.6)
