
dorado 0.8.1 crashes when calling modifications #1070

Closed
Kirk3gaard opened this issue Oct 8, 2024 · 11 comments

Comments

@Kirk3gaard

Kirk3gaard commented Oct 8, 2024

Issue Report

Please describe the issue:

dorado 0.8.1 crashes when basecalling with modifications; error message below.

Steps to reproduce the issue:

Please list any steps to reproduce the issue.

Run environment:

  • Dorado version: 0.8.1
  • Dorado command: dorado basecaller --device "cuda:all" sup /data/zymo_fecal/pod5/ --modified-bases "4mC_5mC 6mA" > PAW77640.dorado0.8.1.bm5.0.0.sup.mod4mC_5mC_6mA.bam
  • Operating system: Ubuntu 24.04.1 LTS - CUDA 12.4
  • Hardware (CPUs, Memory, GPUs): 24 CPUs, 64 GB RAM, 2x RTX4090
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
  • Source data location (on device or networked drive - NFS, etc.): on device
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): FLO-PRO114M, LSK114, N50 ~8 kbp, "a lot" of reads, 170 Gbp

Logs

[2024-10-07 12:44:50.602] [info] cuda:0 using chunk size 12288, batch size 128
[2024-10-07 12:44:50.602] [info] cuda:1 using chunk size 12288, batch size 224
[2024-10-07 12:44:50.968] [info] cuda:0 using chunk size 6144, batch size 128
[2024-10-07 12:44:51.260] [info] cuda:1 using chunk size 6144, batch size 224
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7bfd194389b7 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7bfd129bd115 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7bfd19402958 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0x897b516 (0x7bfd1737b516 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x7bfd19415de2 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: /data/software/dorado-0.8.1-linux-x64/bin/dorado() [0xabc69e]
frame #6: + 0x1196e380 (0x7bfd2036e380 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: + 0x9ca94 (0x7bfd0cc9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c3c (0x7bfd0cd29c3c in /lib/x86_64-linux-gnu/libc.so.6)

dorado-WS5_mods.sh: line 22: 11040 Aborted

@Kirk3gaard
Author

The error message is not too different from #1069.

@Kirk3gaard
Author

It seems to run when basecalling individual files, so it could be an issue where dorado tries to load too much data at once.
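
For reference, this is roughly the loop I'm using to test individual files (a simplified sketch, not the exact script; paths and output names here are placeholders):

```bash
#!/usr/bin/env bash
# Simplified sketch: basecall each pod5 file on its own instead of the whole directory.
# The staging directory and output naming are placeholders, not the exact script.
set -euo pipefail

for f in /data/zymo_fecal/pod5/*.pod5; do
    rm -rf current_file && mkdir current_file
    cp "$f" current_file/
    dorado basecaller --device "cuda:all" sup current_file/ \
        --modified-bases 4mC_5mC 6mA \
        > "$(basename "$f" .pod5).sup.mods.bam"
done
```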

@Kirk3gaard
Author

Celebrated too early. It also crashed on individual files. Error below:
[2024-10-08 12:43:48.956] [info] Running: "basecaller" "--device" "cuda:all" "sup" "current_file/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-08 12:43:48.981] [info] - downloading [email protected] with httplib
[2024-10-08 12:43:51.228] [info] - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-08 12:43:51.494] [info] - downloading [email protected]_6mA@v2 with httplib
[2024-10-08 12:43:51.746] [info] > Creating basecall pipeline
[2024-10-08 12:43:53.011] [warning] Unable to find chunk benchmarks for GPU "NVIDIA GeForce RTX 4090", model /data/zymo_fecal/.temp_dorado_model-19e8e39af2364333/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-08 12:43:53.058] [warning] Unable to find chunk benchmarks for GPU "NVIDIA GeForce RTX 4090", model /data/zymo_fecal/.temp_dorado_model-19e8e39af2364333/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-08 12:43:57.779] [info] cuda:0 using chunk size 12288, batch size 128
[2024-10-08 12:43:57.779] [info] cuda:1 using chunk size 12288, batch size 128
[2024-10-08 12:43:58.147] [info] cuda:0 using chunk size 6144, batch size 128
[2024-10-08 12:43:58.158] [info] cuda:1 using chunk size 6144, batch size 128
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7aaf4e0389b7 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7aaf475bd115 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7aaf4e002958 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0x897b516 (0x7aaf4bf7b516 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x7aaf4e015de2 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: dorado() [0xabc69e]
frame #6: + 0x1196e380 (0x7aaf54f6e380 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: + 0x9ca94 (0x7aaf4189ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c3c (0x7aaf41929c3c in /lib/x86_64-linux-gnu/libc.so.6)

@Kirk3gaard
Author

Tried to cut it down to cuda:0 to see whether that will run to completion.

@HalfPhoton
Collaborator

Thanks for the info @Kirk3gaard, we're looking into this.

@Kirk3gaard
Author

Even on a single file it occasionally fails.
[2024-10-09 20:42:22.044] [info] Running: "basecaller" "--device" "cuda:0" "sup" "current_file/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-09 20:42:22.084] [info] - downloading [email protected] with httplib
[2024-10-09 20:42:24.269] [info] - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-09 20:42:24.605] [info] - downloading [email protected]_6mA@v2 with httplib
[2024-10-09 20:42:24.980] [info] > Creating basecall pipeline
[2024-10-09 20:42:25.782] [warning] Unable to find chunk benchmarks for GPU "NVIDIA GeForce RTX 4090", model /data/zymo_fecal/.temp_dorado_model-589d77c250c4abb2/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-09 20:42:30.319] [info] cuda:0 using chunk size 12288, batch size 128
[2024-10-09 20:42:30.689] [info] cuda:0 using chunk size 6144, batch size 128
loop_dorado-WS5_mods.sh: line 30: 57480 Segmentation fault

Tried going back to dorado v0.7.3 with 2x GPUs. This also failed, so it might not be a new issue.
[2024-10-10 11:30:54.591] [info] Running: "basecaller" "--device" "cuda:all" "sup" "/data/zymo_fecal/pod5" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-10 11:30:54.716] [info] - downloading [email protected] with httplib
[2024-10-10 11:30:56.858] [info] - downloading [email protected]_4mC_5mC@v1 with httplib
[2024-10-10 11:30:57.360] [info] - downloading [email protected]_6mA@v1 with httplib
[2024-10-10 11:30:57.903] [info] > Creating basecall pipeline
[2024-10-10 11:31:03.726] [info] cuda:0 using chunk size 12288, batch size 128
[2024-10-10 11:31:03.726] [info] cuda:1 using chunk size 12288, batch size 224
[2024-10-10 11:31:04.093] [info] cuda:0 using chunk size 6144, batch size 128
[2024-10-10 11:31:04.387] [info] cuda:1 using chunk size 6144, batch size 224
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7215aec389b7 in /data/software/dorado-0.7.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7215a81bd115 in /data/software/dorado-0.7.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7215aec02958 in /data/software/dorado-0.7.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0x897b516 (0x7215acb7b516 in /data/software/dorado-0.7.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x7215aec15de2 in /data/software/dorado-0.7.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: /data/software/dorado-0.7.3-linux-x64/bin/dorado() [0xa66f24]
frame #6: + 0x1196e380 (0x7215b5b6e380 in /data/software/dorado-0.7.3-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: + 0x9ca94 (0x7215a249ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c3c (0x7215a2529c3c in /lib/x86_64-linux-gnu/libc.so.6)

Would you recommend lowering the chunk size and batch size manually? Or how else can I get basecalling to actually finish?

@HalfPhoton
Collaborator

It looks like the automatic batch size selection is not quite right when using mods - please try reducing it manually from 128 to 96.
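
For example, reusing the paths from your original command (the output filename is just a placeholder):

```bash
dorado basecaller --device "cuda:all" --batchsize 96 sup /data/zymo_fecal/pod5/ \
    --modified-bases 4mC_5mC 6mA > PAW77640.sup.mods.batch96.bam
```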

@Kirk3gaard
Author

Tried that with 2x RTX 4090s; it still crashed. Do you think we have two different issues here? 1. multi-GPU, and 2. auto batch size not working for mods? I will try to run batch size 96 with one GPU.
Error report:
[2024-10-10 15:26:21.739] [info] Running: "basecaller" "--batchsize" "96" "--device" "cuda:all" "sup" "current_file/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-10 15:26:21.767] [info] - downloading [email protected] with httplib
[2024-10-10 15:26:24.163] [info] - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-10 15:26:24.419] [info] - downloading [email protected]_6mA@v2 with httplib
[2024-10-10 15:26:24.676] [info] > Creating basecall pipeline
[2024-10-10 15:26:25.996] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-10 15:26:26.038] [info] cuda:1 using chunk size 12288, batch size 96
[2024-10-10 15:26:26.323] [info] cuda:0 using chunk size 6144, batch size 96
[2024-10-10 15:26:26.344] [info] cuda:1 using chunk size 6144, batch size 96
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x761b4c2389b7 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x761b457bd115 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x761b4c202958 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0x897b516 (0x761b4a17b516 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x761b4c215de2 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: /data/software/dorado-0.8.1-linux-x64/bin/dorado() [0xabc69e]
frame #6: + 0x1196e380 (0x761b5316e380 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: + 0x9ca94 (0x761b3fa9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c3c (0x761b3fb29c3c in /lib/x86_64-linux-gnu/libc.so.6)

@iiSeymour
Member

Thanks, CUDA, great error message: CUDA error: unspecified launch failure 😁

@Kirk3gaard can you help pinpoint this by confirming the following. Do you see this issue with... (rough example commands are sketched after this list)

  • just canonical calling?
  • using a single device (-x cuda:0)?
  • using a single mod model at a time (either 4mC_5mC or 6mA)?
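
Something along these lines would cover each case separately (input path taken from your earlier command; output names are placeholders):

```bash
# 1. Canonical calling only (no mod models), both GPUs
dorado basecaller --device "cuda:all" sup /data/zymo_fecal/pod5/ > canonical.bam

# 2. Mods on a single device
dorado basecaller -x cuda:0 sup /data/zymo_fecal/pod5/ --modified-bases 4mC_5mC 6mA > mods.cuda0.bam

# 3. One mod model at a time
dorado basecaller --device "cuda:all" sup /data/zymo_fecal/pod5/ --modified-bases 4mC_5mC > mods.4mC_5mC.bam
dorado basecaller --device "cuda:all" sup /data/zymo_fecal/pod5/ --modified-bases 6mA > mods.6mA.bam
```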

@Kirk3gaard
Author

Will check that.

Single device failing on a single file:
[2024-10-11 14:15:48.060] [info] Running: "basecaller" "--batchsize" "96" "--device" "cuda:0" "sup" "current_file/" "--modified-bases" "4mC_5mC" "6mA"
[2024-10-11 14:15:48.089] [info] - downloading [email protected] with httplib
[2024-10-11 14:15:51.624] [info] - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-11 14:15:52.239] [info] - downloading [email protected]_6mA@v2 with httplib
[2024-10-11 14:15:53.657] [info] > Creating basecall pipeline
[2024-10-11 14:15:54.559] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-11 14:15:54.872] [info] cuda:0 using chunk size 6144, batch size 96
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5fea8389b7 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f5fe3dbd115 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f5fea802958 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0x897b516 (0x7f5fe877b516 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x7f5fea815de2 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: /data/software/dorado-0.8.1-linux-x64/bin/dorado() [0xabc69e]
frame #6: + 0x1196e380 (0x7f5ff176e380 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: + 0x9ca94 (0x7f5fde09ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c3c (0x7f5fde129c3c in /lib/x86_64-linux-gnu/libc.so.6)

@Kirk3gaard
Author

Single mod model, single device, single file:
[2024-10-21 07:56:35.049] [info] Running: "basecaller" "--batchsize" "96" "--device" "cuda:0" "sup" "current_file/" "--modified-bases" "4mC_5mC"
[2024-10-21 07:56:35.081] [info] - downloading [email protected] with httplib
[2024-10-21 07:56:38.745] [info] - downloading [email protected]_4mC_5mC@v2 with httplib
[2024-10-21 07:56:39.341] [info] > Creating basecall pipeline
[2024-10-21 07:56:40.081] [info] cuda:0 using chunk size 12288, batch size 96
[2024-10-21 07:56:40.392] [info] cuda:0 using chunk size 6144, batch size 96
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/pyold/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x75c86b4389b7 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x75c8649bd115 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x75c86b402958 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #3: + 0x897b516 (0x75c86937b516 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #4: c10::Stream::synchronize() const + 0x82 (0x75c86b415de2 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #5: /data/software/dorado-0.8.1-linux-x64/bin/dorado() [0xabc69e]
frame #6: + 0x1196e380 (0x75c87236e380 in /data/software/dorado-0.8.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #7: + 0x9ca94 (0x75c85ec9ca94 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c3c (0x75c85ed29c3c in /lib/x86_64-linux-gnu/libc.so.6)
