dorado 0.8.1 basecaller crashes when basecalling POD5 from two runs together #1069

Open
husamia opened this issue Oct 7, 2024 · 2 comments

husamia commented Oct 7, 2024

I encountered a bug in dorado that causes it to crash.

When basecalling with multiple models and passing POD5 data files from different runs together, the program crashes every time.

I ran dorado basecaller sup,5mCG_5hmCG,6mA with --recursive pointed at a folder that contains POD5 files from two different runs.
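
For reference, the full invocation (as echoed in the log below) was:

dorado basecaller sup,5mCG_5hmCG,6mA --recursive \
    --models-directory /data/lgg_data/R_D/nanopore/script/ \
    --reference /data/lgg_data/R_D/nanopore/reference/minimap2/hg38.no_alt.fa.gz \
    /data/lgg_data/R_D/nanopore/raw/HM24385/ \
    > /data/lgg_data/R_D/nanopore/progress/WGS_HG002/HG002_UL_NA24385_sup_5mCG_5hmCG_6mA_hg38.no_alt_merged.bam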

Run environment:

  • Dorado version: 0.8.1
  • Dorado command: dorado basecaller sup,5mCG_5hmCG,6mA
  • Operating system: Linux
  • Hardware: 4x NVIDIA A100
  • Source data type: POD5

Logs

[2024-10-07 13:07:57.736] [info] Running: "basecaller" "sup,5mCG_5hmCG,6mA" "--recursive" "--models-directory" "/data/lgg_data/R_D/nanopore/script/" "--reference" "/data/lgg_data/R_D/nanopore/reference/minimap2/hg38.no_alt.fa.gz" "/data/lgg_data/R_D/nanopore/raw/HM24385/"
[2024-10-07 13:09:11.181] [info] > Creating basecall pipeline
[2024-10-07 13:09:22.899] [warning] Unable to find chunk benchmarks for GPU "NVIDIA A100-SXM4-40GB", model /data/lgg_data/R_D/nanopore/script/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-07 13:09:22.899] [warning] Unable to find chunk benchmarks for GPU "NVIDIA A100-SXM4-40GB", model /data/lgg_data/R_D/nanopore/script/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-07 13:09:22.899] [warning] Unable to find chunk benchmarks for GPU "NVIDIA A100-SXM4-40GB", model /data/lgg_data/R_D/nanopore/script/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-07 13:09:22.899] [warning] Unable to find chunk benchmarks for GPU "NVIDIA A100-SXM4-40GB", model /data/lgg_data/R_D/nanopore/script/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-07 13:09:27.368] [info] cuda:3 using chunk size 12288, batch size 416
[2024-10-07 13:09:27.368] [info] cuda:1 using chunk size 12288, batch size 384
[2024-10-07 13:09:27.368] [info] cuda:2 using chunk size 12288, batch size 384
[2024-10-07 13:09:27.368] [info] cuda:0 using chunk size 12288, batch size 416
[2024-10-07 13:09:27.722] [info] cuda:3 using chunk size 6144, batch size 960
[2024-10-07 13:09:27.723] [info] cuda:1 using chunk size 6144, batch size 960
[2024-10-07 13:09:27.724] [info] cuda:0 using chunk size 6144, batch size 960
[2024-10-07 13:09:27.725] [info] cuda:2 using chunk size 6144, batch size 960
terminate called after throwing an instance of 'c10::CuDNNError'
  what():  cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Exception raised from createCuDNNHandle at /pytorch/pyold/aten/src/ATen/cudnn/Handle.cpp:9 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x14b4ecd599b7 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #1: <unknown function> + 0x3f285b4 (0x14b4e62495b4 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #2: at::native::getCudnnHandle() + 0x725 (0x14b4ead451c5 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x89bf0f6 (0x14b4eace00f6 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0x89c00db (0x14b4eace10db in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0x89a54ca (0x14b4eacc64ca in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #6: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x96 (0x14b4eacc6b16 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0xa632127 (0x14b4ec953127 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0xa6321e0 (0x14b4ec9531e0 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #9: at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x23d (0x14b4e7848e9d in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #10: at::native::_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) + 0x1505 (0x14b4e6c07cf5 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #11: <unknown function> + 0x58a7496 (0x14b4e7bc8496 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #12: <unknown function> + 0x58a7517 (0x14b4e7bc8517 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #13: at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long, bool, bool, bool, bool) + 0x29b (0x14b4e73eb0fb in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #14: at::native::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x21d (0x14b4e6bfbd3d in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #15: <unknown function> + 0x58a6f55 (0x14b4e7bc7f55 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #16: <unknown function> + 0x58a6fbf (0x14b4e7bc7fbf in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #17: at::_ops::convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long) + 0x223 (0x14b4e73ea443 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #18: at::native::conv1d(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) + 0x1c5 (0x14b4e6bfef35 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #19: <unknown function> + 0x5a57b31 (0x14b4e7d78b31 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #20: at::_ops::conv1d::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) + 0x20c (0x14b4e784698c in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #21: torch::nn::Conv1dImpl::forward(at::Tensor const&) + 0x3a0 (0x14b4ea352a40 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #22: dorado() [0xacf4d8]
frame #23: dorado() [0xad4078]
frame #24: dorado() [0xac0050]
frame #25: dorado() [0xac01b8]
frame #26: dorado() [0xabc643]
frame #27: <unknown function> + 0x1196e380 (0x14b4f3c8f380 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #28: <unknown function> + 0x89c02 (0x14b4e1089c02 in /lib64/libc.so.6)
frame #29: <unknown function> + 0x10ec40 (0x14b4e110ec40 in /lib64/libc.so.6)

/users/husw5y/.lsbatch/1728320850.4622875.shell: line 14: 2021786 Aborted                 (core dumped) dorado basecaller sup,5mCG_5hmCG,6mA --recursive --models-directory /data/lgg_data/R_D/nanopore/script/ --reference /data/lgg_data/R_D/nanopore/reference/minimap2/hg38.no_alt.fa.gz /data/lgg_data/R_D/nanopore/raw/HM24385/ > /data/lgg_data/R_D/nanopore/progress/WGS_HG002/HG002_UL_NA24385_sup_5mCG_5hmCG_6mA_hg38.no_alt_merged.bam
@Kirk3gaard

Do you avoid the crash when running on only one folder? Or have you considered testing cuda:0 to check whether it is a multi-GPU issue?
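
For example, a quick single-GPU, single-run test could look roughly like this (a sketch: the --device flag is dorado's standard device selector, <single_run_folder> stands for a directory holding POD5 files from only one run, and the output file name is arbitrary):

# Restrict dorado to one GPU and one run's POD5 files to separate a
# possible multi-GPU problem from a mixed-run problem.
dorado basecaller sup,5mCG_5hmCG,6mA --recursive --device cuda:0 \
    --models-directory /data/lgg_data/R_D/nanopore/script/ \
    /data/lgg_data/R_D/nanopore/raw/HM24385/<single_run_folder>/ \
    > single_run_cuda0_test.bam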


husamia commented Oct 8, 2024

Do you avoid the crash when running on only one folder? Or have you considered testing cuda:0 to check whether it is a multi-GPU issue?

I confirmed that it doesn't crash when running on only one folder. Could it be that metadata from reads belonging to different runs is not being handled properly, causing the crash?
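
If it is a metadata mismatch, one way to narrow it down would be to compare the run information stored in the two runs' POD5 files, for example with the pod5 tools (a sketch: the pod5 inspect summary subcommand and the run subfolder names are assumptions, not taken from this issue):

# Dump run-level metadata for each run's POD5 files and diff the results,
# looking for fields (sample rate, flow cell, kit, ...) that differ between runs.
pod5 inspect summary /data/lgg_data/R_D/nanopore/raw/HM24385/<run1>/*.pod5 > run1_info.txt
pod5 inspect summary /data/lgg_data/R_D/nanopore/raw/HM24385/<run2>/*.pod5 > run2_info.txt
diff run1_info.txt run2_info.txt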
