dorado 0.8.1 basecaller crashes when basecalling POD5 from two runs together #1069

Open
husamia opened this issue Oct 7, 2024 · 2 comments

husamia commented Oct 7, 2024

I encountered a bug in dorado that causes it to crash.

When basecalling with multiple models and passing POD5 data files from different runs together, the program crashes every time.

I ran dorado basecaller sup,5mCG_5hmCG,6mA with --recursive pointed at a folder that contains POD5 files from two different runs.
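
For reference, the full invocation (as echoed in the log below) was:

dorado basecaller sup,5mCG_5hmCG,6mA --recursive \
    --models-directory /data/lgg_data/R_D/nanopore/script/ \
    --reference /data/lgg_data/R_D/nanopore/reference/minimap2/hg38.no_alt.fa.gz \
    /data/lgg_data/R_D/nanopore/raw/HM24385/ \
    > /data/lgg_data/R_D/nanopore/progress/WGS_HG002/HG002_UL_NA24385_sup_5mCG_5hmCG_6mA_hg38.no_alt_merged.bam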

Run environment:

  • Dorado version: 0.8.1
  • Dorado command: dorado basecaller sup,5mCG_5hmCG,6mA
  • Operating system: Linux
  • Hardware: 4x NVIDIA A100
  • Source data type: POD5

Logs

[2024-10-07 13:07:57.736] [info] Running: "basecaller" "sup,5mCG_5hmCG,6mA" "--recursive" "--models-directory" "/data/lgg_data/R_D/nanopore/script/" "--reference" "/data/lgg_data/R_D/nanopore/reference/minimap2/hg38.no_alt.fa.gz" "/data/lgg_data/R_D/nanopore/raw/HM24385/"
[2024-10-07 13:09:11.181] [info] > Creating basecall pipeline
[2024-10-07 13:09:22.899] [warning] Unable to find chunk benchmarks for GPU "NVIDIA A100-SXM4-40GB", model /data/lgg_data/R_D/nanopore/script/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-07 13:09:22.899] [warning] Unable to find chunk benchmarks for GPU "NVIDIA A100-SXM4-40GB", model /data/lgg_data/R_D/nanopore/script/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-07 13:09:22.899] [warning] Unable to find chunk benchmarks for GPU "NVIDIA A100-SXM4-40GB", model /data/lgg_data/R_D/nanopore/script/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-07 13:09:22.899] [warning] Unable to find chunk benchmarks for GPU "NVIDIA A100-SXM4-40GB", model /data/lgg_data/R_D/nanopore/script/[email protected] and chunk size 1728. Full benchmarking will run for this device, which may take some time.
[2024-10-07 13:09:27.368] [info] cuda:3 using chunk size 12288, batch size 416
[2024-10-07 13:09:27.368] [info] cuda:1 using chunk size 12288, batch size 384
[2024-10-07 13:09:27.368] [info] cuda:2 using chunk size 12288, batch size 384
[2024-10-07 13:09:27.368] [info] cuda:0 using chunk size 12288, batch size 416
[2024-10-07 13:09:27.722] [info] cuda:3 using chunk size 6144, batch size 960
[2024-10-07 13:09:27.723] [info] cuda:1 using chunk size 6144, batch size 960
[2024-10-07 13:09:27.724] [info] cuda:0 using chunk size 6144, batch size 960
[2024-10-07 13:09:27.725] [info] cuda:2 using chunk size 6144, batch size 960
terminate called after throwing an instance of 'c10::CuDNNError'
  what():  cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
Exception raised from createCuDNNHandle at /pytorch/pyold/aten/src/ATen/cudnn/Handle.cpp:9 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x14b4ecd599b7 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #1: <unknown function> + 0x3f285b4 (0x14b4e62495b4 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #2: at::native::getCudnnHandle() + 0x725 (0x14b4ead451c5 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #3: <unknown function> + 0x89bf0f6 (0x14b4eace00f6 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #4: <unknown function> + 0x89c00db (0x14b4eace10db in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #5: <unknown function> + 0x89a54ca (0x14b4eacc64ca in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #6: at::native::cudnn_convolution(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x96 (0x14b4eacc6b16 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #7: <unknown function> + 0xa632127 (0x14b4ec953127 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #8: <unknown function> + 0xa6321e0 (0x14b4ec9531e0 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #9: at::_ops::cudnn_convolution::call(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, bool) + 0x23d (0x14b4e7848e9d in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #10: at::native::_convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long, bool, bool, bool, bool) + 0x1505 (0x14b4e6c07cf5 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #11: <unknown function> + 0x58a7496 (0x14b4e7bc8496 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #12: <unknown function> + 0x58a7517 (0x14b4e7bc8517 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #13: at::_ops::_convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long, bool, bool, bool, bool) + 0x29b (0x14b4e73eb0fb in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #14: at::native::convolution(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, bool, c10::ArrayRef<long>, long) + 0x21d (0x14b4e6bfbd3d in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #15: <unknown function> + 0x58a6f55 (0x14b4e7bc7f55 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #16: <unknown function> + 0x58a6fbf (0x14b4e7bc7fbf in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #17: at::_ops::convolution::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<long>, bool, c10::ArrayRef<c10::SymInt>, long) + 0x223 (0x14b4e73ea443 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #18: at::native::conv1d(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) + 0x1c5 (0x14b4e6bfef35 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #19: <unknown function> + 0x5a57b31 (0x14b4e7d78b31 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #20: at::_ops::conv1d::call(at::Tensor const&, at::Tensor const&, c10::optional<at::Tensor> const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long) + 0x20c (0x14b4e784698c in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #21: torch::nn::Conv1dImpl::forward(at::Tensor const&) + 0x3a0 (0x14b4ea352a40 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #22: dorado() [0xacf4d8]
frame #23: dorado() [0xad4078]
frame #24: dorado() [0xac0050]
frame #25: dorado() [0xac01b8]
frame #26: dorado() [0xabc643]
frame #27: <unknown function> + 0x1196e380 (0x14b4f3c8f380 in /usr/local/dorado/0.8.1/bin/../lib/libdorado_torch_lib.so)
frame #28: <unknown function> + 0x89c02 (0x14b4e1089c02 in /lib64/libc.so.6)
frame #29: <unknown function> + 0x10ec40 (0x14b4e110ec40 in /lib64/libc.so.6)

/users/husw5y/.lsbatch/1728320850.4622875.shell: line 14: 2021786 Aborted                 (core dumped) dorado basecaller sup,5mCG_5hmCG,6mA --recursive --models-directory /data/lgg_data/R_D/nanopore/script/ --reference /data/lgg_data/R_D/nanopore/reference/minimap2/hg38.no_alt.fa.gz /data/lgg_data/R_D/nanopore/raw/HM24385/ > /data/lgg_data/R_D/nanopore/progress/WGS_HG002/HG002_UL_NA24385_sup_5mCG_5hmCG_6mA_hg38.no_alt_merged.bam
@Kirk3gaard

Do you avoid the crash when running on only one folder? Or have you considered testing cuda:0 to check whether it is a multi-GPU issue?
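
For example, a quick single-GPU, single-run test could look roughly like this (a sketch: the --device flag is dorado's standard device selector, <single_run_folder> stands for a directory holding POD5 files from only one run, and the output file name is arbitrary):

# Restrict dorado to one GPU and one run's POD5 files to separate a
# possible multi-GPU problem from a mixed-run problem.
dorado basecaller sup,5mCG_5hmCG,6mA --recursive --device cuda:0 \
    --models-directory /data/lgg_data/R_D/nanopore/script/ \
    /data/lgg_data/R_D/nanopore/raw/HM24385/<single_run_folder>/ \
    > single_run_cuda0_test.bam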


husamia commented Oct 8, 2024

Do you avoid the crash when running on only one folder? Or have you considered testing cuda:0 to check whether it is a multi-GPU issue?

I confirmed that it doesn't crash when running on only one folder. Could it be that metadata from reads belonging to different runs is not being handled properly, causing the crash?
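
If it is a metadata mismatch, one way to narrow it down would be to compare the run information stored in the two runs' POD5 files, for example with the pod5 tools (a sketch: the pod5 inspect summary subcommand and the run subfolder names are assumptions, not taken from this issue):

# Dump run-level metadata for each run's POD5 files and diff the results,
# looking for fields (sample rate, flow cell, kit, ...) that differ between runs.
pod5 inspect summary /data/lgg_data/R_D/nanopore/raw/HM24385/<run1>/*.pod5 > run1_info.txt
pod5 inspect summary /data/lgg_data/R_D/nanopore/raw/HM24385/<run2>/*.pod5 > run2_info.txt
diff run1_info.txt run2_info.txt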
