HERRO's 10Kb input recommendation #34

estolle · 2024-05-27T14:44:44Z

Hi

I have been using HERRO since last weekend, when ONT included it in their dorado 0.7 basecaller (I am running it through dorado). I have 10.4.1 ONT data from bees (some of them are male, i.e. haploid).

First I got a CUDA out-of-memory error some 618m into the analysis (time /opt/dorado-0.7.0-linux-x64/bin/dorado correct --verbose --threads 40 --device 'cuda:0' -b 56 --infer-threads 2 -m /opt/dorado-0.7.0-linux-x64/bin/herro-v1 $INPUTFOLDER/$INPUT.gz > $SPECIES.dorado.sup430.2kbQ90.herro-v1.fa)

After setting PYTORCH_CUDA_ALLOC_CONF and switching the batch size to auto (-b 0), it worked (but it took 2 days, basically as long as basecalling, using 19 GB out of 24 GB on my RTX 3090):
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:4096
time /opt/dorado-0.7.0-linux-x64/bin/dorado correct --verbose --threads 60 --device 'cuda:0' -b 0 --infer-threads 2 -m /opt/dorado-0.7.0-linux-x64/bin/herro-v1 $INPUTFOLDER/$INPUT.gz > $SPECIES.dorado.sup430.2kbQ90.herro-v1.fa

The resulting assemblies (various flye, hifiasm) look very promising, including some T2T chromosomes that previously were not T2T. The assemblies are longer, with similarly very good N50 and other stats, so I am quite happy with this so far.

For our data processing, we have a question regarding reads shorter than 10 kb.

Should we remove these entirely? We usually have a lot of reads around that length, i.e. many that are 2-10 kb, while only a certain fraction is above 10 kb. The amount of 10 kb+ reads is sometimes very good, so filtering is no problem.

1. Is it better to run HERRO on a filtered dataset (10 kb+) than on all reads (including the small ones)?
2. If the small reads are included, are they also corrected, or do they remain unmodified?
3. Is it detrimental to leave shorter reads in?

I'm just trying to understand the potential risks/biases of not pruning the datasets (sometimes the input datasets are not ideally distributed, i.e. too short on average, too few long reads).

Thanks for your recommendation

dominikstanojevic · 2024-07-09T02:39:36Z

Hello,

sorry for taking some time to respond, we're working on the new manuscript version.

1. and 2. You don't have to automatically remove shorter reads, but most of them (especially those shorter than 5 kbp) will not be used. This is because alignments must span at least one full window.

3. Shorter (uncorrected) reads will not be output; maybe I will add a flag in the future to store uncorrected reads in a separate file.
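
To get a feel for how much of a dataset falls below the lengths discussed above, a minimal Python sketch like the following can tally reads and bases per length class. It is not part of HERRO or dorado. The function names, the 5 kb and 10 kb cutoffs, and the assumption of a FASTQ file with exactly four lines per record (optionally gzip-compressed) are all illustrative choices.

```python
import gzip
from collections import Counter


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def read_lengths(fastq_path):
    """Yield the sequence length of each record (assumes 4 lines per record)."""
    with open_text(fastq_path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the second line of each record is the sequence
                yield len(line.strip())


def length_summary(lengths, cutoffs=(5_000, 10_000)):
    """Count reads and bases below, between, and at or above the two cutoffs.

    The default cutoffs are illustrative assumptions, not HERRO parameters.
    """
    low, high = cutoffs
    reads, bases = Counter(), Counter()
    for n in lengths:
        if n < low:
            label = f"<{low}"
        elif n < high:
            label = f"{low}-{high - 1}"
        else:
            label = f">={high}"
        reads[label] += 1
        bases[label] += n
    return dict(reads), dict(bases)
```

For example, `reads, bases = length_summary(read_lengths("reads.fastq.gz"))` (hypothetical file name) returns two dictionaries keyed by length class, which shows how many reads and bases sit in the range that will mostly go unused.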

Best,
Dominik

asan-emirsaleh · 2024-09-22T15:20:23Z

Hello @dominikstanojevic !
Thank you for your efforts in developing open-source tools! I have a similar question. Our dataset is limited in size (we have ~12x coverage of the genome) and biased toward shorter sequences (most of the data is around 1500-10000 bp). The Phred quality of the shorter reads is much better than that of the longer reads. So I am interested in utilizing our whole dataset and wondered how I can do that using HERRO.

If the short reads will not be utilized, could I filter the dataset by length before processing in order to reduce the computation at the alignment stage?
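
Mechanically, a length filter of this kind could be sketched in Python as below. Whether pre-filtering is advisable for HERRO is not answered in this thread, so this only shows the filtering step itself. The function name `filter_fastq_by_length`, the 5 kb default cutoff, and the assumption of a four-line-per-record FASTQ (optionally gzip-compressed) are illustrative.

```python
import gzip


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed text file in the given text mode."""
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def filter_fastq_by_length(in_path, out_path, min_len=5_000):
    """Copy records whose sequence is at least min_len bases long.

    Returns (kept, total). The default min_len is an assumed cutoff,
    not a value taken from HERRO or dorado.
    """
    kept = total = 0
    with open_text(in_path) as src, open_text(out_path, "wt") as dst:
        while True:
            # Read one record: header, sequence, separator, qualities.
            record = [src.readline() for _ in range(4)]
            if not record[0]:  # an empty header line means end of file
                break
            total += 1
            if len(record[1].strip()) >= min_len:
                dst.writelines(record)
                kept += 1
    return kept, total
```

Calling `filter_fastq_by_length("reads.fastq.gz", "reads.min5kb.fastq.gz")` (hypothetical file names) writes the retained reads and reports how many were kept, so the loss in coverage can be checked before committing to the filtered set.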

Best regards
Asan

estolle changed the title ~~HEROO's 10Kb input recommendation~~ HERRO's 10Kb input recommendation May 27, 2024

1Wencai mentioned this issue Dec 2, 2024

herro inference error #71

Open
