Segmentation fault error #77

kaysahu · 2023-07-11T00:56:35Z

Hi,
I ran dashing2 on two different files:

1. Genome file: I used the basic dashing2 sketch command, which took nearly 2 days to process.

$repo/dashing2 sketch --parse-by-seq --cmpout $outfile $genome

Can you suggest something for improving runtime efficiency?

2. Protein file: This input file was converted from the genome file in (1). After 96 hours of processing, I got this error:

Can you please advise regarding this error?

dnbaker · 2023-07-13T01:08:11Z

Hi -

I suggest generating a sparse distance matrix when the number of entries is greater than about 50,000.

You can make it sparse by choosing a minimum similarity (e.g., --similarity-threshold 0.8) or top-k (e.g., --topk 250). Dashing2 then indexes the data and only performs comparisons against near neighbors retrieved from the index.
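To illustrate what a thresholded or top-k result looks like, here is a minimal Python sketch. It is not Dashing2's implementation: the helper names (`kmers`, `jaccard`, `sparse_similarities`) are hypothetical, it uses exact Jaccard similarity on k-mer sets rather than sketches, and it scans all pairs by brute force, whereas Dashing2 uses its index to skip most comparisons.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sparse_similarities(seqs, threshold=None, topk=None, k=3):
    """Map each sequence index to its retained (neighbor, similarity) pairs.

    Brute force for clarity only; a real index avoids the all-pairs scan.
    """
    sets = [kmers(s, k) for s in seqs]
    rows = {i: [] for i in range(len(seqs))}
    for i, j in combinations(range(len(seqs)), 2):
        sim = jaccard(sets[i], sets[j])
        # Keep the pair only if it clears the minimum similarity, if one is set.
        if threshold is None or sim >= threshold:
            rows[i].append((j, sim))
            rows[j].append((i, sim))
    for i in rows:
        # Most similar neighbors first, then truncate to the top k if requested.
        rows[i].sort(key=lambda pair: -pair[1])
        if topk is not None:
            rows[i] = rows[i][:topk]
    return rows


seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
rows = sparse_similarities(seqs, threshold=0.5)
print(rows)
```

Only the first two sequences share enough k-mers to be kept, so the third sequence ends up with an empty neighbor list. That is the sense in which the output is sparse.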

You can parse the output files yourself, or you can choose binary output and use parsing code from dashing2/python/parse.py. Either way, once you have the sparse matrix, you can feed it into HDBSCAN to cluster quickly.

The alternative is to cluster directly in dashing2 using --greedy <similarity_threshold>, which puts all sequences above the threshold into the same cluster and only compares new sequences against the largest sequence in the cluster. It's quick, though the clusters are not as high quality as those HDBSCAN can generate.
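The greedy idea described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dashing2's code: `greedy_cluster` and its helpers are hypothetical names, similarity is exact Jaccard on k-mer sets, and "largest" is taken to mean the longest sequence in the cluster.

```python
def kmers(seq, k=3):
    """Return the set of k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def greedy_cluster(seqs, threshold, k=3):
    """Assign each sequence to the first cluster whose representative is similar enough.

    Each new sequence is compared only against one representative per cluster
    (assumed here to be the longest member), never against every member.
    """
    sets = [kmers(s, k) for s in seqs]
    clusters = []  # each cluster: {"rep": index, "members": [indices]}
    for i, current in enumerate(sets):
        for cluster in clusters:
            if jaccard(current, sets[cluster["rep"]]) >= threshold:
                cluster["members"].append(i)
                # Promote the new sequence if it is longer than the representative.
                if len(seqs[i]) > len(seqs[cluster["rep"]]):
                    cluster["rep"] = i
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append({"rep": i, "members": [i]})
    return [cluster["members"] for cluster in clusters]


seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
clusters = greedy_cluster(seqs, threshold=0.7)
print(clusters)
```

Because each sequence is checked against one representative per cluster, the cost grows with the number of clusters rather than the number of sequences. That is why the approach is fast, and also why the result depends on input order and can be coarser than a density-based method.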

For genomes, all-pairs is good enough for a lot of applications, but at the level of reads or transcripts you often need to use the sparse computation modes.

Happy to answer more questions, and good luck!

Daniel
