Segmentation fault error #77

Open
kaysahu opened this issue Jul 11, 2023 · 1 comment

Comments

kaysahu commented Jul 11, 2023

Hi,
I ran dashing2 on two different files:

  1. Genome file: I used the basic dashing2 sketch command, which took nearly 2 days to process.

$repo/dashing2 sketch --parse-by-seq --cmpout $outfile $genome

Can you suggest something for improving runtime efficiency?

  2. Protein file: This input file was converted from the genome file in (1). After processing for 96 hours, I got this error:

[image: screenshot of the error]

Can you please advise regarding this error?

Owner

dnbaker commented Jul 13, 2023

Hi -

I suggest generating a sparse distance matrix when the number of entries is greater than about 50,000.

You can make it sparse by choosing a minimum similarity (e.g., --similarity-threshold 0.8) or top-k (e.g., --topk 250). Dashing2 then indexes the data and only performs comparisons against near neighbors retrieved from the index.
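To make the two filters concrete, here is a minimal Python sketch of what a minimum-similarity cutoff and a top-k cutoff each keep. It is illustrative only: the function names and the `sims` dictionary are hypothetical, and it filters an already computed set of pairs, whereas dashing2 uses its index to avoid computing most pairs in the first place.

```python
import heapq


def sparsify_by_threshold(sims, min_similarity):
    """Keep only the pairs whose similarity is at least min_similarity.

    sims maps an unordered pair (i, j) to a similarity in [0, 1].
    """
    return {pair: s for pair, s in sims.items() if s >= min_similarity}


def sparsify_by_topk(sims, k):
    """For every item, keep its k most similar neighbors.

    Returns {item: [(similarity, neighbor), ...]} sorted by decreasing similarity.
    """
    neighbors = {}
    for (i, j), s in sims.items():
        # Each pair contributes one neighbor entry to both of its endpoints.
        neighbors.setdefault(i, []).append((s, j))
        neighbors.setdefault(j, []).append((s, i))
    return {i: heapq.nlargest(k, nbrs) for i, nbrs in neighbors.items()}
```

With a threshold, the number of retained entries depends on how similar the data are; with top-k, every item keeps at most k entries, so the output size is bounded in advance.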

You can parse the output files yourself, or you can choose binary output and use parsing code from dashing2/python/parse.py. Either way, once you have the sparse matrix, you can feed it into HDBSCAN to cluster quickly.
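As a rough sketch of the step between parsing and clustering, the following turns parsed similarities into a symmetric sparse distance matrix in CSR form. The `(i, j, similarity)` triple format and the function name are assumptions made for illustration, not the output format of dashing2 or of parse.py, and distance is taken to be `1 - similarity`.

```python
def similarities_to_csr_distances(n, triples):
    """Build CSR arrays for a symmetric n x n sparse distance matrix.

    triples is an iterable of (i, j, similarity); this layout is hypothetical.
    Distance is defined here as 1 - similarity.
    Returns (data, indices, indptr) in the usual CSR convention.
    """
    rows = [dict() for _ in range(n)]
    for i, j, sim in triples:
        dist = 1.0 - sim
        # Store both directions so the matrix is symmetric.
        rows[i][j] = dist
        rows[j][i] = dist

    data, indices, indptr = [], [], [0]
    for row in rows:
        for col in sorted(row):
            indices.append(col)
            data.append(row[col])
        indptr.append(len(indices))
    return data, indices, indptr
```

If SciPy is available, the three arrays can be wrapped with `scipy.sparse.csr_matrix((data, indices, indptr), shape=(n, n))`. Whether a given HDBSCAN implementation accepts a sparse precomputed distance matrix, and how it treats missing entries, is worth checking in that library's documentation.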

The alternative is to cluster directly in dashing2 using --greedy <similarity_threshold>, which puts all sequences above the threshold into the same cluster and only compares new sequences against the largest sequence in each cluster. It's quick, though the clusters are not as high quality as those HDBSCAN can generate.
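The greedy idea can be sketched in a few lines of Python. This is not dashing2's implementation: the helper names are hypothetical, exact Jaccard similarity over k-mer sets stands in for sketch-based similarity, and visiting sequences longest-first is an assumption made so that each cluster's first member is also its largest.

```python
def kmers(seq, k=3):
    """Set of all length-k substrings of seq (toy stand-in for a sketch)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(seqs, threshold, k=3):
    """Greedy clustering against one representative per cluster.

    Each sequence joins the first cluster whose representative is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into seqs.
    """
    # Longest first, so the representative is the largest member of its cluster.
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]), reverse=True)
    reps = []      # k-mer set of each cluster's representative
    clusters = []  # member indices of each cluster
    for i in order:
        ks = kmers(seqs[i], k)
        for c, rep in enumerate(reps):
            if jaccard(ks, rep) >= threshold:
                clusters[c].append(i)
                break
        else:
            # No representative was similar enough: start a new cluster.
            reps.append(ks)
            clusters.append([i])
    return clusters
```

Because each new sequence is compared only against cluster representatives rather than against every earlier sequence, the work scales with the number of clusters, which is where the speed comes from. The price is that results depend on processing order and on the choice of representative.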

For genomes, all-pairs is good enough for a lot of applications, but at the level of reads or transcripts you often need to use the sparse computation modes.

Happy to answer more questions, and good luck!

Daniel
