I suggest generating a sparse distance matrix when the number of entries is greater than about 50,000.
You can make it sparse by choosing a minimum similarity (e.g., --similarity-threshold 0.8) or top-k (e.g., --topk 250). Dashing2 then indexes the data and only performs comparisons against near neighbors retrieved from the index.
You can parse the output files yourself, or you can choose binary output and use parsing code from dashing2/python/parse.py. Either way, once you have the sparse matrix, you can feed it into HDBSCAN to cluster quickly.
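As a rough sketch of that last step: assuming you have already parsed the output into `(i, j, similarity)` triples (how you get them depends on the output format you chose; the helper names below are hypothetical and not part of dashing2), you can turn similarities into distances and build a symmetric sparse matrix. The HDBSCAN call assumes the `hdbscan` and `scipy` Python packages are installed, and that your `hdbscan` version accepts a sparse precomputed matrix.

```python
def similarities_to_sparse_distances(triples):
    """Turn (i, j, similarity) triples into symmetric COO-style lists.

    Assumes similarities lie in [0, 1], so distance = 1 - similarity.
    Self-comparisons are dropped; duplicate pairs keep the smallest distance.
    """
    best = {}
    for i, j, sim in triples:
        if i == j:
            continue
        key = (min(i, j), max(i, j))
        dist = 1.0 - sim
        if key not in best or dist < best[key]:
            best[key] = dist
    rows, cols, vals = [], [], []
    for (i, j), d in sorted(best.items()):
        # Store both (i, j) and (j, i) so the matrix is symmetric.
        rows += [i, j]
        cols += [j, i]
        vals += [d, d]
    return rows, cols, vals


def cluster_with_hdbscan(rows, cols, vals, n, min_cluster_size=5):
    """Cluster n items from sparse precomputed distances (needs scipy + hdbscan)."""
    import scipy.sparse as sp
    import hdbscan

    dist = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
    clusterer = hdbscan.HDBSCAN(metric="precomputed",
                                min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(dist)
```

Pairs missing from the sparse matrix are treated as "not neighbors", so a threshold or top-k that is too strict can leave the neighbor graph disconnected; if HDBSCAN complains, try loosening it.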
The alternative is to cluster directly in dashing2 using --greedy <similarity_threshold>, which puts all sequences above the threshold into the same cluster and only compares each new sequence against the largest sequence in each cluster. It's fast, though the clusters are generally lower quality than what HDBSCAN can produce.
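To give a feel for the greedy idea, here is a toy sketch in Python. It is an illustration only, not dashing2's implementation: it uses exact Jaccard similarity on plain sets instead of sketches, and it assumes items are visited from largest to smallest so that each cluster's first member (its representative) is also its largest.

```python
def jaccard(a, b):
    """Exact Jaccard similarity between two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def greedy_cluster(items, threshold):
    """Assign each item to the first cluster whose representative is similar enough.

    items: list of sets (e.g. k-mer sets); threshold: minimum similarity.
    Returns one cluster label per item, in the original item order.
    """
    # Visit larger items first so each representative is its cluster's largest member.
    order = sorted(range(len(items)), key=lambda i: -len(items[i]))
    reps = []                      # index of the representative of each cluster
    labels = [None] * len(items)
    for idx in order:
        for cid, rep in enumerate(reps):
            # Compare against the representative only, never every member.
            if jaccard(items[idx], items[rep]) >= threshold:
                labels[idx] = cid
                break
        else:
            # No representative is close enough: start a new cluster.
            labels[idx] = len(reps)
            reps.append(idx)
    return labels
```

Because each item is compared against one representative per cluster rather than every member, the work scales with the number of clusters instead of the number of sequences. That is where the speed comes from, and also why borderline items can land in a less-than-ideal cluster.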
For genomes, all-pairs is good enough for a lot of applications, but at the level of reads or transcripts you often need to use the sparse computation modes.
Hi,
I use dashing2 on two different files:
$repo/dashing2 sketch --parse-by-seq --cmpout $outfile $genome
Can you suggest something for improving runtime efficiency?
Can you please advise regarding this error?