Clusty is slow to read long inputs #2

apcamargo · 2024-10-16T20:21:12Z

I generated a large similarity table from ~1 million genomes using sourmash's branchwater and tried to cluster it with Clusty. Clusty had not finished reading the input after 6 hours, while pyLeiden read everything within ~30 min.

Is there a reason Clusty is taking so much time to read the input?

agudys · 2024-10-16T21:06:38Z

Hello!

We have managed to run Clusty on distance tables of tens of gigabytes, so it seems some issue made Clusty hang on your dataset. How large is your data file, and how many distances does it contain? Could you please provide me with at least part of it?

Best,
Adam

apcamargo · 2024-10-17T04:55:53Z

I don't have the original input anymore, but a filtered version (which Clusty is also taking a long time to read) is ~34GB with 401,294,724 lines (not counting the header).

Here's a sample with 500k lines.
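
For anyone who needs to cut a similar sample from a large table, here is a minimal sketch. It assumes the table is a plain-text file with a single header line, as described above. The function name `write_sample` and the file names in the usage comment are hypothetical and not part of Clusty or sourmash.

```python
from itertools import islice


def write_sample(src_path, dst_path, n_rows=500_000):
    """Copy the header line plus the first n_rows data lines to dst_path.

    Files are opened in binary mode so lines are copied byte for byte.
    Returns the number of data lines written (the header is not counted).
    """
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Assumption: the first line is the header.
        dst.write(src.readline())
        # islice stops after n_rows lines without reading the rest of the file.
        for line in islice(src, n_rows):
            dst.write(line)
            written += 1
    return written


# Hypothetical usage:
# write_sample("similarities.csv", "similarities.sample.csv", n_rows=500_000)
```

This takes the first rows rather than a random subset, so it reproduces the format of the file but not necessarily the overall distribution of distances.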

agudys self-assigned this Oct 16, 2024

agudys added bug Something isn't working and removed bug Something isn't working labels Oct 16, 2024
