Clusty is slow to read long inputs #2

Open
apcamargo opened this issue Oct 16, 2024 · 2 comments

@apcamargo

I generated a large similarity table from ~1 million genomes using sourmash's branchwater and tried to cluster it with Clusty. Clusty didn't finish reading the input after 6 hours, while pyLeiden finished reading everything within ~30 min.

Is there a reason Clusty is taking so much time to read the input?

agudys self-assigned this Oct 16, 2024
agudys added and removed the bug label Oct 16, 2024
@agudys
Member

agudys commented Oct 16, 2024

Hello!

We managed to run Clusty on distance tables of tens of gigabytes, so it seems some issue made Clusty hang on your dataset. How large is your data file, and how many distances does it contain? Could you please provide me with at least part of it?

Best,
Adam

@apcamargo
Author

I don't have the original input anymore, but a filtered version (which Clusty is also taking a long time to read) is ~34GB with 401,294,724 lines (not counting the header).

Here's a sample with 500k lines.
