1M features x 60M rows? #3

Open
Tagar opened this issue Jan 29, 2019 · 2 comments
@Tagar

Tagar commented Jan 29, 2019

Would this library scale to

  • over 1M features (sparse; the average feature density is around 12%);
  • over 60M rows?

https://stats.stackexchange.com/questions/355260/distributed-pca-or-an-equivalent

Thank you.

@alois-bissuel
Contributor

If I understand correctly, the sparsity of your data is around 10%.
Our library is routinely used to decompose matrices of size 100M x 100M, though much sparser ones.
I do not see any reason why the library should not work, though you will need to tweak the parameters a bit.
Please adjust the block size and the number of blocks per partition so that each partition of the matrix and of the dense embeddings stays under 2 GB (for the definitions of block and number of blocks per partition, see our article on Medium), and start with a small embedding size (100, for instance).
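For illustration, here is a minimal sketch of that setup, following the usage pattern from the Spark-RSVD README (`RSVDConfig`, `BlockMatrix.fromMatrixEntries`, `RSVD.run`). The block size, partition dimensions, and embedding size below are assumptions for a 60M x 1M matrix, not tuned recommendations; they would still need to be adjusted until each partition stays under 2 GB, and the random entries are a stand-in for the real sparse data.

```scala
import com.criteo.rsvd._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import scala.util.Random

val sc = new SparkContext(new SparkConf().setAppName("rsvd-sketch"))

val matHeight = 60000000 // ~60M rows
val matWidth  = 1000000  // ~1M features

// Illustrative values only: shrink blockSize / blocks-per-partition until each
// partition of the sparse matrix and of the dense embeddings is below 2 GB.
val config = RSVDConfig(
  embeddingDim = 100, // start with a small embedding size
  oversample = 30,
  powerIter = 1,
  seed = 0,
  blockSize = 10000,
  partitionWidthInBlocks = 1,
  partitionHeightInBlocks = 10,
  computeLeftSingularVectors = true,
  computeRightSingularVectors = true
)

// Toy random entries standing in for the real sparse data
// (an RDD[MatrixEntry] of row index, column index, value).
val numNonZeroEntries = 1000000
val entries = sc.parallelize(0 until numNonZeroEntries).map { idx =>
  val random = new Random(42 + idx)
  MatrixEntry(random.nextInt(matHeight), // row index
              random.nextInt(matWidth),  // column index
              random.nextGaussian())     // entry value
}

val matrixToDecompose = BlockMatrix.fromMatrixEntries(
  entries,
  matHeight,                      // matrix height (number of rows)
  matWidth,                       // matrix width (number of features)
  config.blockSize,
  config.partitionHeightInBlocks,
  config.partitionWidthInBlocks
)

val RsvdResults(leftSingularVectors, singularValues, rightSingularVectors) =
  RSVD.run(matrixToDecompose, config, sc)
```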

@Tagar
Author

Tagar commented Jan 31, 2019

Thanks a lot @alois-bissuel

We will definitely give this distributed Spark-RSVD library a try!
Those tuning recommendations will be very helpful.
