1M features x 60M rows? #3

Open
Tagar opened this issue Jan 29, 2019 · 2 comments
@Tagar

Tagar commented Jan 29, 2019

Would this library scale to

  • over 1M features (sparse; the average feature density is around 12%);
  • over 60M rows?

https://stats.stackexchange.com/questions/355260/distributed-pca-or-an-equivalent

Thank you.

@alois-bissuel
Contributor

If I understand correctly, the sparsity of your data is around 10%.
Our library is routinely used to decompose matrices of size 100M x 100M, though much sparser ones.
I do not see any reason why the library should not work, though you will need to tweak the parameters a bit.
Please adjust the block size and the number of blocks per partition so that each partition of the matrix and of the dense embeddings stays under 2 GB (for the definitions of block and number of blocks per partition, see our article on Medium), and start with a small embedding size (100, for instance).
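For illustration, here is a minimal sketch of that setup, following the usage pattern from the Spark-RSVD README (`RSVDConfig`, `BlockMatrix.fromMatrixEntries`, `RSVD.run`). The block size, partition dimensions, and embedding size below are assumptions for a 60M x 1M matrix, not tuned recommendations; they would still need to be adjusted until each partition stays under 2 GB, and the random entries are a stand-in for the real sparse data.

```scala
import com.criteo.rsvd._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
import scala.util.Random

val sc = new SparkContext(new SparkConf().setAppName("rsvd-sketch"))

val matHeight = 60000000 // ~60M rows
val matWidth  = 1000000  // ~1M features

// Illustrative values only: shrink blockSize / blocks-per-partition until each
// partition of the sparse matrix and of the dense embeddings is below 2 GB.
val config = RSVDConfig(
  embeddingDim = 100, // start with a small embedding size
  oversample = 30,
  powerIter = 1,
  seed = 0,
  blockSize = 10000,
  partitionWidthInBlocks = 1,
  partitionHeightInBlocks = 10,
  computeLeftSingularVectors = true,
  computeRightSingularVectors = true
)

// Toy random entries standing in for the real sparse data
// (an RDD[MatrixEntry] of row index, column index, value).
val numNonZeroEntries = 1000000
val entries = sc.parallelize(0 until numNonZeroEntries).map { idx =>
  val random = new Random(42 + idx)
  MatrixEntry(random.nextInt(matHeight), // row index
              random.nextInt(matWidth),  // column index
              random.nextGaussian())     // entry value
}

val matrixToDecompose = BlockMatrix.fromMatrixEntries(
  entries,
  matHeight,                      // matrix height (number of rows)
  matWidth,                       // matrix width (number of features)
  config.blockSize,
  config.partitionHeightInBlocks,
  config.partitionWidthInBlocks
)

val RsvdResults(leftSingularVectors, singularValues, rightSingularVectors) =
  RSVD.run(matrixToDecompose, config, sc)
```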

@Tagar
Author

Tagar commented Jan 31, 2019

Thanks a lot @alois-bissuel

We will definitely give this distributed Spark-RSVD library a try!
Those tuning recommendations will be very helpful.
