
Allows large data? #5

Open
ozgunakalin opened this issue Dec 7, 2016 · 2 comments

Comments

@ozgunakalin

Hello,

I currently use scikit-learn's TSNE, and it is not very memory-friendly. I wonder how this project compares in terms of the number of rows it can handle. Thanks.

@saurfang
Owner

That was the hope, but then I found that I needed a scalable kNN implementation, which diverted me to work on https://github.com/saurfang/spark-knn. Unfortunately I no longer have time to pursue this project. However, I am happy to answer any questions or review any contributions.

@kartha01

kartha01 commented Jun 26, 2017

Curious, has anyone been able to run this on large datasets? I was wondering what issues you ran into and what the approximate run times were.

I am using a 3GB dataset with 100 features, and so far I have had to update the following properties with new values:

"spark.rpc.askTimeout=1000"
"spark.akka.frameSize=256"
"spark.driver.maxResultSize=2G"
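For anyone trying to reproduce this, these overrides can also be passed as `--conf` flags at launch time rather than edited into a properties file. A minimal sketch, assuming a `spark-submit` launch; the JAR name and master URL below are placeholders, not from this thread:

```shell
# Hypothetical spark-submit invocation carrying the three property
# overrides mentioned above (values taken from this comment).
spark-submit \
  --master yarn \
  --conf spark.rpc.askTimeout=1000 \
  --conf spark.akka.frameSize=256 \
  --conf spark.driver.maxResultSize=2G \
  your-tsne-app.jar
```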

to fix the exceptions I ran into. Also, the driver and executors need lots of memory; I am using 10G for each (with 12 executors), and the t-SNE is still running after about 14 hrs...

I am using the same approach as shown in the MNIST.scala example:
com/github/saurfang/spark/tsne/examples/MNIST.scala
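In case it saves someone a step, a launch of that example with the resources described above might look roughly like this. This is a sketch: the class name is inferred from the file path, and the assembly JAR name is a placeholder:

```shell
# Hypothetical launch of the MNIST example with the driver/executor
# sizing described in this comment (10G each, 12 executors).
spark-submit \
  --class com.github.saurfang.spark.tsne.examples.MNIST \
  --driver-memory 10G \
  --executor-memory 10G \
  --num-executors 12 \
  spark-tsne-assembly.jar
```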

Any thoughts/ideas on speeding this up?

Regards,
Rajesh
