Dask: KMeans #6277
Conversation
force-pushed from 252d8fa to 928fb38
```python
# Do not needlessly recluster the data if X hasn't changed
if old_data and self.data and array_equal(self.data.X, old_data.X):  # could be an issue for dask
```
This is a good find. I remember you telling me about it, but then I forgot. I do not have a good solution for now, but we would need to ensure (and test) at the minimum that `array_equal` does not load the whole data set into memory.
Turns out `np.allclose` works fine on dask. Comparing an 8GB array against itself took a little over two seconds (this is the worst-case scenario, running in an ipython shell that doesn't play well with dask), and memory usage stayed very much within reasonable limits.
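For illustration, a minimal sketch of why this stays memory-safe (the array shape and chunking here are my assumptions, not the setup used above):

```python
import numpy as np
import dask.array as da

# Build a chunked array; chunks are compared pair-wise, so the whole
# array never has to be materialised in memory at once.
x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))  # ~0.8 GB total

# NEP-18 dispatch routes np.allclose to dask's implementation, which
# returns a lazy result; bool() triggers the chunk-wise computation.
print(bool(np.allclose(x, x)))  # True
```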
force-pushed from 752c917 to 6351ae2
Please add […] Next, this does not actually use context settings: they would always need a context handler (in this case it should detect whether you have a dask table or not), and then they need openContext and closeContext calls. Maybe we do not need them here and they were a bad idea I had... But what needs to work is the following:

From the code I'd guess this does not work at least for some possible O. :) And then we have to remember that we still need settings migration.
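For reference, a hedged sketch of how context settings are typically wired in an Orange widget (the widget and setting names are invented for illustration; a dask-aware handler would need extra logic):

```python
from Orange.data import Table
from Orange.widgets import settings
from Orange.widgets.widget import Input, OWWidget

class OWExample(OWWidget):
    name = "Example"

    # The handler decides which stored contexts match the current input;
    # here it would also need to detect whether the table is dask-backed.
    settingsHandler = settings.DomainContextHandler()
    chosen_option = settings.ContextSetting(0)

    class Inputs:
        data = Input("Data", Table)

    @Inputs.data
    def set_data(self, data):
        self.closeContext()     # store settings for the outgoing context
        self.data = data
        self.openContext(data)  # restore settings matching the new context
```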
Another thing: this is only available for Dask, so it will be off for the majority of users. If so, I would disable it when inapplicable. If chosen, I would keep it chosen (but disabled!), and have a warning etc. So the user may manually choose another option, or keep the disabled one, which won't work anyway. An alternative to the above would be to hide the option if it is unavailable and not chosen.
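A hedged sketch of that suggestion, reduced to plain logic (the option index and names are hypothetical, not the PR's code):

```python
KMEANS_BAR = 2  # hypothetical index of the dask-only "k-means||" option

def resolve_init_option(chosen: int, data_is_dask: bool) -> tuple[int, bool]:
    """Return (effective option, warn?): keep a stale dask-only choice
    selected (the UI would grey it out) but flag it as unusable."""
    if chosen == KMEANS_BAR and not data_is_dask:
        return chosen, True   # keep it chosen (disabled in the UI), warn
    return chosen, False

print(resolve_init_option(KMEANS_BAR, data_is_dask=False))  # (2, True)
```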
Codecov Report

```
@@            Coverage Diff            @@
##             dask    #6277     +/-  ##
=========================================
  Coverage   87.64%   87.65%
=========================================
  Files         322      322
  Lines       69601    69637      +36
=========================================
+ Hits        61002    61040      +38
+ Misses       8599     8597       -2
```
force-pushed from e194f30 to 42f9879
force-pushed from 3a1a379 to 2de0510
force-pushed from 42f9879 to 554aed6
force-pushed from 5571f8e to 8e15f49
```python
k = Setting(3)
k_from = Setting(2)
k_to = Setting(8)
optimize_k = Setting(False)
max_iterations = Setting(300)
n_init = Setting(10)
smart_init = Setting(0)  # KMeans++
```
This change, if we go with it, also requires settings migration. Just leaving it here as a note.
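For illustration, a hedged sketch of what such a migration might look like in Orange (the version number, and the assumption that adding "k-means||" shifts the init option indices, are mine, not the PR's):

```python
from Orange.widgets.widget import OWWidget

class OWKMeans(OWWidget):
    name = "k-Means"
    settings_version = 2

    @classmethod
    def migrate_settings(cls, settings, version):
        if version is None or version < 2:
            # Hypothetical: if "k-means||" were inserted into the init
            # option list, indices of the options after it shift by one.
            old = settings.get("smart_init", 0)
            settings["smart_init"] = old + 1 if old >= 1 else old
```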
force-pushed from 8e15f49 to a0acc61
force-pushed from a0acc61 to 1c7206a
force-pushed from 7ee6d02 to 0acd48b
Thanks. We need at least some basic tests for this.
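Something along these lines, perhaps (a hedged sketch; it exercises dask_ml's KMeans directly rather than the widget, and the class name is invented):

```python
import unittest

import dask.array as da
from dask_ml.cluster import KMeans

class TestDaskKMeans(unittest.TestCase):
    def test_fit_on_dask_array(self):
        # A small chunked array stands in for out-of-core data.
        X = da.random.random((1_000, 5), chunks=(200, 5))
        model = KMeans(n_clusters=3, init="k-means||").fit(X)
        self.assertEqual(model.cluster_centers_.shape, (3, 5))

if __name__ == "__main__":
    unittest.main()
```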
force-pushed from e01318f to b13b1d8
Add `dask_ml.cluster.KMeans` as an alternative to `sklearn.cluster.KMeans` for dask arrays.

Probably worth mentioning: dask_ml adds the "k-means||" init option but defaults to the sklearn implementation in the case of "k-means++" or "random" initialization. So using "k-means||" is what enables working with larger datasets at all. However, smaller datasets are still processed much faster just using sklearn directly, as there seems to be a lot of overhead with dask_ml.
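A hedged sketch of how that backend choice could look (this helper is illustrative, not the PR's actual code):

```python
import dask.array as da

def make_kmeans(X, n_clusters=3):
    """Pick the clustering backend by array type: only dask_ml's
    "k-means||" init scales to out-of-core dask arrays, while plain
    sklearn is faster on in-memory data."""
    if isinstance(X, da.Array):
        from dask_ml.cluster import KMeans
        return KMeans(n_clusters=n_clusters, init="k-means||")
    from sklearn.cluster import KMeans
    return KMeans(n_clusters=n_clusters, init="k-means++")
```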