Dask: KMeans #6277
Conversation
force-pushed from 252d8fa to 928fb38
```python
# Do not needlessly recluster the data if X hasn't changed
if old_data and self.data and array_equal(self.data.X, old_data.X):  # could be an issue for dask
```
This is a good find. I remember you telling me about it, but then I forgot. I do not have a good solution for now, but we would need to ensure (and test) at the minimum that `array_equal` does not load the whole data set into memory.
Turns out `np.allclose` works fine on dask. Comparing an 8GB array against itself took a little over two seconds (this is the worst-case scenario, running in an ipython shell that doesn't play well with dask), and memory usage stayed very much within reasonable limits.
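For illustration, a minimal sketch of why this stays memory-safe (the array shape and chunking here are my assumptions, not the setup used above):

```python
import numpy as np
import dask.array as da

# Build a chunked array; chunks are compared pair-wise, so the whole
# array never has to be materialised in memory at once.
x = da.random.random((100_000, 1_000), chunks=(10_000, 1_000))  # ~0.8 GB total

# NEP-18 dispatch routes np.allclose to dask's implementation, which
# returns a lazy result; bool() triggers the chunk-wise computation.
print(bool(np.allclose(x, x)))  # True
```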
force-pushed from 752c917 to 6351ae2
Please add […] Next, this does not actually use context settings: they would always need a context handler (in this case it should detect whether you have a dask table or not), and then they need openContext and closeContext calls. Maybe we do not need them here and they were a bad idea I had... But what needs to work is the following:

From the code I'd guess this does not work at least for some possible O. :) And then we have to remember that we still need settings migration.
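For reference, a hedged sketch of how context settings are typically wired in an Orange widget (the widget and setting names are invented for illustration; a dask-aware handler would need extra logic):

```python
from Orange.data import Table
from Orange.widgets import settings
from Orange.widgets.widget import Input, OWWidget

class OWExample(OWWidget):
    name = "Example"

    # The handler decides which stored contexts match the current input;
    # here it would also need to detect whether the table is dask-backed.
    settingsHandler = settings.DomainContextHandler()
    chosen_option = settings.ContextSetting(0)

    class Inputs:
        data = Input("Data", Table)

    @Inputs.data
    def set_data(self, data):
        self.closeContext()     # store settings for the outgoing context
        self.data = data
        self.openContext(data)  # restore settings matching the new context
```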
Another thing: this is only available for Dask, so it will be off for the majority of users. If so, I would disable it when inapplicable. If chosen, I would keep it chosen (but disabled!), and have a warning etc. So the user may manually choose another option, or keep the disabled one, which won't work anyway. An alternative to the above would be to hide the option if it is unavailable and not chosen.
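A hedged sketch of that suggestion, reduced to plain logic (the option index and names are hypothetical, not the PR's code):

```python
KMEANS_BAR = 2  # hypothetical index of the dask-only "k-means||" option

def resolve_init_option(chosen: int, data_is_dask: bool) -> tuple[int, bool]:
    """Return (effective option, warn?): keep a stale dask-only choice
    selected (the UI would grey it out) but flag it as unusable."""
    if chosen == KMEANS_BAR and not data_is_dask:
        return chosen, True   # keep it chosen (disabled in the UI), warn
    return chosen, False

print(resolve_init_option(KMEANS_BAR, data_is_dask=False))  # (2, True)
```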
Codecov Report

```
@@            Coverage Diff            @@
##             dask    #6277     +/-  ##
=========================================
  Coverage   87.64%   87.65%
=========================================
  Files         322      322
  Lines       69601    69637      +36
=========================================
+ Hits        61002    61040      +38
+ Misses       8599     8597       -2
```
force-pushed from e194f30 to 42f9879
force-pushed from 3a1a379 to 2de0510
force-pushed from 42f9879 to 554aed6
force-pushed from 5571f8e to 8e15f49
```python
k = Setting(3)
k_from = Setting(2)
k_to = Setting(8)
optimize_k = Setting(False)
max_iterations = Setting(300)
n_init = Setting(10)
smart_init = Setting(0)  # KMeans++
```
This change, if we go with it, also requires settings migration. Just leaving it here as a note.
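For illustration, a hedged sketch of what such a migration might look like in Orange (the version number, and the assumption that adding "k-means||" shifts the init option indices, are mine, not the PR's):

```python
from Orange.widgets.widget import OWWidget

class OWKMeans(OWWidget):
    name = "k-Means"
    settings_version = 2

    @classmethod
    def migrate_settings(cls, settings, version):
        if version is None or version < 2:
            # Hypothetical: if "k-means||" were inserted into the init
            # option list, indices of the options after it shift by one.
            old = settings.get("smart_init", 0)
            settings["smart_init"] = old + 1 if old >= 1 else old
```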
force-pushed from 8e15f49 to a0acc61
force-pushed from a0acc61 to 1c7206a
force-pushed from 7ee6d02 to 0acd48b
Thanks. We need at least some basic tests for this.
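Something along these lines, perhaps (a hedged sketch; it exercises dask_ml's KMeans directly rather than the widget, and the class name is invented):

```python
import unittest

import dask.array as da
from dask_ml.cluster import KMeans

class TestDaskKMeans(unittest.TestCase):
    def test_fit_on_dask_array(self):
        # A small chunked array stands in for out-of-core data.
        X = da.random.random((1_000, 5), chunks=(200, 5))
        model = KMeans(n_clusters=3, init="k-means||").fit(X)
        self.assertEqual(model.cluster_centers_.shape, (3, 5))

if __name__ == "__main__":
    unittest.main()
```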
force-pushed from e01318f to b13b1d8
Add `dask_ml.cluster.KMeans` as an alternative to `sklearn.cluster.KMeans` for dask arrays.

Probably worth mentioning: dask_ml adds the "k-means||" init option but defaults to the sklearn implementation in the case of "k-means++" or "random" initialization. So using "k-means||" is what enables working with larger datasets at all. However, smaller datasets are still processed much faster just using sklearn directly, as there seems to be a lot of overhead with dask_ml.
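A hedged sketch of how that backend choice could look (this helper is illustrative, not the PR's actual code):

```python
import dask.array as da

def make_kmeans(X, n_clusters=3):
    """Pick the clustering backend by array type: only dask_ml's
    "k-means||" init scales to out-of-core dask arrays, while plain
    sklearn is faster on in-memory data."""
    if isinstance(X, da.Array):
        from dask_ml.cluster import KMeans
        return KMeans(n_clusters=n_clusters, init="k-means||")
    from sklearn.cluster import KMeans
    return KMeans(n_clusters=n_clusters, init="k-means++")
```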