[ENH] SubarrayComputeValue for faster domain transformation #6520
Timings are min / avg per loop, best of 3, listed in the order they were reported.

| Benchmark | Loops | WITH BRANCH (min / avg) | BEFORE (min / avg) |
|---|---|---|---|
| run_dask | 3 | 910 msec / 943 msec | 17 sec / 18 sec |
| run_dense | 3 | 858 msec / 888 msec | 1.76 sec / 1.83 sec |
| transform_dask | 3 | 44.4 msec / 44.9 msec | 3.67 sec / 3.72 sec |
| transform_dask_values | 3 | 258 msec / 407 msec | 1.55 sec / 1.57 sec |
| transform_dense | 3 | 600 msec / 629 msec | 1.98 sec / 1.99 sec |
| run_dask | 3 | 481 msec / 504 msec | 2.6 sec / 2.66 sec |
| run_dense | 3 | 669 msec / 695 msec | 1.08 sec / 1.08 sec |
| transform_dask | 3 | 31.7 msec / 31.8 msec | 2.08 sec / 2.08 sec |
| transform_dask_values | 3 | 327 msec / 349 msec | 1.02 sec / 1.04 sec |
| transform_dense | 3 | 342 msec / 365 msec | 763 msec / 765 msec |
| run_dask | 3 | 1.08 sec / 1.29 sec | 14.1 sec / 14.4 sec |
| run_dense | 3 | 1.31 sec / 1.34 sec | 1.95 sec / 1.98 sec |
| transform_dask | 3 | 45.6 msec / 46 msec | 3.74 sec / 3.76 sec |
| transform_dask_values | 3 | 430 msec / 589 msec | 1.51 sec / 1.6 sec |
| transform_dense | 3 | 583 msec / 639 msec | 1.91 sec / 1.93 sec |
| run_dask | 3 | 203 msec / 235 msec | 1.74 sec / 1.85 sec |
| run_dense | 3 | 476 msec / 529 msec | 1.01 sec / 1.02 sec |
| transform_dask | 3 | 30.4 msec / 31 msec | 1.6 sec / 1.63 sec |
| transform_dask_values | 3 | 85.1 msec / 174 msec | 1 sec / 1.02 sec |
| transform_dense | 3 | 262 msec / 271 msec | 846 msec / 865 msec |
| normalize_only_parameters | 5 | 53.4 msec / 54.7 msec | 55.5 msec / 55.8 msec |
| normalize_only_transform | 5 | 35.7 msec / 35.9 msec | 118 msec / 119 msec |
| sklimpute | 5 | 65.4 msec / 66.3 msec | 154 msec / 157 msec |
Codecov Report
Additional details and impacted files

Coverage Diff

| | dask | #6520 | +/- |
|---|---|---|---|
| Coverage | 87.67% | 87.70% | +0.03% |
| Files | 322 | 322 | |
| Lines | 69765 | 69937 | +172 |
| Hits | 61164 | 61336 | +172 |
| Misses | 8601 | 8601 | |
After consultation with @noahnovsak and @lanzagar I am merging this into the |
Issue
Orange performs domain transformation column by column. Even columns computed with `SharedComputeValue` are filled in one column at a time. This was especially slow for Dask.

Description of changes
This PR assigns columns in groups whenever it can. When columns are simply copied between tables, chunks of columns are now used instead of single ones. This PR also introduces `SubarrayComputeValue`, which computes a subset of columns at once. It is a `SharedComputeValue` with the limitation that the shared results cannot be post-processed further (which matches how `SharedComputeValue` was actually used most of the time).

I implemented normalization and imputation with `SubarrayComputeValue`. Speedups are around 2x with numpy tables and 10x or more with dask tables.

I did not test performance on sparse arrays, though. I have a feeling that it would have to be optimized.

Most of the code here is independent of dask and could be merged straight into master, but I think it is best tested on something less popular. :)
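
To illustrate the idea in the description, here is a minimal NumPy sketch of filling an output per column versus per column group. It is not the code from this PR and does not use Orange's API: `normalize_per_column`, `normalize_subarray`, `contiguous_runs` and `copy_in_chunks` are hypothetical names invented for this example, and normalization is assumed to be a plain `(x - offset) * factor`.

```python
import numpy as np


def normalize_per_column(X, cols, offsets, factors):
    """Baseline: fill the output one column at a time."""
    out = np.empty((X.shape[0], len(cols)))
    for j, (c, off, fac) in enumerate(zip(cols, offsets, factors)):
        out[:, j] = (X[:, c] - off) * fac
    return out


def normalize_subarray(X, cols, offsets, factors):
    """Grouped: select the column subset once, then transform it in one step."""
    sub = X[:, cols]                  # (n_rows, len(cols)) subarray
    return (sub - offsets) * factors  # offsets/factors broadcast over rows


def contiguous_runs(cols):
    """Split source column indices into runs of consecutive indices.

    Returns (target_start, source_start, length) triples, so each run
    can be copied with plain slices instead of single columns.
    """
    runs = []
    start = 0
    for i in range(1, len(cols) + 1):
        if i == len(cols) or cols[i] != cols[i - 1] + 1:
            runs.append((start, cols[start], i - start))
            start = i
    return runs


def copy_in_chunks(X, cols):
    """Copy columns between arrays chunk by chunk rather than one by one."""
    out = np.empty((X.shape[0], len(cols)), dtype=X.dtype)
    for target, source, length in contiguous_runs(cols):
        out[:, target:target + length] = X[:, source:source + length]
    return out


# Small usage example on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
cols = [2, 3, 4, 7, 0, 1]
offsets = X[:, cols].mean(axis=0)
factors = 1.0 / X[:, cols].std(axis=0)

a = normalize_per_column(X, cols, offsets, factors)
b = normalize_subarray(X, cols, offsets, factors)
copied = copy_in_chunks(X, cols)
```

Both normalization functions produce the same values; the grouped version simply issues one indexing operation and one arithmetic expression for the whole subset. With a lazy backend such as dask, fewer and larger operations generally mean a smaller task graph, which is one plausible reason the dask timings benefit more than the dense ones.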
Includes