-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Address incosistencies in DataScaling infrastructure #598
Address incosistencies in DataScaling infrastructure #598
Conversation
…his in the CI to check that nothing breaks
@elcorto I hope I did not miss anything, but I think this should address the issues you have raised. I would appreciate your opinion/review. |
It seems like you are still working on this PR? If so, please let me know when it is ready for a review. |
As of today I was actually finished with the PR from my side, except of course if I missed something you wanted implemented? |
I saw recent commits so I wanted to wait with the review, but OK then I'll look into it now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have one large comment that I think should be discussed and two small changes, the rest looks good, thanks a lot.
* ``feature-wise-standard``: Row Standardization (Scale to mean 0, standard deviation 1) | ||
* ``feature-wise-standard``: Standardization (Scale to mean 0, standard | ||
deviation 1) is applied to each feature dimension individually. I.e., if your | ||
training data has dimensions (x,y,z,f), then each of the f rows with (x,y,z) | ||
entries is scaled indiviually. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in #482, DataScaler
silently assumes a 2D array of shape (x * y * z, f)
since it does torch.mean(unscaled, 0, ...)
, so unless I'm overlooking something obvious, I think this explanation is actually misleading.
Consider this:
from sklearn.preprocessing import StandardScaler
from mala.datahandling.data_scaler import DataScaler
import torch as T
import einops
import numpy as np
from icecream import ic
reshape_to_2d = lambda x: einops.rearrange(x, "x y z f -> (x y z) f")
T.set_default_dtype(T.float64)
# sklearn's StandardScaler uses a biased estimator np.std(..., ddof=0), while
# torch uses T.std(..., correction=1) and there is no way to change this in
# StandardScaler, so we roll our own.
class MyStandardScaler:
@staticmethod
def fit_transform(X):
assert X.ndim == 2
x_mean = X.mean(0)[None, :]
x_std = np.std(X.numpy(), axis=0, ddof=1)[None, :]
return (X - x_mean) / x_std
with T.no_grad():
arr_4d = T.rand(2, 3, 4, 5)
arr_2d = reshape_to_2d(arr_4d)
##arr_2d_scaled_skl = StandardScaler().fit_transform(arr_2d)
arr_2d_scaled_skl = MyStandardScaler().fit_transform(arr_2d)
scaler_mala = DataScaler("feature-wise-standard")
arr_2d_scaled_mala = arr_2d.clone()
scaler_mala.fit(arr_2d_scaled_mala)
# in-place mod!
scaler_mala.transform(arr_2d_scaled_mala)
ic(arr_2d_scaled_skl[:5, :])
ic(arr_2d_scaled_mala[:5, :])
assert np.allclose(arr_2d_scaled_mala, arr_2d_scaled_skl, atol=1e-15)
# ValueError: Found array with dim 4. StandardScaler expected <= 2.
##arr_2d_scaled_skl = StandardScaler().fit_transform(arr_2d)
# This works but shouldn't
scaler_mala = DataScaler("feature-wise-standard")
arr_4d_scaled_mala = arr_4d.clone()
scaler_mala.fit(arr_4d_scaled_mala)
# in-place mod!
scaler_mala.transform(arr_4d_scaled_mala)
arr_2d_from_4d_scaled_mala = reshape_to_2d(arr_4d_scaled_mala)
assert not np.allclose(
arr_2d_from_4d_scaled_mala, arr_2d_scaled_skl, atol=1e-1
)
ic(arr_2d_from_4d_scaled_mala[:5, :])
results in:
ic| arr_2d_scaled_skl[:5, :]: tensor([[-0.8424, -0.3500, -1.3310, 0.6888, -0.2759],
[-0.1266, -1.1469, -0.2332, -0.3831, -0.7441],
[-1.4691, 1.4810, 0.2890, 0.3123, -1.2043],
[ 1.5169, -0.5278, -1.4609, 2.1362, 1.1646],
[-0.6087, 0.2314, 0.9764, -1.0315, 0.6824]])
ic| arr_2d_scaled_mala[:5, :]: tensor([[-0.8424, -0.3500, -1.3310, 0.6888, -0.2759],
[-0.1266, -1.1469, -0.2332, -0.3831, -0.7441],
[-1.4691, 1.4810, 0.2890, 0.3123, -1.2043],
[ 1.5169, -0.5278, -1.4609, 2.1362, 1.1646],
[-0.6087, 0.2314, 0.9764, -1.0315, 0.6824]])
ic| arr_2d_from_4d_scaled_mala[:5, :]: tensor([[ 0.7071, -0.7071, -0.7071, 0.7071, 0.7071],
[-0.7071, -0.7071, -0.7071, 0.7071, -0.7071],
[-0.7071, 0.7071, -0.7071, -0.7071, -0.7071],
[ 0.7071, -0.7071, -0.7071, 0.7071, 0.7071],
[ 0.7071, 0.7071, 0.7071, -0.7071, -0.7071]])
So the code doesn't fail when given a 4D array but produces wrong results. Therefore, all docs mentioning (x,y,z,f)
should be adapted. Also an array dimension check is probably a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aaaaaah, I see what you mean now, sorry for misunderstanding. I have adapted the documentation and added an array check. Let me know if there is still something missing!
* ``feature-wise-minmax``: Min-Max scaling (Scale to be in range 0...1) is | ||
applied to each feature dimension individually. I.e., if your training data | ||
has dimensions (x,y,z,f), then each of the f rows with (x,y,z) entries is | ||
scaled indiviually. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
mala/common/parameters.py
Outdated
standard deviation 1) is applied to each feature dimension | ||
individually. I.e., if your training data has dimensions | ||
(x,y,z,f), then each of the f rows with (x,y,z) entries is scaled | ||
indiviually. | ||
- "feature-wise-minmax": Row Min-Max scaling (Scale to be in range | ||
0...1) is applied to each feature dimension individually. | ||
I.e., if your training data has dimensions (x,y,z,f), then each | ||
of the f rows with (x,y,z) entries is scaled indiviually. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
mala/common/parameters.py
Outdated
individually. I.e., if your training data has dimensions | ||
(x,y,z,f), then each of the f rows with (x,y,z) entries is scaled | ||
indiviually. | ||
- "feature-wise-minmax": Row Min-Max scaling (Scale to be in range | ||
0...1) is applied to each feature dimension individually. | ||
I.e., if your training data has dimensions (x,y,z,f), then each | ||
of the f rows with (x,y,z) entries is scaled indiviually. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
mala/datahandling/data_scaler.py
Outdated
I.e., if your training data has dimensions (x,y,z,f), then each | ||
of the f rows with (x,y,z) entries is scaled indiviually. | ||
- "feature-wise-minmax": Min-Max scaling (Scale to be in range | ||
0...1) is applied to each feature dimension individually. | ||
I.e., if your training data has dimensions (x,y,z,f), then each | ||
of the f rows with (x,y,z) entries is scaled indiviually. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
Co-authored-by: Steve Schmerler <[email protected]>
Co-authored-by: Steve Schmerler <[email protected]>
I fixed the pipeline @elcorto after addressing your comments, could you let me know if this PR looks good from your side now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the doc update in DataScaler
and the 2d array check.
Two things that I stumbled upon:
-
There are still places (see https://github.com/mala-project/mala/pull/598/files) that mention
(x,y,z,f)
that you probably missed to catch, namely indocs/source/basic_usage/trainingmodel.rst
andmala/common/parameters.py
. -
The docs in
DataScaler
andmala/common/parameters.py
are identical. Maybe just keep them in one place instead, so for instance inparameters.py
link to :class:~mala.datahandling.data_handler.DataScaler
?
Co-authored-by: Steve Schmerler <[email protected]>
Co-authored-by: Steve Schmerler <[email protected]>
Thanks for the feedback and suggestions @elcorto , I implemented the changes and corrected (x,y,z) to (d) in the places you mentioned. I would keep the full doc strings in both the Parameters and DataScaler classes though. I think they are there for different audiences: a regular user who applies a model does not necessarily interact with the DataScaler class, so they need to be in the Parameters class. If a developer however wants to adapt the DataScaler code the proper doc string should be in the right place. I will add a note that changes in DataScaler should be propagated to the Parameters class though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK then we're all set, I guess. Thanks a lot!
This PR will close #482 and should also close #483.
sklearn
scalers