Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* first version of mean normalization
* augment coverage
* changes after review
* add new tests and fix after review
* second update after discussion
* add mean normalization to the docs
* improve docstrings
* devide _params into _mean and _var
* deleted formula from docstring
* add scaling into index
* fix flake8
* Update docs/index.rst Co-authored-by: Soledad Galli <[email protected]>
* Update docs/index.rst Co-authored-by: Soledad Galli <[email protected]>
* Update feature_engine/scaling/mean_normalization.py Co-authored-by: Soledad Galli <[email protected]>
* Update feature_engine/scaling/mean_normalization.py Co-authored-by: Soledad Galli <[email protected]>
* Update feature_engine/scaling/mean_normalization.py Co-authored-by: Soledad Galli <[email protected]>
* change to dictionaries
* update docs with demo
* fix
* fix
* fix
* minor rewording here and there

---------

Co-authored-by: Soledad Galli <[email protected]>
1 parent 3dcc864, commit ca28618
Showing 12 changed files with 588 additions and 0 deletions.
@@ -48,6 +48,7 @@ Other
   :maxdepth: 1

   preprocessing/index
   scaling/index
   wrappers/index

Pipeline
@@ -0,0 +1,6 @@
MeanNormalizationScaler
=======================

.. autoclass:: feature_engine.scaling.MeanNormalizationScaler
    :members:
@@ -0,0 +1,12 @@
.. -*- mode: rst -*-

Scaling
=======

Feature-engine's scaling transformers apply various scaling techniques to
given columns.

.. toctree::
   :maxdepth: 1

   MeanNormalizationScaler
@@ -0,0 +1,176 @@
.. _mean_normalization_scaler:

.. currentmodule:: feature_engine.scaling

MeanNormalizationScaler
=======================

:class:`MeanNormalizationScaler()` scales variables using mean normalization. With mean normalization,
we center the distribution around 0 and rescale it by the variable's value range,
so that its values lie between -1 and 1. This is accomplished by subtracting the mean of the feature
and then dividing by its range (i.e., the difference between the maximum and minimum values).

:class:`MeanNormalizationScaler()` only works with non-constant numerical variables.
If a variable is constant, its range is 0 and the scaler raises an error.
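
The arithmetic described above can be sketched in plain Python. This is a minimal illustration and not Feature-engine's implementation: ``mean_normalize`` is a hypothetical helper, and the ``ValueError`` it raises is an assumption standing in for whatever error the scaler actually raises on constant variables.

```python
def mean_normalize(values):
    """Scale values with mean normalization: (x - mean) / (max - min)."""
    value_range = max(values) - min(values)
    if value_range == 0:
        # A constant variable has zero range, so the division is undefined.
        # ValueError is an assumption made for this sketch.
        raise ValueError("Cannot scale a constant variable: its range is 0.")
    mean = sum(values) / len(values)
    return [(x - mean) / value_range for x in values]


ages = [20, 21, 19, 18]
scaled_ages = mean_normalize(ages)
print(scaled_ages)

# A constant variable triggers the guard.
constant_error = None
try:
    mean_normalize([5, 5, 5])
except ValueError as error:
    constant_error = str(error)
print(constant_error)
```

The scaled values are centered on 0, and the gap between the largest and smallest scaled value is exactly 1, which is why they always fall between -1 and 1.
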

Python example
--------------

We'll show how to use :class:`MeanNormalizationScaler()` with a toy dataset. Let's create
it first:

.. code:: python

    import pandas as pd
    from feature_engine.scaling import MeanNormalizationScaler

    df = pd.DataFrame.from_dict(
        {
            "Name": ["tom", "nick", "krish", "jack"],
            "City": ["London", "Manchester", "Liverpool", "Bristol"],
            "Age": [20, 21, 19, 18],
            "Height": [1.80, 1.77, 1.90, 2.00],
            "Marks": [0.9, 0.8, 0.7, 0.6],
            "dob": pd.date_range("2020-02-24", periods=4, freq="min"),
        }
    )

    print(df)

The dataset looks like this:

.. code:: python

        Name        City  Age  Height  Marks                 dob
    0    tom      London   20    1.80    0.9 2020-02-24 00:00:00
    1   nick  Manchester   21    1.77    0.8 2020-02-24 00:01:00
    2  krish   Liverpool   19    1.90    0.7 2020-02-24 00:02:00
    3   jack     Bristol   18    2.00    0.6 2020-02-24 00:03:00

We see that the only numerical features in this dataset are **Age**, **Marks**, and **Height**. We want
to scale them using mean normalization.

First, let's make a list with the variable names:

.. code:: python

    vars = [
        'Age',
        'Marks',
        'Height',
    ]

Now, let's set up :class:`MeanNormalizationScaler()`:

.. code:: python

    # set up the scaler
    scaler = MeanNormalizationScaler(variables=vars)

    # fit the scaler
    scaler.fit(df)

The scaler learns the mean of every column in *vars* and their respective range.
Note that we can access these values in the following way:

.. code:: python

    # access the parameters learned by the scaler
    print(f'Means: {scaler.mean_}')
    print(f'Ranges: {scaler.range_}')

We see the features' mean and value ranges in the following output:

.. code:: python

    Means: {'Age': 19.5, 'Marks': 0.7500000000000001, 'Height': 1.8675000000000002}
    Ranges: {'Age': 3.0, 'Marks': 0.30000000000000004, 'Height': 0.22999999999999998}
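
To see where these numbers come from, the learned parameters can be recomputed by hand. The sketch below uses plain Python dictionaries instead of the scaler; the names ``data``, ``means``, and ``ranges`` are illustrative only, and the last floating-point digits may differ slightly from the output above.

```python
# Numerical columns of the toy dataset.
data = {
    "Age": [20, 21, 19, 18],
    "Marks": [0.9, 0.8, 0.7, 0.6],
    "Height": [1.80, 1.77, 1.90, 2.00],
}

# The mean of each column.
means = {name: sum(column) / len(column) for name, column in data.items()}

# The value range (maximum minus minimum) of each column.
ranges = {name: max(column) - min(column) for name, column in data.items()}

print(f"Means: {means}")
print(f"Ranges: {ranges}")
```
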
We can now go ahead and scale the variables:

.. code:: python

    # scale the data
    df = scaler.transform(df)
    print(df)

In the following output, we can see the scaled variables:

.. code:: python

        Name        City        Age     Height      Marks                 dob
    0    tom      London   0.166667  -0.293478   0.500000 2020-02-24 00:00:00
    1   nick  Manchester   0.500000  -0.423913   0.166667 2020-02-24 00:01:00
    2  krish   Liverpool  -0.166667   0.141304  -0.166667 2020-02-24 00:02:00
    3   jack     Bristol  -0.500000   0.576087  -0.500000 2020-02-24 00:03:00
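
The scaled values in the table can be checked against the formula directly. The following is a hand-rolled sketch in plain Python, not a call into Feature-engine; it rounds to six decimals only to match the precision of the printed table.

```python
# Numerical columns of the toy dataset.
data = {
    "Age": [20, 21, 19, 18],
    "Height": [1.80, 1.77, 1.90, 2.00],
    "Marks": [0.9, 0.8, 0.7, 0.6],
}

scaled = {}
for name, column in data.items():
    mean = sum(column) / len(column)
    value_range = max(column) - min(column)
    # Subtract the mean, divide by the range, and round for display.
    scaled[name] = [round((x - mean) / value_range, 6) for x in column]

for name, column in scaled.items():
    print(name, column)
```
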
We can restore the data to its original values using the inverse transformation:

.. code:: python

    # inverse transform the dataframe
    df = scaler.inverse_transform(df)
    print(df)

In the following output, we see the scaled variables returned to their original representation:

.. code:: python

        Name        City  Age  Height  Marks                 dob
    0    tom      London   20    1.80    0.9 2020-02-24 00:00:00
    1   nick  Manchester   21    1.77    0.8 2020-02-24 00:01:00
    2  krish   Liverpool   19    1.90    0.7 2020-02-24 00:02:00
    3   jack     Bristol   18    2.00    0.6 2020-02-24 00:03:00
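
The inverse transformation simply reverses the two steps: multiply by the range, then add the mean. A minimal round-trip sketch, assuming hypothetical helpers ``mean_normalize`` and ``inverse_mean_normalize`` that are not part of Feature-engine:

```python
import math


def mean_normalize(values):
    """Return the scaled values plus the learned mean and range."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    scaled = [(x - mean) / value_range for x in values]
    return scaled, mean, value_range


def inverse_mean_normalize(scaled, mean, value_range):
    """Undo mean normalization: x = x_scaled * range + mean."""
    return [x * value_range + mean for x in scaled]


heights = [1.80, 1.77, 1.90, 2.00]
scaled, mean, value_range = mean_normalize(heights)
restored = inverse_mean_normalize(scaled, mean, value_range)

# Compare with a tolerance, since floating-point arithmetic is not exact.
round_trip_ok = all(
    math.isclose(original, back, rel_tol=1e-9)
    for original, back in zip(heights, restored)
)
print(restored)
print(round_trip_ok)
```

The comparison uses a tolerance because the restored values can differ from the originals in the last floating-point digits.
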

Additional resources
--------------------

For more details about this and other feature engineering methods check out
these resources:


.. figure::  ../../images/feml.png
   :width: 300
   :figclass: align-center
   :align: left
   :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning

   Feature Engineering for Machine Learning

|
|
|
|
|
|
|
|
|
|

Or read our book:

.. figure::  ../../images/cookbook.png
   :width: 200
   :figclass: align-center
   :align: left
   :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587

   Python Feature Engineering Cookbook

|
|
|
|
|
|
|
|
|
|
|
|
|

Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
@@ -0,0 +1,59 @@
.. -*- mode: rst -*-
.. _scaling_user_guide:

.. currentmodule:: feature_engine.scaling

Scaling
=======

`Feature scaling <https://www.blog.trainindata.com/feature-scaling-in-machine-learning/>`_
is the process of transforming the range of numerical features so that they fit within a
specific scale, usually to improve the performance and training stability of machine learning
models.

Scaling helps to normalize the input data, ensuring that each feature contributes proportionately
to the final result, particularly in algorithms that are sensitive to the range of the data,
such as gradient descent-based models (e.g., linear regression, logistic regression, neural networks)
and distance-based models (e.g., K-nearest neighbors, clustering).

Feature-engine's scalers replace the variables' values with the scaled ones. On this page, we
discuss the importance of scaling numerical features, and then introduce the various
scaling techniques supported by Feature-engine.

Importance of scaling
---------------------

Scaling matters in machine learning because it puts features on a comparable footing during model
training, preventing bias toward variables with larger ranges. Properly scaled data enhances the
performance of algorithms sensitive to the magnitude of input values, such as gradient descent
and distance-based methods. Additionally, scaling can improve convergence speed and overall model
accuracy, leading to more reliable predictions.
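
To make the bias toward large-range variables concrete, here is a small sketch with made-up numbers (the ages and incomes are purely illustrative). It finds the nearest neighbour of the first row by Euclidean distance, before and after mean normalization; ``mean_normalize`` and ``nearest_to_first`` are hypothetical helpers written for this example.

```python
import math


def mean_normalize(values):
    """Scale values with mean normalization: (x - mean) / (max - min)."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    return [(x - mean) / value_range for x in values]


def nearest_to_first(rows):
    """Return the index of the row closest to rows[0] (Euclidean distance)."""
    candidates = range(1, len(rows))
    return min(candidates, key=lambda i: math.dist(rows[0], rows[i]))


# Illustrative data: income spans thousands, age spans tens.
ages = [25, 60, 27, 40]
incomes = [50000, 50500, 52000, 58000]

raw_rows = list(zip(ages, incomes))
scaled_rows = list(zip(mean_normalize(ages), mean_normalize(incomes)))

nearest_raw = nearest_to_first(raw_rows)
nearest_scaled = nearest_to_first(scaled_rows)
print(nearest_raw, nearest_scaled)
```

On the raw data, income dominates the distance, so the 60-year-old with a similar income is picked as the closest match to the 25-year-old. After scaling, both features carry comparable weight and the 27-year-old becomes the nearest neighbour.
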


When to apply scaling
---------------------

- **Training:** Many machine learning algorithms benefit from scaling the data before training,
  especially linear models, neural networks, and distance-based models.

- **Feature Engineering:** Scaling can be essential for certain feature engineering techniques,
  like polynomial features.

- **Resampling:** Some oversampling methods like SMOTE, and many undersampling methods,
  rely on KNN algorithms, which are distance-based models.



When scaling is not necessary
-----------------------------

Not all algorithms require scaling. For example, tree-based algorithms (like Decision Trees,
Random Forests, Gradient Boosting) are generally invariant to scaling because they split data
based on the order of values, not the magnitude.
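
The invariance can be illustrated with a toy single-feature split search. This is a simplified sketch, not how any particular library implements trees: ``gini`` and ``best_split_left_indices`` are hypothetical helpers that try every cut point along the sorted feature and keep the one with the lowest weighted Gini impurity.

```python
def mean_normalize(values):
    """Scale values with mean normalization: (x - mean) / (max - min)."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    return [(x - mean) / value_range for x in values]


def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split_left_indices(values, labels):
    """Return the row indices sent to the left child by the best split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = None, None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right]))
        if best_score is None or score < best_score:
            best_score, best_left = score, set(left)
    return best_left


heights = [1.80, 1.77, 1.90, 2.00]
labels = [0, 0, 1, 1]

raw_left = best_split_left_indices(heights, labels)
scaled_left = best_split_left_indices(mean_normalize(heights), labels)
print(raw_left == scaled_left)
```

Because mean normalization preserves the ordering of the values, the same rows end up on each side of the split whether or not the feature is scaled.
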

Scalers
-------

.. toctree::
   :maxdepth: 1

   MeanNormalizationScaler
@@ -0,0 +1,10 @@
"""
The module scaling includes classes to transform variables using various
scaling methods.
"""

from .mean_normalization import MeanNormalizationScaler

__all__ = [
    "MeanNormalizationScaler",
]