Add MeanNormalisationScaler (#806)
* first version of mean normalization

* augment coverage

* changes after review

* add new tests and fix after review

* second update after discussion

* add mean normalization to the docs

* improve docstrings

* divide _params into _mean and _var

* deleted formula from docstring

* add scaling into index

* fix flake8

* Update docs/index.rst

Co-authored-by: Soledad Galli <[email protected]>

* Update docs/index.rst

Co-authored-by: Soledad Galli <[email protected]>

* Update feature_engine/scaling/mean_normalization.py

Co-authored-by: Soledad Galli <[email protected]>

* Update feature_engine/scaling/mean_normalization.py

Co-authored-by: Soledad Galli <[email protected]>

* Update feature_engine/scaling/mean_normalization.py

Co-authored-by: Soledad Galli <[email protected]>

* change to dictionaries

* update docs with demo

* fix

* fix

* fix

* minor rewording here and there

---------

Co-authored-by: Soledad Galli <[email protected]>
VascoSch92 and solegalli authored Oct 12, 2024
1 parent 3dcc864 commit ca28618
Showing 12 changed files with 588 additions and 0 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -68,6 +68,7 @@ Please share your story by answering 1 quick question
* Datetime Features
* Time Series
* Preprocessing
* Scaling
* Scikit-learn Wrappers

### Imputation Methods
@@ -110,6 +111,9 @@ Please share your story by answering 1 quick question
* BoxCoxTransformer
* YeoJohnsonTransformer

### Variable Scaling methods
* MeanNormalizationScaler

### Variable Creation:
* MathFeatures
* RelativeFeatures
1 change: 1 addition & 0 deletions docs/api_doc/index.rst
@@ -48,6 +48,7 @@ Other
:maxdepth: 1

preprocessing/index
scaling/index
wrappers/index

Pipeline
6 changes: 6 additions & 0 deletions docs/api_doc/scaling/MeanNormalizationScaler.rst
@@ -0,0 +1,6 @@
MeanNormalizationScaler
=======================

.. autoclass:: feature_engine.scaling.MeanNormalizationScaler
:members:

12 changes: 12 additions & 0 deletions docs/api_doc/scaling/index.rst
@@ -0,0 +1,12 @@
.. -*- mode: rst -*-

Scaling
=======

Feature-engine's scaling transformers apply various scaling techniques to the
given columns.

.. toctree::
:maxdepth: 1

MeanNormalizationScaler
10 changes: 10 additions & 0 deletions docs/index.rst
@@ -67,6 +67,7 @@ Feature-engine includes transformers for:
- Datetime features
- Time series
- Preprocessing
- Scaling

Feature-engine transformers are fully compatible with scikit-learn. That means that you can assemble Feature-engine
transformers within a Scikit-learn pipeline, or use them in a grid or random search for hyperparameters.
@@ -296,6 +297,15 @@ types and variable names match.
- :doc:`api_doc/preprocessing/MatchCategories`: ensures categorical variables are of type 'category'
- :doc:`api_doc/preprocessing/MatchVariables`: ensures that columns in test set match those in train set

Scaling:
~~~~~~~~

Scaling the data helps to balance the impact of all variables on the model, and can improve
its performance.

- :doc:`api_doc/scaling/MeanNormalizationScaler`: scale variables using mean normalization


Scikit-learn Wrapper:
~~~~~~~~~~~~~~~~~~~~~

1 change: 1 addition & 0 deletions docs/user_guide/index.rst
@@ -18,6 +18,7 @@ Transformation
discretisation/index
outliers/index
transformation/index
scaling/index

Creation
--------
176 changes: 176 additions & 0 deletions docs/user_guide/scaling/MeanNormalizationScaler.rst
@@ -0,0 +1,176 @@
.. _mean_normalization_scaler:

.. currentmodule:: feature_engine.scaling

MeanNormalizationScaler
=======================

:class:`MeanNormalizationScaler()` scales variables using mean normalization. With mean normalization,
we center the distribution around 0, and rescale the distribution to the variable's value range,
so that its values vary between -1 and 1. This is accomplished by subtracting the mean of the feature
and then dividing by its range (i.e., the difference between the maximum and minimum values).
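The arithmetic can be sketched in a couple of lines of plain pandas (an illustrative sketch only; the transformer does this per column, with input validation):

```python
import pandas as pd

# a toy numerical feature
x = pd.Series([20, 21, 19, 18], name="Age")

# mean normalization: subtract the mean, then divide by the value range
scaled = (x - x.mean()) / (x.max() - x.min())

print(scaled.tolist())
```

The result is centered on 0 and bounded between -1 and 1, because no value can deviate from the mean by more than the full range.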

The :class:`MeanNormalizationScaler()` only works with non-constant numerical variables.
If the variable is constant, the scaler will raise an error.
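The restriction on constant variables follows directly from the formula: a constant column has a value range of zero, so the divisor would be zero. A plain-pandas sketch of the degenerate case (not the scaler's actual error handling):

```python
import pandas as pd

# a constant column
x = pd.Series([5.0, 5.0, 5.0])

# its range is zero, so mean normalization would divide by zero
value_range = x.max() - x.min()

print(value_range)  # 0.0
```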

Python example
--------------

We'll show how to use :class:`MeanNormalizationScaler()` with a toy dataset. First, let's
create the data:

.. code:: python

    import pandas as pd

    from feature_engine.scaling import MeanNormalizationScaler

    df = pd.DataFrame.from_dict(
        {
            "Name": ["tom", "nick", "krish", "jack"],
            "City": ["London", "Manchester", "Liverpool", "Bristol"],
            "Age": [20, 21, 19, 18],
            "Height": [1.80, 1.77, 1.90, 2.00],
            "Marks": [0.9, 0.8, 0.7, 0.6],
            "dob": pd.date_range("2020-02-24", periods=4, freq="min"),
        }
    )

    print(df)

The dataset looks like this:

.. code:: python

        Name        City  Age  Height  Marks                 dob
    0    tom      London   20    1.80    0.9 2020-02-24 00:00:00
    1   nick  Manchester   21    1.77    0.8 2020-02-24 00:01:00
    2  krish   Liverpool   19    1.90    0.7 2020-02-24 00:02:00
    3   jack     Bristol   18    2.00    0.6 2020-02-24 00:03:00

We see that the only numerical features in this dataset are **Age**, **Marks**, and **Height**. We want
to scale them using mean normalization.

First, let's make a list with the variable names:

.. code:: python

    vars = [
        'Age',
        'Marks',
        'Height',
    ]

Now, let's set up :class:`MeanNormalizationScaler()`:

.. code:: python

    # set up the scaler
    scaler = MeanNormalizationScaler(variables=vars)

    # fit the scaler
    scaler.fit(df)

The scaler learns the mean and the value range of every column in *vars*.
We can access these learned parameters as follows:

.. code:: python

    # access the parameters learned by the scaler
    print(f'Means: {scaler.mean_}')
    print(f'Ranges: {scaler.range_}')

We see the features' means and value ranges in the following output:

.. code:: python

    Means: {'Age': 19.5, 'Marks': 0.7500000000000001, 'Height': 1.8675000000000002}
    Ranges: {'Age': 3.0, 'Marks': 0.30000000000000004, 'Height': 0.22999999999999998}

We can now go ahead and scale the variables:

.. code:: python

    # scale the data
    df = scaler.transform(df)

    print(df)

In the following output, we can see the scaled variables:

.. code:: python

        Name        City       Age    Height     Marks                 dob
    0    tom      London  0.166667 -0.293478  0.500000 2020-02-24 00:00:00
    1   nick  Manchester  0.500000 -0.423913  0.166667 2020-02-24 00:01:00
    2  krish   Liverpool -0.166667  0.141304 -0.166667 2020-02-24 00:02:00
    3   jack     Bristol -0.500000  0.576087 -0.500000 2020-02-24 00:03:00

We can restore the data to its original values using the inverse transformation:

.. code:: python

    # inverse transform the dataframe
    df = scaler.inverse_transform(df)

    print(df)

In the following output, we see the scaled variables returned to their original representation:

.. code:: python

        Name        City  Age  Height  Marks                 dob
    0    tom      London   20    1.80    0.9 2020-02-24 00:00:00
    1   nick  Manchester   21    1.77    0.8 2020-02-24 00:01:00
    2  krish   Liverpool   19    1.90    0.7 2020-02-24 00:02:00
    3   jack     Bristol   18    2.00    0.6 2020-02-24 00:03:00

Additional resources
--------------------

For more details about this and other feature engineering methods check out
these resources:


.. figure:: ../../images/feml.png
:width: 300
:figclass: align-center
:align: left
:target: https://www.trainindata.com/p/feature-engineering-for-machine-learning

Feature Engineering for Machine Learning

|
|
|
|
|
|
|
|
|
|
Or read our book:

.. figure:: ../../images/cookbook.png
:width: 200
:figclass: align-center
:align: left
:target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587

Python Feature Engineering Cookbook

|
|
|
|
|
|
|
|
|
|
|
|
|
Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
59 changes: 59 additions & 0 deletions docs/user_guide/scaling/index.rst
@@ -0,0 +1,59 @@
.. -*- mode: rst -*-

.. _scaling_user_guide:

.. currentmodule:: feature_engine.scaling

Scaling
=======

`Feature scaling <https://www.blog.trainindata.com/feature-scaling-in-machine-learning/>`_
is the process of transforming the range of numerical features so that they fit within a
specific scale, usually to improve the performance and training stability of machine learning
models.

Scaling helps to normalize the input data, ensuring that each feature contributes proportionately
to the final result, particularly in algorithms that are sensitive to the range of the data,
such as gradient descent-based models (e.g., linear regression, logistic regression, neural networks)
and distance-based models (e.g., K-nearest neighbors, clustering).

Feature-engine's scalers replace the variables' values with the scaled ones. On this page, we
discuss the importance of scaling numerical features, and then introduce the
scaling techniques supported by Feature-engine.

Importance of scaling
---------------------

Scaling is crucial in machine learning as it ensures that features contribute equally to model
training, preventing bias toward variables with larger ranges. Properly scaled data enhances the
performance of algorithms sensitive to the magnitude of input values, such as gradient descent
and distance-based methods. Additionally, scaling can improve convergence speed and overall model
accuracy, leading to more reliable predictions.
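To illustrate why magnitude matters for distance-based methods, here is a small NumPy sketch (the feature names and numbers are made up for the example):

```python
import numpy as np

# two samples described by (salary in dollars, age in years)
a = np.array([50_000.0, 25.0])
b = np.array([52_000.0, 60.0])

# without scaling, the Euclidean distance is dominated by salary,
# simply because salary spans a much larger range than age
raw_dist = np.linalg.norm(a - b)

# scale both features to [0, 1] using assumed min/max values
lo = np.array([30_000.0, 18.0])
hi = np.array([90_000.0, 70.0])
a_scaled = (a - lo) / (hi - lo)
b_scaled = (b - lo) / (hi - lo)

# after scaling, the large relative age gap drives the distance instead
scaled_dist = np.linalg.norm(a_scaled - b_scaled)

print(raw_dist)     # ~2000: essentially just the salary difference
print(scaled_dist)  # < 1: dominated by the age difference
```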


When to apply scaling
---------------------

- **Training:** Most machine learning algorithms require data to be scaled before training,
especially linear models, neural networks, and distance-based models.

- **Feature Engineering:** Scaling can be essential for certain feature engineering techniques,
like polynomial features.

- **Resampling:** Some oversampling methods, like SMOTE, and many undersampling methods
clean the data using KNN, which is a distance-based algorithm.


When Scaling Is Not Necessary
-----------------------------

Not all algorithms require scaling. For example, tree-based algorithms (like Decision Trees,
Random Forests, Gradient Boosting) are generally invariant to scaling because they split data
based on the order of values, not the magnitude.
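A quick NumPy check of that claim: mean normalization is a strictly increasing transformation, so the ordering of the values — which is all a tree split consults — is unchanged:

```python
import numpy as np

x = np.array([3.0, 120.0, 7.0, 45.0])

# mean normalization is an increasing affine transformation
x_scaled = (x - x.mean()) / (x.max() - x.min())

# the rank order of the samples is identical before and after scaling
print(np.argsort(x))         # [0 2 3 1]
print(np.argsort(x_scaled))  # [0 2 3 1]
```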

Scalers
-------

.. toctree::
:maxdepth: 1

MeanNormalizationScaler
10 changes: 10 additions & 0 deletions feature_engine/scaling/__init__.py
@@ -0,0 +1,10 @@
"""
The module scaling includes classes to transform variables using various
scaling methods.
"""

from .mean_normalization import MeanNormalizationScaler

__all__ = [
"MeanNormalizationScaler",
]