Add MeanNormalisationScaler (#806)

* first version of mean normalization
* augment coverage
* changes after review
* add new tests and fix after review
* second update after discussion
* add mean normalization to the docs
* improve docstrings
* divide _params into _mean and _var
* deleted formula from docstring
* add scaling into index
* fix flake8
* Update docs/index.rst Co-authored-by: Soledad Galli <[email protected]>
* Update docs/index.rst Co-authored-by: Soledad Galli <[email protected]>
* Update feature_engine/scaling/mean_normalization.py Co-authored-by: Soledad Galli <[email protected]>
* Update feature_engine/scaling/mean_normalization.py Co-authored-by: Soledad Galli <[email protected]>
* Update feature_engine/scaling/mean_normalization.py Co-authored-by: Soledad Galli <[email protected]>
* change to dictionaries
* update docs with demo
* fix
* fix
* fix
* minor rewording here and there

---------

Co-authored-by: Soledad Galli <[email protected]>
feature-engine · Oct 12, 2024 · ca28618
1 parent 3dcc864

Showing 12 changed files with 588 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -68,6 +68,7 @@ Please share your story by answering 1 quick question
 * Datetime Features
 * Time Series
 * Preprocessing
+* Scaling
 * Scikit-learn Wrappers
 
 ### Imputation Methods
@@ -110,6 +111,9 @@ Please share your story by answering 1 quick question
 * BoxCoxTransformer
 * YeoJohnsonTransformer
 
+### Variable Scaling methods
+* MeanNormalizationScaler
+
 ### Variable Creation:
  * MathFeatures
  * RelativeFeatures

diff --git a/docs/api_doc/index.rst b/docs/api_doc/index.rst
@@ -48,6 +48,7 @@ Other
    :maxdepth: 1
 
    preprocessing/index
+   scaling/index
    wrappers/index
 
 Pipeline

diff --git a/docs/api_doc/scaling/MeanNormalizationScaler.rst b/docs/api_doc/scaling/MeanNormalizationScaler.rst
@@ -0,0 +1,6 @@
+MeanNormalizationScaler
+=======================
+
+.. autoclass:: feature_engine.scaling.MeanNormalizationScaler
+    :members:
+
diff --git a/docs/api_doc/scaling/index.rst b/docs/api_doc/scaling/index.rst
@@ -0,0 +1,12 @@
+.. -*- mode: rst -*-
+
+Scaling
+=======
+
+Feature-engine's scaling transformers apply various scaling techniques to
+given columns.
+
+.. toctree::
+   :maxdepth: 1
+
+   MeanNormalizationScaler
diff --git a/docs/index.rst b/docs/index.rst
@@ -67,6 +67,7 @@ Feature-engine includes transformers for:
 - Datetime features
 - Time series
 - Preprocessing
+- Scaling
 
 Feature-engine transformers are fully compatible with scikit-learn. That means that you can assemble Feature-engine
 transformers within a Scikit-learn pipeline, or use them in a grid or random search for hyperparameters.
@@ -296,6 +297,15 @@ types and variable names match.
 - :doc:`api_doc/preprocessing/MatchCategories`: ensures categorical variables are of type 'category'
 - :doc:`api_doc/preprocessing/MatchVariables`: ensures that columns in test set match those in train set
 
+Scaling:
+~~~~~~~~
+
+Scaling the data can help to balance the impact of all variables on the model, and can improve 
+its performance.
+
+- :doc:`api_doc/scaling/MeanNormalizationScaler`: scale variables using mean normalization
+
+
 Scikit-learn Wrapper:
 ~~~~~~~~~~~~~~~~~~~~~
 

diff --git a/docs/user_guide/index.rst b/docs/user_guide/index.rst
@@ -18,6 +18,7 @@ Transformation
    discretisation/index
    outliers/index
    transformation/index
+   scaling/index
 
 Creation
 --------

diff --git a/docs/user_guide/scaling/MeanNormalizationScaler.rst b/docs/user_guide/scaling/MeanNormalizationScaler.rst
@@ -0,0 +1,176 @@
+.. _mean_normalization_scaler:
+
+.. currentmodule:: feature_engine.scaling
+
+MeanNormalizationScaler
+=======================
+
+:class:`MeanNormalizationScaler()` scales variables using mean normalization. With mean normalization,
+we center the distribution around 0, and rescale the distribution to the variable's value range,
+so that its values vary between -1 and 1. This is accomplished by subtracting the mean of the feature
+and then dividing by its range (i.e., the difference between the maximum and minimum values).
+
+The :class:`MeanNormalizationScaler()` only works with non-constant numerical variables.
+If the variable is constant, the scaler will raise an error.
+
+Python example
+--------------
+
+We'll show how to use :class:`MeanNormalizationScaler()` with a toy dataset. Let's create
+the dataset first:
+
+.. code:: python
+
+    import pandas as pd
+    from feature_engine.scaling import MeanNormalizationScaler
+
+    df = pd.DataFrame.from_dict(
+        {
+            "Name": ["tom", "nick", "krish", "jack"],
+            "City": ["London", "Manchester", "Liverpool", "Bristol"],
+            "Age": [20, 21, 19, 18],
+            "Height": [1.80, 1.77, 1.90, 2.00],
+            "Marks": [0.9, 0.8, 0.7, 0.6],
+            "dob": pd.date_range("2020-02-24", periods=4, freq="min"),
+        })
+
+    print(df)
+
+The dataset looks like this:
+
+.. code:: python
+
+        Name        City  Age  Height  Marks                 dob
+    0    tom      London   20    1.80    0.9 2020-02-24 00:00:00
+    1   nick  Manchester   21    1.77    0.8 2020-02-24 00:01:00
+    2  krish   Liverpool   19    1.90    0.7 2020-02-24 00:02:00
+    3   jack     Bristol   18    2.00    0.6 2020-02-24 00:03:00
+
+We see that the only numerical features in this dataset are **Age**, **Marks**, and **Height**. We want
+to scale them using mean normalization.
+
+First, let's make a list with the variable names:
+
+.. code:: python
+
+    variables = [
+        'Age',
+        'Marks',
+        'Height',
+    ]
+
+Now, let's set up :class:`MeanNormalizationScaler()`:
+
+.. code:: python
+
+    # set up the scaler
+    scaler = MeanNormalizationScaler(variables=variables)
+
+    # fit the scaler
+    scaler.fit(df)
+
+The scaler learns the mean of every column in *variables* and their respective range.
+Note that we can access these values in the following way:
+
+.. code:: python
+
+    # access the parameters learned by the scaler
+    print(f'Means: {scaler.mean_}')
+    print(f'Ranges: {scaler.range_}')
+
+We see the features' mean and value ranges in the following output:
+
+.. code:: python
+
+    Means: {'Age': 19.5, 'Marks': 0.7500000000000001, 'Height': 1.8675000000000002}
+    Ranges: {'Age': 3.0, 'Marks': 0.30000000000000004, 'Height': 0.22999999999999998}
+
+We can now go ahead and scale the variables:
+
+.. code:: python
+
+    # scale the data
+    df = scaler.transform(df)
+    print(df)
+
+In the following output, we can see the scaled variables:
+
+.. code:: python
+
+        Name        City       Age    Height     Marks                 dob
+    0    tom      London  0.166667 -0.293478  0.500000 2020-02-24 00:00:00
+    1   nick  Manchester  0.500000 -0.423913  0.166667 2020-02-24 00:01:00
+    2  krish   Liverpool -0.166667  0.141304 -0.166667 2020-02-24 00:02:00
+    3   jack     Bristol -0.500000  0.576087 -0.500000 2020-02-24 00:03:00
+
+We can restore the data to its original values using the inverse transformation:
+
+.. code:: python
+
+    # inverse transform the dataframe
+    df = scaler.inverse_transform(df)
+    print(df)
+
+In the following output, we see the scaled variables returned to their original representation:
+
+.. code:: python
+
+        Name        City  Age  Height  Marks                 dob
+    0    tom      London   20    1.80    0.9 2020-02-24 00:00:00
+    1   nick  Manchester   21    1.77    0.8 2020-02-24 00:01:00
+    2  krish   Liverpool   19    1.90    0.7 2020-02-24 00:02:00
+    3   jack     Bristol   18    2.00    0.6 2020-02-24 00:03:00
+
+
+Additional resources
+--------------------
+
+For more details about this and other feature engineering methods check out
+these resources:
+
+
+.. figure::  ../../images/feml.png
+   :width: 300
+   :figclass: align-center
+   :align: left
+   :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning
+
+   Feature Engineering for Machine Learning
+
+|
+|
+|
+|
+|
+|
+|
+|
+|
+|
+
+Or read our book:
+
+.. figure::  ../../images/cookbook.png
+   :width: 200
+   :figclass: align-center
+   :align: left
+   :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587
+
+   Python Feature Engineering Cookbook
+
+|
+|
+|
+|
+|
+|
+|
+|
+|
+|
+|
+|
+|
+
+Both our book and course are suitable for beginners and more advanced data scientists
+alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
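
Reviewer note: the transformation described in this user guide (subtract the mean, then divide by the range) can be checked by hand. The block below is a minimal pure-Python sketch under stated assumptions, not the code added in `feature_engine/scaling/mean_normalization.py`. The function names and the `ValueError` are hypothetical; the guide only states that constant variables raise an error. The data are the three numerical columns of the toy dataset.

```python
# Toy columns copied from the user guide in this diff.
data = {
    "Age": [20, 21, 19, 18],
    "Height": [1.80, 1.77, 1.90, 2.00],
    "Marks": [0.9, 0.8, 0.7, 0.6],
}


def fit_mean_normalization(columns):
    """Learn the mean and the value range (max - min) of each column."""
    means, ranges = {}, {}
    for name, values in columns.items():
        value_range = max(values) - min(values)
        if value_range == 0:
            # Assumption: the exception type is illustrative only.
            raise ValueError(f"Variable {name!r} is constant.")
        means[name] = sum(values) / len(values)
        ranges[name] = value_range
    return means, ranges


def transform(columns, means, ranges):
    """Subtract the mean of each column and divide by its range."""
    return {
        name: [(v - means[name]) / ranges[name] for v in values]
        for name, values in columns.items()
    }


def inverse_transform(scaled, means, ranges):
    """Undo the scaling: multiply by the range and add the mean back."""
    return {
        name: [v * ranges[name] + means[name] for v in values]
        for name, values in scaled.items()
    }


means, ranges = fit_mean_normalization(data)
scaled = transform(data, means, ranges)
restored = inverse_transform(scaled, means, ranges)

print({name: [round(v, 6) for v in values] for name, values in scaled.items()})
```

Up to floating-point rounding, the printed values should agree with the scaled table in the guide, and `restored` should match the original columns.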
diff --git a/docs/user_guide/scaling/index.rst b/docs/user_guide/scaling/index.rst
@@ -0,0 +1,59 @@
+.. -*- mode: rst -*-
+.. _scaling_user_guide:
+
+.. currentmodule:: feature_engine.scaling
+
+Scaling
+=======
+
+`Feature scaling <https://www.blog.trainindata.com/feature-scaling-in-machine-learning/>`_
+is the process of transforming the range of numerical features so that they fit within a
+specific scale, usually to improve the performance and training stability of machine learning
+models.
+
+Scaling helps to normalize the input data, ensuring that each feature contributes proportionately
+to the final result, particularly in algorithms that are sensitive to the range of the data,
+such as gradient descent-based models (e.g., linear regression, logistic regression, neural networks)
+and distance-based models (e.g., K-nearest neighbors, clustering).
+
+Feature-engine's scalers replace the variables' values with the scaled ones. On this page, we
+discuss the importance of scaling numerical features, and then introduce the various
+scaling techniques supported by Feature-engine.
+
+Importance of scaling
+---------------------
+
+Scaling is crucial in machine learning as it ensures that features contribute equally to model
+training, preventing bias toward variables with larger ranges. Properly scaled data enhances the
+performance of algorithms sensitive to the magnitude of input values, such as gradient descent
+and distance-based methods. Additionally, scaling can improve convergence speed and overall model
+accuracy, leading to more reliable predictions.
+
+
+When to apply scaling
+---------------------
+
+- **Training:** Many machine learning algorithms benefit from scaling the data before training,
+  especially linear models, neural networks, and distance-based models.
+
+- **Feature Engineering:** Scaling can be essential for certain feature engineering techniques,
+  like polynomial features.
+
+- **Resampling:** Some oversampling methods, like SMOTE, and many undersampling methods
+  rely on KNN algorithms, which are distance-based models.
+
+
+When scaling is not necessary
+-----------------------------
+
+Not all algorithms require scaling. For example, tree-based algorithms (like Decision Trees,
+Random Forests, Gradient Boosting) are generally invariant to scaling because they split data
+based on the order of values, not the magnitude.
+
+Scalers
+-------
+
+.. toctree::
+   :maxdepth: 1
+
+   MeanNormalizationScaler
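
Reviewer note: the two claims in this page (distance-based models are sensitive to feature ranges, tree-based models are not) can be illustrated with a small sketch. Everything in it is made up for illustration: the two observations, the training means, and the ranges are hypothetical values, not taken from the commit.

```python
import math

# Two hypothetical observations: (income in dollars, age in years).
a = (52_000.0, 25.0)
b = (50_000.0, 60.0)

# Hypothetical training statistics used for mean normalization.
means = (51_000.0, 40.0)
ranges = (40_000.0, 50.0)


def mean_normalize(point):
    """Apply (value - mean) / range to each coordinate."""
    return tuple((v - m) / r for v, m, r in zip(point, means, ranges))


scaled_a, scaled_b = mean_normalize(a), mean_normalize(b)

raw_distance = math.dist(a, b)
scaled_distance = math.dist(scaled_a, scaled_b)

# Share of the squared Euclidean distance contributed by income.
raw_income_share = (a[0] - b[0]) ** 2 / raw_distance ** 2
scaled_income_share = (scaled_a[0] - scaled_b[0]) ** 2 / scaled_distance ** 2

print(f"income share of squared distance, raw: {raw_income_share:.4f}")
print(f"income share of squared distance, scaled: {scaled_income_share:.4f}")

# Dividing by a positive range is a monotonic transformation, so the ordering
# of values is preserved. Threshold-based tree splits depend only on this order.
incomes = [50_000.0, 52_000.0, 31_000.0, 70_000.0]
scaled_incomes = [(v - means[0]) / ranges[0] for v in incomes]
order_raw = sorted(range(len(incomes)), key=incomes.__getitem__)
order_scaled = sorted(range(len(incomes)), key=scaled_incomes.__getitem__)
same_order = order_raw == order_scaled
print(f"ordering preserved: {same_order}")
```

Before scaling, the income difference dominates the distance even though the age difference is large in relative terms. After scaling, each feature contributes in proportion to its position within its own range.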
diff --git a/feature_engine/scaling/__init__.py b/feature_engine/scaling/__init__.py
@@ -0,0 +1,10 @@
+"""
+The module scaling includes classes to transform variables using various
+scaling methods.
+"""
+
+from .mean_normalization import MeanNormalizationScaler
+
+__all__ = [
+    "MeanNormalizationScaler",
+]