Machine Learning Feature Engineering Techniques.
A typical machine learning project involves the following steps.
Data Analysis:
Feature Creation:
- Missing data imputation
- Categorical variable encoding
- Numerical variable transformation
- Discretization
- Outlier handling
- Feature scaling
- Engineering of datetime variables (extracting features from dates)
- Engineering of mixed numerical and categorical variables
- Engineering of coordinates (GIS data)
- Feature extraction from text
- Feature extraction from images
- Feature extraction from time series
- New feature creation by combining existing variables (see the sketch after this list)
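As a minimal sketch of the last item, assuming a pandas DataFrame with two hypothetical columns, income and household_size, a new feature can be created by combining existing variables:

```python
import pandas as pd

# Hypothetical data; the column names are illustrative only.
df = pd.DataFrame({
    "income": [52000, 64000, 31000],
    "household_size": [2, 4, 1],
})

# New feature created by combining two existing variables.
df["income_per_person"] = df["income"] / df["household_size"]
print(df)
```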
Missing data imputation:
- mean imputation
- median imputation
- mode imputation
- arbitrary value imputation
- end-of-tail imputation
- random sample imputation
- multivariate imputation
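A possible sketch with scikit-learn on a toy array: SimpleImputer covers mean, median, mode and arbitrary-value imputation, while IterativeImputer illustrates multivariate imputation. End-of-tail and random sample imputation are not shown here; they are usually hand-rolled with pandas or taken from a dedicated library such as Feature-engine.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])  # toy data with missing values

mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)
median_imputed = SimpleImputer(strategy="median").fit_transform(X)
mode_imputed = SimpleImputer(strategy="most_frequent").fit_transform(X)
arbitrary_imputed = SimpleImputer(strategy="constant", fill_value=-999).fit_transform(X)

# Multivariate imputation: each feature with missing values is modelled from the others.
multivariate_imputed = IterativeImputer(random_state=0).fit_transform(X)
```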
Categorical variable encoding:
- one-hot encoding
- ordinal encoding
- mean (target) encoding
- weight of evidence encoding
- binarization
- feature hashing
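A sketch using the Category Encoders package mentioned at the end of this document; the colour column and the binary target y are made up for illustration, and the encoder class names are assumed from that package's API.

```python
import pandas as pd
import category_encoders as ce

X = pd.DataFrame({"colour": ["red", "blue", "blue", "green"]})
y = pd.Series([1, 0, 1, 0])  # binary target for the supervised encoders

one_hot = ce.OneHotEncoder(cols=["colour"]).fit_transform(X)
ordinal = ce.OrdinalEncoder(cols=["colour"]).fit_transform(X)
mean_enc = ce.TargetEncoder(cols=["colour"]).fit_transform(X, y)   # mean / target encoding
woe_enc = ce.WOEEncoder(cols=["colour"]).fit_transform(X, y)       # weight of evidence
binary_enc = ce.BinaryEncoder(cols=["colour"]).fit_transform(X)    # binarization
hashed = ce.HashingEncoder(cols=["colour"], n_components=4).fit_transform(X)  # feature hashing
```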
Numerical variable transformation:
- logarithmic
- reciprocal
- exponential
- Box-Cox
- Yeo-Johnson
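A hedged sketch with NumPy and scikit-learn on strictly positive toy data (Box-Cox requires positive values); the exponential transformation is shown here as a power transform with an illustrative exponent of 0.5.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

X = np.array([[1.0], [10.0], [100.0]])  # strictly positive toy data

log_t = FunctionTransformer(np.log).fit_transform(X)                # logarithmic
reciprocal_t = FunctionTransformer(np.reciprocal).fit_transform(X)  # reciprocal
power_t = FunctionTransformer(lambda x: x ** 0.5).fit_transform(X)  # exponential / power
box_cox = PowerTransformer(method="box-cox").fit_transform(X)       # Box-Cox
yeo_johnson = PowerTransformer(method="yeo-johnson").fit_transform(X)  # Yeo-Johnson
```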
Variable discretization:
- equal-width discretization
- equal-frequency discretization
- k-means discretization
- decision tree discretization
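A possible sketch: KBinsDiscretizer in scikit-learn covers the first three strategies directly, and a small decision tree can stand in for tree-based discretization by replacing the raw feature with the tree's leaf predictions (the toy data and target are invented for illustration).

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeRegressor

X = np.random.RandomState(0).exponential(size=(100, 1))  # skewed toy feature
y = np.random.RandomState(1).normal(size=100)            # toy target for the tree

equal_width = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform").fit_transform(X)
equal_freq = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile").fit_transform(X)
kmeans_bins = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="kmeans").fit_transform(X)

# Decision-tree discretization: the feature is replaced by the tree's leaf outputs.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
tree_bins = tree.predict(X)
```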
Outlier handling:
- trimming
- capping
- Winsorization
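A minimal sketch with NumPy and SciPy, using the 5th and 95th percentiles as illustrative cut-offs:

```python
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 100.0])  # toy data with one extreme value

# Trimming: drop observations outside the chosen quantile bounds.
lower, upper = np.quantile(x, [0.05, 0.95])
trimmed = x[(x >= lower) & (x <= upper)]

# Capping: clip values to the same bounds instead of dropping them.
capped = np.clip(x, lower, upper)

# Winsorization: replace the extreme tails with the nearest retained values.
winsorized = winsorize(x, limits=[0.05, 0.05])
```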
Feature Scaling:
- standardization
- MinMax scaling
- robust scaling
- norm scaling
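A sketch of the four scalers using scikit-learn on a toy matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 10000.0]])  # toy data

standardized = StandardScaler().fit_transform(X)    # zero mean, unit variance
minmax = MinMaxScaler().fit_transform(X)            # rescale each feature to [0, 1]
robust = RobustScaler().fit_transform(X)            # median and IQR, less sensitive to outliers
unit_norm = Normalizer(norm="l2").fit_transform(X)  # scale each observation to unit norm
```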
Engineering of datetime variables:
- extracting features from the day, month, and year parts, and capturing elapsed time, including across different time zones.
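A possible pandas sketch; the purchased_at column, the reference date and the time zones are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"purchased_at": pd.to_datetime(
    ["2021-01-15 08:30", "2021-03-02 17:45", "2021-07-30 23:10"])})

# Extract features from the date and time parts.
df["year"] = df["purchased_at"].dt.year
df["month"] = df["purchased_at"].dt.month
df["day"] = df["purchased_at"].dt.day
df["day_of_week"] = df["purchased_at"].dt.dayofweek
df["hour"] = df["purchased_at"].dt.hour

# Capture elapsed time relative to a reference date, in days.
df["days_since_purchase"] = (pd.Timestamp("2021-12-31") - df["purchased_at"]).dt.days

# Work across time zones by localizing and converting.
df["purchased_utc"] = df["purchased_at"].dt.tz_localize("UTC")
df["purchased_berlin"] = df["purchased_utc"].dt.tz_convert("Europe/Berlin")
```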
Engineering of mixed numerical and categorical variables:
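As a hedged sketch, a mixed variable such as a hypothetical cabin code combines a categorical prefix with a numerical part, and the two can be split into separate features with pandas:

```python
import pandas as pd

# Hypothetical mixed variable: a categorical prefix followed by a number.
df = pd.DataFrame({"cabin": ["A23", "B145", "C2", "A7"]})

# Split the mixed variable into a categorical part and a numerical part.
df["cabin_letter"] = df["cabin"].str.extract(r"([A-Za-z]+)", expand=False)
df["cabin_number"] = df["cabin"].str.extract(r"(\d+)", expand=False).astype(float)
```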
Code implementations are compared across different open-source Python packages, such as scikit-learn and Category Encoders.
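As a small illustration of such a comparison (assuming scikit-learn 1.2 or later, where the dense-output argument is sparse_output), the same one-hot encoding can be produced with either package; scikit-learn returns a NumPy array while Category Encoders returns a labelled DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce

X = pd.DataFrame({"colour": ["red", "blue", "green", "blue"]})

# scikit-learn: NumPy array output; column names via get_feature_names_out().
sk_encoded = OneHotEncoder(sparse_output=False).fit_transform(X)

# Category Encoders: pandas DataFrame output with named columns.
ce_encoded = ce.OneHotEncoder(cols=["colour"]).fit_transform(X)
```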