How to select variables in a data set and build simpler, faster, more reliable and more interpretable ML models

MvMukesh/FeatureSelection-Framework-ML

FeatureSelection-Framework-ML


Why Do We Select Features?

  • Models with fewer features are easier for software developers to implement in production
  • Better generalisation through reduced overfitting
  • Reduced risk of data errors during model use
  • Simpler models are easier to interpret
  • Shorter training time
  • Less data redundancy

Why Reduce Features for Model Deployment?

  • Smaller JSON messages sent to the model
    • JSON messages contain only the necessary variables / inputs
  • Fewer lines of code for error handling
    • Error handlers need to be written for each variable / input
  • Less feature engineering code
  • Less information to log

How to Make Feature Selection Part of the Pipeline?

Feature selection can run as a step inside the pipeline, but it is better to select the features before building the pipeline and then make the list of selected features part of the pipeline we want to deploy.
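As a minimal sketch of that idea (not the repository's own code), the example below assumes scikit-learn and pandas, uses toy data, and treats the column names and the `SELECTED_FEATURES` list as hypothetical outputs of an earlier, offline selection step. The deployed pipeline only needs the fixed list, not the selection logic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy data; the column names are hypothetical.
X_arr, y = make_classification(n_samples=200, n_features=6,
                               n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"var_{i}" for i in range(6)])

# Assumption: this list was produced earlier by an offline selection step.
SELECTED_FEATURES = ["var_0", "var_2", "var_4"]


def keep_selected(df):
    """Keep only the pre-selected columns, in a fixed order."""
    return df[SELECTED_FEATURES]


pipe = Pipeline([
    ("select", FunctionTransformer(keep_selected)),  # fixed feature list
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)

# The model only ever sees the selected columns.
n_used = pipe.named_steps["model"].coef_.shape[1]
```

Because the list is fixed ahead of time, incoming requests only need to carry those columns, which matches the deployment benefits listed above.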


| Feature Selection Method | Nature | Pros | Cons |
|---|---|---|---|
| Filter Methods | Independent of the ML algorithm; based only on variable characteristics | Quick feature removal; model agnostic; fast computation | Do not capture redundancy; do not capture feature interaction; can give poorer model performance than the other methods |
| Wrapper Methods / Greedy Algorithms | Consider the ML algorithm; evaluate subsets / groups of features | Consider feature interaction; best performance; best feature subset for a given algorithm | Not model agnostic (the features they find may not be the best for a different algorithm); computationally expensive; often impracticable |
| Embedded Methods | Feature selection happens during training of the ML algorithm | Good model performance; capture feature interaction; better than filter methods; faster than wrapper methods | Not model agnostic |

  1. Feature Selection Methods
  • Filter Methods
    • Variance
    • Correlation
    • Univariate Selection
  • Wrapper Methods
    • Forward Feature Selection
    • Backward Feature Elimination
    • Exhaustive Search
  • Embedded / Hybrid Methods
    • LASSO
    • Tree Importance
  • Moving Forward
Feature Selection Methods: Code + Blog Link | Video Link
  1. Feature Selection -- Basic Methods
  • Removing
    • Constant Features
    • Quasi-Constant Features
    • Duplicated Features
Feature Selection -- Basic Methods: Code + Blog Link | Video Link
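The three basic removals above can be sketched as follows. This is an illustrative example on a made-up DataFrame, assuming pandas and scikit-learn; the helper name `drop_basic` and the 0.9 dominant-value share used to define "quasi-constant" are assumptions, not values taken from the repository.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up data: one useful column, one constant, one quasi-constant,
# and one exact duplicate of the useful column.
X = pd.DataFrame({
    "a": list(range(20)),
    "constant": [7] * 20,
    "quasi": [0] * 19 + [1],   # 95% of rows share the same value
    "a_copy": list(range(20)),
})


def drop_basic(X, quasi_share=0.9):
    """Drop constant, quasi-constant and duplicated columns (hypothetical helper)."""
    # 1) Constant features: zero variance.
    vt = VarianceThreshold(threshold=0.0).fit(X)
    X = X.loc[:, vt.get_support()]

    # 2) Quasi-constant features: the most frequent value covers
    #    at least `quasi_share` of the rows.
    dominant = X.apply(lambda col: col.value_counts(normalize=True).iloc[0])
    X = X.loc[:, dominant < quasi_share]

    # 3) Duplicated features: identical columns, keeping the first one.
    X = X.loc[:, ~X.T.duplicated()]
    return X


reduced = drop_basic(X)
kept = list(reduced.columns)
```

Transposing to find duplicates is simple but can be slow on wide data sets, so a pairwise column comparison may be preferable there.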
  1. Feature Selection -- Correlation
  • Removing Correlated Features
  • Basic Selection Methods + Correlation -> Pipeline
Feature Selection -- Correlation: Code + Blog Link | Video Link
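One common way to remove correlated features is to scan the upper triangle of the absolute correlation matrix and drop the later column of every highly correlated pair. The sketch below assumes pandas and NumPy; the helper name `drop_correlated`, the toy columns and the 0.8 threshold are illustrative choices.

```python
import numpy as np
import pandas as pd


def drop_correlated(X, threshold=0.8):
    """Drop the later column of each pair whose |Pearson r| exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop), to_drop


rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x1_noisy": x1 + 0.01 * rng.normal(size=200),  # near copy of x1
    "x2": x2,                                      # independent of x1
})

X_reduced, dropped = drop_correlated(X, threshold=0.8)
```

This keeps whichever column of a pair appears first. An alternative design is to keep the column that is more predictive of the target, which needs a model or a univariate score.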

Filter Methods

  1. Univariate Statistical Methods
  • Mutual Information
  • Chi-square test
  • ANOVA
  • Basic Selection Methods + Statistical Methods -> Pipeline
Univariate Statistical Methods -- Filter Method: Code + Blog Link | Video Link
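The univariate statistical filters listed above map onto scoring functions in scikit-learn. The sketch below is one possible arrangement on synthetic data; `k=3` is an arbitrary choice, and rescaling continuous features to [0, 1] before `chi2` is only a workaround for its non-negativity requirement (chi-square is really meant for counts or categorical data).

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, chi2, f_classif,
                                       mutual_info_classif)
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# ANOVA F-test between each feature and the class label.
anova = SelectKBest(score_func=f_classif, k=3).fit(X, y)

# Mutual information; random_state is fixed so the scores are repeatable.
mi_score = partial(mutual_info_classif, random_state=0)
mutual_info = SelectKBest(score_func=mi_score, k=3).fit(X, y)

# Chi-square needs non-negative inputs, hence the rescaling (a workaround).
X_pos = MinMaxScaler().fit_transform(X)
chi_square = SelectKBest(score_func=chi2, k=3).fit(X_pos, y)

# Boolean masks of the columns each test would keep.
anova_mask = anova.get_support()
mi_mask = mutual_info.get_support()
chi_mask = chi_square.get_support()
```

The three tests can disagree on which columns to keep, since each one measures a different kind of dependence between a feature and the target.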
  1. Other Methods and Metrics
  • Univariate ROC-AUC, MSE, etc.
  • Method used in a KDD competition (2009)
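A univariate ROC-AUC filter can be sketched by training a small model on each feature alone and ranking features by its held-out score. The example below assumes scikit-learn; the shallow decision tree, the 0.55 cut-off and the synthetic data are illustrative assumptions, and it does not attempt to reproduce the KDD 2009 method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

aucs = []
for j in range(X.shape[1]):
    # One shallow tree per feature, trained on that feature alone.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_train[:, [j]], y_train)
    proba = tree.predict_proba(X_test[:, [j]])[:, 1]
    aucs.append(roc_auc_score(y_test, proba))
aucs = np.array(aucs)

# Keep features whose single-feature model beats the cut-off (assumed value).
selected = np.where(aucs > 0.55)[0]
```

For a regression target the same loop works with a regressor and a metric such as MSE, keeping the features with the lowest error.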

Wrapper Methods

  1. Wrapper Methods
  • Forward Feature Selection
  • Backward Feature Selection
  • Exhaustive Feature Selection
Wrapper Methods -- Feature Selection: Code + Blog Link | Video Link
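The three wrapper strategies can be sketched with scikit-learn. Forward and backward selection use `SequentialFeatureSelector`; the exhaustive search is written by hand with `itertools.combinations`, and `exhaustive_search` is a hypothetical helper name. The data set is kept tiny on purpose, since exhaustive search grows combinatorially with the number of features.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
model = LogisticRegression(max_iter=1000)

# Forward: start empty, add the feature that helps the CV score most.
forward = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="forward", cv=3).fit(X, y)

# Backward: start with all features, remove the least useful one each round.
backward = SequentialFeatureSelector(
    model, n_features_to_select=3, direction="backward", cv=3).fit(X, y)


def exhaustive_search(model, X, y, k, cv=3):
    """Score every subset of size k and return the best (hypothetical helper)."""
    best_score, best_subset = -np.inf, None
    for subset in combinations(range(X.shape[1]), k):
        score = cross_val_score(model, X[:, list(subset)], y, cv=cv).mean()
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score


best_subset, best_score = exhaustive_search(model, X, y, k=3)
```

With 6 features and subsets of size 3 the exhaustive search evaluates 20 subsets; with 30 features it would be 4060, which is why wrapper methods are often impracticable on wide data.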

Embedded Methods

  1. Linear Model Coefficients
  • Logistic Regression Coefficients
  • Linear Regression Coefficients
  • Effect of Regularization on Coefficients
  • Basic Selection Methods + Correlation + Embedded -> Pipeline
Linear Model Coefficients: Code + Blog Link | Video Link
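Selection by linear model coefficients can be sketched with `SelectFromModel`, which by default keeps the features whose absolute coefficient is at least the mean absolute coefficient (for estimators without an L1 penalty). The example assumes scikit-learn and synthetic data; the values of `C` are arbitrary. Features are standardised first so that coefficient magnitudes are comparable, and the last lines illustrate how stronger regularisation shrinks the coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Large C means very weak regularisation, close to a plain logistic regression.
selector = SelectFromModel(LogisticRegression(C=1000, max_iter=1000))
selector.fit(X_scaled, y)
mask = selector.get_support()
coefs = np.abs(selector.estimator_.coef_).ravel()

# Effect of regularisation: a small C (strong L2 penalty) shrinks coefficients.
weak = LogisticRegression(C=1000, max_iter=1000).fit(X_scaled, y)
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X_scaled, y)
norm_weak = np.linalg.norm(weak.coef_)
norm_strong = np.linalg.norm(strong.coef_)
```

Because regularisation changes the coefficient magnitudes, the selected set can depend on the penalty strength as well as on the data.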
  1. Lasso
  • Lasso
  • Basic Selection Methods + Correlation + Lasso -> Pipeline
Lasso: Code + Blog Link | Video Link
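Lasso selects features because its L1 penalty drives some coefficients to exactly zero. A minimal sketch, assuming scikit-learn, a synthetic regression problem and an arbitrary `alpha=1.0`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# alpha controls the strength of the L1 penalty (value chosen arbitrarily).
selector = SelectFromModel(Lasso(alpha=1.0))
selector.fit(X_scaled, y)

mask = selector.get_support()          # True for features that survive
coefs = selector.estimator_.coef_      # fitted Lasso coefficients
X_selected = selector.transform(X_scaled)
```

A larger `alpha` removes more features. In practice the value is usually tuned by cross-validation, for example with `LassoCV`.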
  1. Tree Importance
  • Random Forest derived Feature Importance
  • Tree importance + Recursive Feature Elimination
  • Basic Selection Methods + Correlation + Tree importance -> Pipeline
Tree Importance: Code + Blog Link | Video Link
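The two tree-based options above can be sketched with scikit-learn: a single pass that keeps features with above-average random forest importance, and recursive feature elimination (RFE), which refits the forest after each removal. The forest size and the target of 3 features are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Single pass: keep features whose importance is at least the mean importance.
by_importance = SelectFromModel(forest).fit(X, y)
importance_mask = by_importance.get_support()

# Recursive elimination: drop the least important feature, refit, repeat.
rfe = RFE(forest, n_features_to_select=3, step=1).fit(X, y)
rfe_mask = rfe.get_support()
ranking = rfe.ranking_   # 1 marks the features that were kept
```

Refitting after each removal matters when features are correlated, because importance is shared between correlated features and shifts once one of them is dropped.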

Hybrid Methods

  1. Hybrid Methods
  • Feature Shuffling
  • Recursive Feature Elimination
  • Recursive Feature Addition
Hybrid Methods: Code + Blog Link | Video Link
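Feature shuffling can be sketched with scikit-learn's `permutation_importance`: each feature is shuffled in turn on held-out data, and the drop in the model's score measures how much the model relies on it. The model, the ROC-AUC metric and the 0.01 cut-off below are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column 5 times and record the average drop in ROC-AUC.
result = permutation_importance(model, X_test, y_test, scoring="roc_auc",
                                n_repeats=5, random_state=0)
drops = result.importances_mean

# Keep features whose shuffling costs more than the cut-off (assumed value).
selected = np.where(drops > 0.01)[0]
```

Unlike recursive elimination or addition, shuffling trains the model only once, so it is cheaper; the trade-off is that correlated features can mask each other's importance.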