Dear all
Of all the filters for MSDataset objects, varianceRatio should be applied with caution, especially if the data will later be processed with machine learning methods that involve a train/test split.
Reasoning: In ML, all feature selection steps should be performed after the train/test split, on the training data only, to avoid leaking information from the test data into model training. Popular methods for feature selection include applying a variance threshold (e.g., https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) or selecting by F-score (ANOVA). Applying feature selection before the split is one of the most frequent errors among novices; it leads to overestimated model performance and incorrect subsequent interpretation (see DOI: 10.1021/acs.jproteome.2c00117 for an insightful explanation).
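To illustrate the point, here is a minimal sketch of the leakage-safe ordering. It uses scikit-learn's VarianceThreshold as a stand-in for a variance-based filter (it is not the toolbox's own varianceRatio filter), and the data matrix, labels, threshold, and classifier are all made-up assumptions chosen only for the example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for an intensity matrix (samples x features); not real data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
X[:, :5] *= 0.01                      # make the first five features near-constant
y = rng.integers(0, 2, size=100)      # arbitrary binary class labels

# 1. Split first, before any variance-based feature selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Put the selector inside the pipeline so it is fitted on training data only.
#    The threshold of 0.01 is an illustrative value, not a recommendation.
model = make_pipeline(
    VarianceThreshold(threshold=0.01),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# 3. The mask learned from the training set is re-applied to the test set
#    automatically at prediction time; test samples never influence it.
kept = model.named_steps["variancethreshold"].get_support()
predictions = model.predict(X_test)
print(f"features kept: {kept.sum()} of {kept.size}")
```

Because the selector is a pipeline step, the same pattern carries over to cross-validation (for example with cross_val_score), where the mask is re-estimated within each training fold.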
Suggestion: Either add a warning to the documentation that applying the varianceRatio filter may compromise data integrity for later ML analysis, or change the default value of varianceRatio to 1.0 (currently 1.1), which turns this filter off by default.
By the way, all the other standard feature filters (corrThreshold, rsdThreshold, etc.) are not problematic, since they filter on robustness criteria rather than on biological variance!