varianceRatio SOP may lead to data leakage for ML applications #113

misch91 · 2024-08-09T11:17:08Z

Dear all

Of all the filters for MSDataset objects, varianceRatio should be applied with caution, especially if the data will later be processed with machine learning methods that involve a train/test split.

Reasoning: In ML, all feature selection steps must be performed after the train/test split, and on the training data only, to prevent information from the test data leaking into model building. One of the many popular feature selection methods is a variance threshold (e.g., https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) or selection by F-score (ANOVA). Applying feature selection before the split is one of the most frequent errors among novices; it leads to overestimated model performance and incorrect subsequent interpretation (see DOI: 10.1021/acs.jproteome.2c00117 for an insightful explanation).
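To make the ordering concrete, here is a minimal standard-library sketch of "split first, then select on the training data only". It uses a plain variance threshold as a stand-in and is not the toolbox's varianceRatio implementation. The helper names (`fit_variance_mask`, `apply_mask`), the threshold value, and the synthetic data are all hypothetical and chosen for illustration only.

```python
import random
from statistics import pvariance


def fit_variance_mask(rows, threshold):
    """Return one boolean per feature: True if its variance in `rows` exceeds `threshold`."""
    columns = list(zip(*rows))  # transpose: one tuple of values per feature
    return [pvariance(col) > threshold for col in columns]


def apply_mask(rows, mask):
    """Keep only the features flagged True in `mask`."""
    return [[value for value, keep in zip(row, mask) if keep] for row in rows]


random.seed(0)
# Synthetic data (illustrative assumption): 40 samples, 3 features
# with standard deviations 1.0, 0.01 and 2.0.
data = [
    [random.gauss(0, 1.0), random.gauss(0, 0.01), random.gauss(0, 2.0)]
    for _ in range(40)
]

# 1. Split first, before any data-dependent filtering.
train, test = data[:30], data[30:]

# 2. Learn the feature mask from the training samples only.
mask = fit_variance_mask(train, threshold=0.05)

# 3. Apply the same, already fixed mask to both sets.
train_sel = apply_mask(train, mask)
test_sel = apply_mask(test, mask)
```

The test samples never contribute to the variance estimate, so they cannot influence which features survive. In scikit-learn, the same ordering should follow from placing `VarianceThreshold` inside a `Pipeline` that is fitted on the training split only.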

Suggestion: Either add a warning to the documentation that applying the varianceRatio filter may compromise the validity of later ML analysis, or change the default value of varianceRatio to 1.0 (currently 1.1), which turns this filter off by default.

By the way, the other standard feature filters (corrThreshold, rsdThreshold, etc.) are not problematic, as they filter by robustness criteria, not by biological variance!
