varianceRatio SOP may lead to data leakage for ML applications #113

Open
misch91 opened this issue Aug 9, 2024 · 0 comments

misch91 commented Aug 9, 2024

Dear all

Of all the filters for MSDataset objects, varianceRatio should be applied with caution, especially if the data will later be processed with machine learning methods that involve a train/test split.

Reasoning: In ML, every feature selection step must be performed after the train/test split and on the training data only; otherwise information from the test data leaks into model building. Popular feature selection methods include applying a variance threshold (e.g., https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) and selecting by F-score (ANOVA). Applying feature selection before the split is one of the most frequent errors among novices; it leads to overestimated model performance and incorrect subsequent interpretation (see DOI: 10.1021/acs.jproteome.2c00117 for an insightful explanation).
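
To make the leakage concrete, here is a minimal, dependency-free sketch. It is not the toolbox's varianceRatio implementation; the helper names (`fit_variance_mask`, `apply_mask`), the toy data, and the threshold of 0.5 are all hypothetical and only illustrate a generic variance filter. The point is that the keep/drop decision is computed from the training rows alone and then reused unchanged on the test rows.

```python
from statistics import pvariance


def fit_variance_mask(train_rows, threshold):
    """Decide per feature whether to keep it, using the given rows only."""
    columns = list(zip(*train_rows))  # transpose: one tuple per feature
    return [pvariance(col) > threshold for col in columns]


def apply_mask(rows, mask):
    """Drop the features flagged False in a previously fitted mask."""
    return [[v for v, keep in zip(row, mask) if keep] for row in rows]


# Hypothetical toy data: 6 samples x 3 features.
# Feature 2 is constant in the first four samples and varies only in the last two.
samples = [
    [1.0, 5.0, 0.0],
    [1.1, 9.0, 0.0],
    [0.9, 2.0, 0.0],
    [1.0, 7.0, 0.0],
    [1.0, 4.0, 3.0],
    [1.1, 8.0, -3.0],
]
train, test = samples[:4], samples[4:]

# Correct order: split first, fit the filter on the training rows only.
mask = fit_variance_mask(train, threshold=0.5)
train_sel = apply_mask(train, mask)
test_sel = apply_mask(test, mask)

# Leaky order: the filter sees the test rows before the split.
leaky_mask = fit_variance_mask(samples, threshold=0.5)

print(mask)        # selection based on training data only
print(leaky_mask)  # selection influenced by the test samples
```

In this toy case the two masks differ for the third feature, whose variance comes entirely from the test samples. With scikit-learn, placing `VarianceThreshold` inside a `Pipeline` that is fitted on the training split gives the same guarantee.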

Suggestion: Either add a warning to the documentation that applying the varianceRatio filter may compromise the integrity of later ML analyses, or change the default value of varianceRatio to 1.0 (currently 1.1), which turns this filter off by default.

By the way, the other standard feature filters (corrThreshold, rsdThreshold, etc.) are not problematic, as they filter by robustness criteria, not by biological variance!
