Median absolute deviation feature selection #22
In 34225cc -- an example of classifying TP53 mutations -- we did not apply MAD feature selection (notebook). In a8ae611 (pull request #25), @yl565 selected the top 500 MAD genes (notebook). Before MAD feature selection, training AUROC was 95.9% and testing AUROC was 93.5%. After MAD feature selection, training AUROC was 89.9% and testing AUROC was 87.9%. @yl565, did anything else change in your pull request that would negatively affect performance? If not, I think we may have an example of 500 MAD genes being detrimental. See @gwaygenomics's analysis for benchmarking on RAS mutations: keeping only 500 genes appears to be borderline dangerous.
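For reference, here is a minimal sketch of this kind of top-N MAD selection, assuming an expression DataFrame with samples as rows and genes as columns; the helper name and layout are illustrative, not the notebook's actual code:

```python
# Sketch only: rank genes by median absolute deviation in the training data
# and keep the top n. The helper name and DataFrame layout are assumptions.
import pandas as pd

def top_mad_genes(X_train: pd.DataFrame, n_genes: int = 500) -> pd.Index:
    """Return the n_genes columns (genes) with the largest MAD in X_train."""
    mad = (X_train - X_train.median()).abs().median()
    return mad.nlargest(n_genes).index

# Apply the same training-derived gene list to both splits:
# genes = top_mad_genes(X_train, n_genes=500)
# X_train, X_test = X_train[genes], X_test[genes]
```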
@yl565 really informative analysis. Can you share the source code? Check out GitHub gists if you want a quick way to host a single notebook. Also, I'd love to see the graph extended to all ~20,000 genes. I'm having some trouble comprehending why performance drops off when you feature select and scale on the training data alone.
Because there are differences in distribution between the training and testing sets. This figure shows the genes with the largest differences between the training and testing data. I guess 7,000 samples are not enough to represent the gene variation of the population. Here is the code:
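For anyone who wants to reproduce that kind of comparison, here is a rough sketch (not the code linked above) that ranks genes by how far their median shifts between the training and testing splits, scaled by a pooled MAD:

```python
# Hypothetical helper, not the notebook's code: flag the genes whose
# distributions differ most between the training and testing splits.
import numpy as np
import pandas as pd

def largest_train_test_shifts(X_train: pd.DataFrame,
                              X_test: pd.DataFrame,
                              top_n: int = 20) -> pd.Series:
    """Standardized absolute median shift per gene; returns the largest top_n."""
    pooled = pd.concat([X_train, X_test])
    pooled_mad = (pooled - pooled.median()).abs().median().replace(0, np.nan)
    shift = (X_train.median() - X_test.median()).abs() / pooled_mad
    return shift.nlargest(top_n)
```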
@gwaygenomics presented evidence that median absolute deviation (MAD) feature selection (selecting genes with the highest MADs) can eliminate most features without hurting performance: #18 (comment). In fact, it appears that performance increased with the feature selection, which could make sense if the selection enriched for predictive features, increasing the signal-to-noise ratio.
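One way to carry that benchmarking further would be to sweep the number of MAD-selected genes and track cross-validated AUROC. A sketch under assumed defaults (logistic regression, 5-fold CV); this is not the project's actual pipeline, and note that the MAD ranking here is computed once on all of `X` rather than refit inside each fold:

```python
# Sketch of an n_genes sweep; the classifier, CV settings, and helper name
# are assumptions, not the project's actual setup.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def mad_sweep(X: pd.DataFrame, y, n_genes_grid=(100, 500, 1000, 5000, None)):
    """Mean cross-validated AUROC for each feature count (None = all genes)."""
    mad = (X - X.median()).abs().median()
    results = {}
    for n in n_genes_grid:
        genes = X.columns if n is None else mad.nlargest(n).index
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        scores = cross_val_score(model, X[genes], y, scoring='roc_auc', cv=5)
        results['all' if n is None else n] = scores.mean()
    return pd.Series(results, name='mean_auroc')
```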
Therefore, I think we should investigate this method of feature selection further. Specifically, I'm curious whether:
I'm labeling this issue a task, so please investigate if you feel inclined.