Median absolute deviation feature selection #22

Open

dhimmel opened this issue Aug 1, 2016 · 4 comments

dhimmel (Member) commented Aug 1, 2016

@gwaygenomics presented evidence that median absolute deviation (MAD) feature selection (selecting genes with the highest MADs) can eliminate most features without hurting performance: #18 (comment). In fact, it appears that performance increased with the feature selection, which could make sense if the selection enriched for predictive features, increasing the signal-to-noise ratio.

Therefore, I think we should investigate this method of feature selection further. Specifically, I'm curious whether:

- @gwaygenomics' findings hold true for outcomes other than RAS?
- MAD outperforms MAD / median? I suspect MAD alone is biased against genes that are lowly expressed but still variable.
- MAD outperforms random selection of the same feature set size?
- MAD performs well for algorithms other than logistic regression?

I'm labeling this issue a task, so please investigate if you feel inclined.
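
For concreteness, here is a minimal sketch of both ranking strategies, assuming the expression matrix is a pandas DataFrame of samples × genes; the function and argument names are illustrative, not from our codebase:

```python
import pandas as pd

def select_top_mad_genes(X: pd.DataFrame, n_features: int = 500,
                         normalize_by_median: bool = False) -> pd.DataFrame:
    """Keep the n_features genes with the highest MAD (or MAD / median)."""
    median = X.median(axis=0)
    # Unscaled median absolute deviation per gene
    mad = (X - median).abs().median(axis=0)
    if normalize_by_median:
        # MAD / median may avoid penalizing lowly expressed but variable
        # genes; guard against division by zero for unexpressed genes.
        mad = mad / median.replace(0, float('nan'))
    top_genes = mad.nlargest(n_features).index
    return X.loc[:, top_genes]
```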

dhimmel (Member, Author) commented Aug 7, 2016

In 34225cc -- an example for classifying TP53 mutation -- we did not apply MAD feature selection (notebook). In a8ae611 (pull request #25), @yl565 selected the top 500 MAD genes (notebook).

Before MAD feature selection, training AUROC was 95.9% and testing AUROC was 93.5%. After MAD feature selection, training AUROC was 89.9% and testing AUROC was 87.9%. @yl565, did anything else change in your pull request that would negatively affect performance? If not, I think we may have an example of 500 MAD genes being detrimental. See @gwaygenomics' analysis benchmarking RAS mutations: 500 genes appears to be a borderline choice.

yl565 (Contributor) commented Aug 7, 2016

Since a pipeline is used, only X_train is used for feature selection and standardization. This decreases AUROC, but I think it better reflects reality: we want the classifier to predict whether a gene is mutated in a new patient, so in practice X_test is only one sample at a time. Using the entire dataset X for feature selection and standardization would cause overfitting. This figure compares testing AUROC across varying numbers of features selected by MAD:
[Figure: testing AUROC versus number of MAD-selected features]
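
For reference, a minimal sketch of that setup, assuming scikit-learn; k=500 and the X_train/y_train names are illustrative. Because the selector and scaler are steps in the Pipeline, their statistics are fit on the training fold only:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def mad_score(X, y=None):
    # Unsupervised score: per-feature median absolute deviation
    return np.median(np.abs(X - np.median(X, axis=0)), axis=0)

pipeline = Pipeline([
    ('select', SelectKBest(score_func=mad_score, k=500)),
    ('scale', StandardScaler()),
    ('classify', LogisticRegression()),
])

# MAD ranking and scaling statistics never see the testing data
pipeline.fit(X_train, y_train)
testing_auroc = roc_auc_score(y_test, pipeline.decision_function(X_test))
```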

dhimmel (Member, Author) commented Aug 8, 2016

@yl565, really informative analysis. Can you share the source code? Check out GitHub gists if you want a quick way to host a single notebook. Also, I'd love to see the graph extended to all ~20,000 genes.

I'm having some trouble understanding why performance drops off when you select features and scale on X_train only. I wouldn't expect our unsupervised selection and scaling to cause overfitting, and X_test is only 10% of the samples. Do you have any insight?

yl565 (Contributor) commented Aug 8, 2016

Because there are differences in distribution between the training and testing sets. This figure shows the genes that differ most between the training and testing data. I suspect 7000 samples are not enough to represent the gene variation of the population:
[Figure: distributions of the genes that differ most between training and testing data]

Here is the code:
https://gist.github.com/yl565/1a978e358a00dea573590e0456dfc1b2#file-1-tcga-mlexample-effectoffeaturenumbers-ipynb
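
If anyone wants to reproduce the comparison, here is a minimal sketch of one way to rank genes by train/test shift, assuming X_train and X_test are pandas DataFrames of samples × genes; the median-shift metric is illustrative and not necessarily what the figure above uses:

```python
import numpy as np

def train_test_median_shift(X_train, X_test):
    """Rank genes by how far the testing median sits from the training
    median, measured in units of the training MAD."""
    train_median = X_train.median(axis=0)
    train_mad = (X_train - train_median).abs().median(axis=0)
    shift = (X_test.median(axis=0) - train_median).abs()
    # NaN out constant genes instead of dividing by zero
    return (shift / train_mad.replace(0, np.nan)).sort_values(ascending=False)
```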
