Median absolute deviation feature selection #22

Open

dhimmel opened this issue Aug 1, 2016 · 4 comments

dhimmel (Member) commented Aug 1, 2016

@gwaygenomics presented evidence that median absolute deviation (MAD) feature selection (selecting genes with the highest MADs) can eliminate most features without hurting performance: #18 (comment). In fact, it appears that performance increased with the feature selection, which could make sense if the selection enriched for predictive features, increasing the signal-to-noise ratio.

Therefore, I think we should investigate this method of feature selection further. Specifically, I'm curious whether:

- @gwaygenomics' findings hold true for outcomes other than RAS?
- MAD outperforms MAD / median? I suspect MAD alone is biased against genes that are lowly expressed but still variable.
- MAD outperforms random selection of the same feature set size?
- MAD performs well for algorithms other than logistic regression?

I'm labeling this issue a task, so please investigate if you feel inclined.
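
For concreteness, here is a minimal sketch of both ranking strategies, assuming the expression matrix is a pandas DataFrame of samples × genes; the function and argument names are illustrative, not from our codebase:

```python
import pandas as pd

def select_top_mad_genes(X: pd.DataFrame, n_features: int = 500,
                         normalize_by_median: bool = False) -> pd.DataFrame:
    """Keep the n_features genes with the highest MAD (or MAD / median)."""
    median = X.median(axis=0)
    # Unscaled median absolute deviation per gene
    mad = (X - median).abs().median(axis=0)
    if normalize_by_median:
        # MAD / median may avoid penalizing lowly expressed but variable
        # genes; guard against division by zero for unexpressed genes.
        mad = mad / median.replace(0, float('nan'))
    top_genes = mad.nlargest(n_features).index
    return X.loc[:, top_genes]
```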

dhimmel (Member, Author) commented Aug 7, 2016

In 34225cc -- an example for classifying TP53 mutation -- we did not apply MAD feature selection (notebook). In a8ae611 (pull request #25), @yl565 selected the top 500 MAD genes (notebook).

Before MAD feature selection, training AUROC was 95.9% and testing AUROC was 93.5%. After MAD feature selection, training AUROC was 89.9% and testing AUROC was 87.9%. @yl565, did anything else change in your pull request that would negatively affect performance? If not, I think we may have an example of 500 MAD genes being detrimental. See @gwaygenomics' analysis benchmarking RAS mutations: 500 genes appears to be a borderline choice.

yl565 (Contributor) commented Aug 7, 2016

Since a pipeline is used, only X_train is used for feature selection and standardization. This decreases AUROC, but I think it better reflects reality: we want the classifier to predict whether a gene is mutated in a new patient, so in practice X_test is only one sample at a time. Using the entire dataset X for feature selection and standardization would cause overfitting. This figure compares testing AUROC across varying numbers of features selected by MAD:
[Figure: testing AUROC versus number of MAD-selected features]
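
For reference, a minimal sketch of that setup, assuming scikit-learn; k=500 and the X_train/y_train names are illustrative. Because the selector and scaler are steps in the Pipeline, their statistics are fit on the training fold only:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def mad_score(X, y=None):
    # Unsupervised score: per-feature median absolute deviation
    return np.median(np.abs(X - np.median(X, axis=0)), axis=0)

pipeline = Pipeline([
    ('select', SelectKBest(score_func=mad_score, k=500)),
    ('scale', StandardScaler()),
    ('classify', LogisticRegression()),
])

# MAD ranking and scaling statistics never see the testing data
pipeline.fit(X_train, y_train)
testing_auroc = roc_auc_score(y_test, pipeline.decision_function(X_test))
```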

dhimmel (Member, Author) commented Aug 8, 2016

@yl565, really informative analysis. Can you share the source code? Check out GitHub gists if you want a quick way to host a single notebook. Also, I'd love to see the graph extended to all ~20,000 genes.

I'm having some trouble understanding why performance drops off when you select features and scale on X_train only. I wouldn't expect our unsupervised selection and scaling to cause overfitting, and X_test is only 10% of the samples. Do you have any insight?

yl565 (Contributor) commented Aug 8, 2016

Because there are differences in distribution between the training and testing sets. This figure shows the genes that differ most between the training and testing data. I suspect 7000 samples are not enough to represent the gene variation of the population:
[Figure: distributions of the genes that differ most between training and testing data]

Here is the code:
https://gist.github.com/yl565/1a978e358a00dea573590e0456dfc1b2#file-1-tcga-mlexample-effectoffeaturenumbers-ipynb
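
If anyone wants to reproduce the comparison, here is a minimal sketch of one way to rank genes by train/test shift, assuming X_train and X_test are pandas DataFrames of samples × genes; the median-shift metric is illustrative and not necessarily what the figure above uses:

```python
import numpy as np

def train_test_median_shift(X_train, X_test):
    """Rank genes by how far the testing median sits from the training
    median, measured in units of the training MAD."""
    train_median = X_train.median(axis=0)
    train_mad = (X_train - train_median).abs().median(axis=0)
    shift = (X_test.median(axis=0) - train_median).abs()
    # NaN out constant genes instead of dividing by zero
    return (shift / train_mad.replace(0, np.nan)).sort_values(ascending=False)
```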
