Could a general mutation-load pattern confound mutation-specific signals? #8
This is an interesting question. A quick search of the academic literature turned up mutational signatures for cancers with heavy mutation load (the types of mutations that occur in these cancers seem to differ). I didn't find anything on a gene expression pattern common to them. It may be important to control for confounding by cancer type (for example, pick the most mutated 10% within each cancer type as positives and the least mutated 10% as negatives). I think this is an interesting question that may have just created another use case!
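A minimal sketch of the within-cancer-type labeling idea, in plain Python. The input layout (tuples of sample id, cancer type, mutation count) and the function name are assumptions made for illustration, not the project's actual data format.

```python
from collections import defaultdict

def label_by_load_within_type(samples, fraction=0.10):
    """Label the most/least mutated samples within each cancer type.

    samples: iterable of (sample_id, cancer_type, mutation_count) tuples
             (hypothetical layout).
    Returns {sample_id: 1} for the top `fraction` and {sample_id: 0} for
    the bottom `fraction` of each cancer type; other samples are left out.
    """
    by_type = defaultdict(list)
    for sample_id, cancer_type, count in samples:
        by_type[cancer_type].append((count, sample_id))

    labels = {}
    for rows in by_type.values():
        rows.sort()  # ascending by mutation count, ties broken by sample id
        k = int(len(rows) * fraction)
        if k == 0:
            continue  # too few samples in this cancer type to label
        for _, sample_id in rows[:k]:
            labels[sample_id] = 0  # least mutated
        for _, sample_id in rows[-k:]:
            labels[sample_id] = 1  # most mutated
    return labels
```

Because the cut-offs are computed separately per cancer type, a cancer type with globally high mutation counts does not supply all of the positives.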
on a call now and this issue was mentioned. Really, the issue is mainly that these hyper-mutated tumors have a ton of passenger mutations that would contaminate gold standards. The proposed solution involves subsetting mutations using Cancer Hotspots as defined by Chang et al. Essentially, the group only considers a sample to have a mutation in a given gene if the mutation is found in this database. I don't necessarily know what to do with this info, or if it even makes sense to use at all, but generally, using it would increase the percentage of true positives while also increasing false negatives.
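The hotspot filter described above can be sketched roughly as follows. The gene names, residue positions, and the `(gene, residue)` key are toy placeholders and assumptions about how a hotspot table might be represented; they are not taken from the Chang et al. database.

```python
from collections import defaultdict

def hotspot_status(mutation_calls, hotspots):
    """Keep only mutations that fall on a known hotspot residue.

    mutation_calls: iterable of (sample_id, gene, residue) tuples (hypothetical layout).
    hotspots: set of (gene, residue) pairs (hypothetical layout).
    Returns {sample_id: set of genes counted as mutated}.
    """
    status = defaultdict(set)
    for sample_id, gene, residue in mutation_calls:
        if (gene, residue) in hotspots:
            status[sample_id].add(gene)
    return dict(status)

# Toy data only: GENE_A / GENE_B and the residue numbers are made up.
toy_hotspots = {("GENE_A", 12), ("GENE_B", 273)}
toy_calls = [
    ("s1", "GENE_A", 12),   # on a hotspot: kept
    ("s1", "GENE_B", 5),    # off hotspot: dropped
    ("s2", "GENE_A", 40),   # off hotspot: s2 is treated as not mutated
    ("s3", "GENE_B", 273),  # on a hotspot: kept
]
toy_status = hotspot_status(toy_calls, toy_hotspots)
```

Sample `s2` illustrates the trade-off: if its off-hotspot mutation were in fact functional, the filter would turn it into a false negative.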
What do you mean by true positives and false negatives? From Chang et al.:
So if we were to only count mutations that were in recurrently mutated residues (cancer hotspots), we would only be able to offer our users a choice between 275 genes — not good? Additionally, I'm not sure I see:
However, I still think a covariate is the way to go and can address most of the problem. A good first analysis to see the extent of this problem would be to measure the AUROC of total mutation count as a predictor of TP53 mutation status.
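As a rough sketch of that first analysis, the AUROC can be computed directly as the probability that a randomly chosen mutated sample has a higher total mutation count than a randomly chosen non-mutated one (ties count as half). The function name and the toy numbers are illustrative assumptions; `sklearn.metrics.roc_auc_score` computes the same quantity and would be the more practical choice on real data.

```python
def auroc(labels, scores):
    """Pairwise (Mann-Whitney) AUROC.

    labels: 1 for samples with the mutation, 0 otherwise.
    scores: total mutation count per sample, in the same order.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy values, not real TCGA data.
toy_status = [0, 0, 1, 1]
toy_mutation_counts = [10, 50, 40, 300]
toy_auroc = auroc(toy_status, toy_mutation_counts)
```

An AUROC near 0.5 would suggest mutation load carries little information about mutation status; values well above 0.5 would indicate the confounding is worth adjusting for.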
True positives meaning samples that actually have a deleterious mutation in the given gene (either an activating or inactivating mutation) that leads to a gene expression based signature representative of the normal gene activity being lost. False negatives meaning samples that actually do have the irregular gene expression signature but are incorrectly considered a "0" or "not mutated". Either will decrease the classifier's performance. We can get a false negative from either:
Probably not good, I agree.
aside from removing samples with high mutation load, I don't think anything we do will fully eliminate this confounding. Restricting to hotspots for these samples will remove many passenger mutations that are less likely to alter gene expression signatures associated with the mutation of associated input genes. Adjusting for them when building a model could work nicely too.
The 'let it learn' argument makes much more sense in an unsupervised setting. For a supervised algorithm we are severely impacted by false labeling information and the first question when troubleshooting performance should always be: "is my data good?"
I think this is a great idea! Although we probably should approach it using a gene other than TP53. Since TP53 is crucial for DNA repair, tumors with the defective protein are likely to have more mutations than tumors with wildtype TP53. I would recommend building a new classifier for RAS or NF1, or we can even try using the genes in a pathway, e.g. the Hippo signalling pathway, to test this hypothesis. In general, I would be in favor of sticking with our filtered mutation calls as a gold standard for now (at least until cleaner data comes in 😃) and testing to see how much of an impact mutation load has on predictions.
See how well covariates (non-expression features) predict TP53 mutation. Related to cognoma#8: General mutation-load does provide some ability to predict mutation status of TP53. Partially addresses cognoma#21: Covariates are extracted from samples.tsv.
Creates an explore directory and README for this type of exploratory notebook. See how well covariates (non-expression features) predict TP53 mutation. Related to cognoma#8: General mutation-load does provide some ability to predict mutation status of TP53. Partially addresses cognoma#21: Covariates are extracted from samples.tsv.
* Evaluate performance of covariates on TP53

  Creates an explore directory and README for this type of exploratory notebook. See how well covariates (non-expression features) predict TP53 mutation. Related to #8: General mutation-load does provide some ability to predict mutation status of TP53. Partially addresses #21: Covariates are extracted from samples.tsv.

* Evaluate more covariate/mutation combinations

  Evaluate covariate-only classifiers for the interesting mutations compiled in cognoma/cancer-data#22 (comment). Switches to an expand grid system for evaluating all possible covariate combinations. Plot performance of all covariates on each mutation. Switches to `covariates.tsv` created in cognoma/cancer-data#24 for encoded covariates.

* Export clean notebook to script

* Address review comments
Reproducing a comment by @gwaygenomics here:
I did find the Elemento Lab's GitHub organization but I couldn't find the handle for the doctor himself. However, I did find his Twitter, so I'll tweet him the link to this question: Q: We're creating models to predict mutation status at a specific gene using gene expression on TCGA samples. We'd like to add a mutation load covariate and have explored adding

Update: link to Tweet
this issue has come up once again. It appears to be something the field is keenly aware of but does not have a "best" solution for. It also appears to be extremely important when trying to predict the gene expression signature of samples that have DNA damage repair response defects. Some of the solutions I have seen so far:
I have also seen a number of different ways mutation burden is added to the model. I plan on looking into this today at the meetup and exploring some of the solutions.
I think it's likely that there is a general expression pattern for how mutated a tumor is. For example, super mutated tumors may have wacky gene expression, solely because they're super mutated and not specifically because of which exact mutations they contain.
For a given gene, tumors with mutations are more likely to be highly mutated overall. This could cause confounding. It may appear that a mutation is associated with a specific expression pattern, although the signal is actually driven by general mutation-load.
So we may need to end up including a mutation-load covariate. In the meantime, someone should see whether it's possible to use gene expression to predict the mutation-load of each sample (labeling this a task and looking for a volunteer).
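One way such a covariate might be built is sketched below, under stated assumptions: the nested-dict layouts, the function names, and the `log10(1 + n)` transform are illustrative choices, not something settled in this thread. The load is taken as the number of mutated genes per sample and appended to each sample's expression features.

```python
import math

def mutation_load(mutation_matrix):
    """Count mutated genes per sample.

    mutation_matrix: {sample_id: {gene: 0 or 1}} (hypothetical layout).
    """
    return {sample: sum(genes.values()) for sample, genes in mutation_matrix.items()}

def add_load_covariate(expression, load):
    """Append log10(1 + load) to each sample's expression feature list.

    expression: {sample_id: [feature values]} (hypothetical layout).
    Samples without a load value are dropped.
    """
    return {
        sample: features + [math.log10(1 + load[sample])]
        for sample, features in expression.items()
        if sample in load
    }

# Toy data only; the gene columns are just example names.
toy_mutations = {
    "s1": {"TP53": 1, "KRAS": 0, "NF1": 1},
    "s2": {"TP53": 0, "KRAS": 0, "NF1": 0},
}
toy_expression = {"s1": [2.5, 0.1], "s2": [1.0, 3.2]}
toy_features = add_load_covariate(toy_expression, mutation_load(toy_mutations))
```

The same per-sample load values could also serve as the regression target for the proposed task of predicting mutation-load from gene expression.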