Filtering samples is (potentially) too strict #43
So the issue is that:
I think it's a mistake to remove samples because they have no mutations, as long as we know those samples are cancers (and not normal tissues... which I think we do). On the other hand, it was presumably me who implemented this filter. So why would I have done something like that (which seems now to be throwing away good data)? Samples without any mutations can never be positives in the Cognoma ML framework... but they are important negatives. The fact that you can get the cancer without a mutation is of course something that we should model and not ignore. |
What is a "red" mutation? |
A mutation Xena considers to be severe. #2 (comment) From http://xena.ucsc.edu/how-we-characterize-mutations/
|
Oh - that is extremely conservative. Point mutations don't make it in (basically all the activating Ras mutations are point mutations). Does cognoma actually work for Ras? We should at least include Red and Blue. |
We do include both Red and Blue mutations - my mistake |
Are we absolutely sure of that? I would find it quite implausible that there are no more than 14 HGSCs with at least one missense mutation. 95% of them are TP53 mutated, right? |
The source code says: cancer-data/scripts/2.TCGA-process.py Lines 245 to 261 in 383668e
|
I think it's likely that someone's going to have to walk through this to verify that |
Take a look at the source for constructing the mutation matrix: cancer-data/scripts/2.TCGA-process.py Lines 288 to 295 in 383668e
So the reason we exclude samples with no mutations is that, unless a sample has at least one mutation, we don't actually know whether mutations were called for it. @gwaygenomics do you know a workaround? |
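One possible workaround, sketched here under assumptions not confirmed in this thread: if a separate list of samples that went through mutation calling were available (the hypothetical `sequenced_samples` below), the matrix built from a long-format mutation table could be reindexed against it. Called samples with no qualifying mutations would then become all-zero rows instead of disappearing. All sample IDs, column names, and values are toy data.

```python
import pandas as pd

# Toy long-format mutation table: one row per (sample, gene) mutation call.
# Column names and IDs are illustrative, not those of the real TCGA files.
mutation_df = pd.DataFrame({
    "sample_id": ["S1", "S1", "S3"],
    "gene": ["TP53", "KRAS", "TP53"],
})

# Hypothetical list of samples known to have gone through mutation calling.
sequenced_samples = ["S1", "S2", "S3"]

# Pivot to a binary sample-by-gene matrix. S2 has no rows, so it is absent here.
observed = pd.crosstab(mutation_df["sample_id"], mutation_df["gene"]).clip(upper=1)

# Reindex against the sequenced samples so S2 appears as an all-zero row.
mutation_mat = observed.reindex(sequenced_samples, fill_value=0)
```

Whether such a list of called samples can be recovered from the source data is exactly the open question in the comment above.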
According to
Will chat with @gwaygenomics regarding the apparent cBioPortal discrepancy. |
just chatted with @dhimmel
We agree - definitely hinting at something being up. We also noticed the addition of a precompiled binary matrix file. This appears to be a new addition to xena. Need to explore further, but this may save us from needing to process ourselves |
Ok - in this binary matrix from Xena, they do rescue many OV samples.

```python
import pandas as pd
xena_binary = pd.read_table('mc3.v0.2.8.PUBLIC.nonsilentGene.xena.gz', sep='\t', index_col=0)
# This clinical matrix as processed in https://github.com/cognoma/core-service/issues/99#issuecomment-380876551
ov_samples = clinmat_df.query("acronym == 'OV'").index.tolist()
ov_xena_df = xena_binary.loc[:, ov_samples].dropna(axis='columns')
ov_xena_df.shape
```

```
(40543, 62)
```

And, as a test, the TP53 counts look on target:

```python
ov_xena_df.loc['TP53', :].value_counts()
```

```
1    54
0     8
Name: TP53, dtype: int64
```

I presume that this will rescue many other samples from other cancer-types as well. |
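A note on the column selection in the comment above: recent pandas versions raise a `KeyError` when `.loc` is given a list containing labels that are missing, so selecting by the full OV sample list and then calling `dropna` may fail if some OV samples have no column in the Xena matrix. A minimal sketch of an equivalent selection using an index intersection, on toy data (sample IDs and values are made up):

```python
import pandas as pd

# Toy gene-by-sample binary matrix standing in for the Xena file.
xena_binary = pd.DataFrame(
    {"A": [1, 0], "B": [0, 0], "C": [1, 1]},
    index=["TP53", "KRAS"],
)

# Toy list of OV sample IDs; "D" has no column in the mutation matrix.
ov_samples = ["A", "C", "D"]

# Keep only the OV samples that actually have a column in the matrix.
present = xena_binary.columns.intersection(ov_samples)
ov_xena_df = xena_binary.loc[:, present]

# Mutated (1) versus non-mutated (0) counts for one gene.
tp53_counts = ov_xena_df.loc["TP53", :].value_counts()
```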
Still only 62 samples that make it through? That still seems incredibly low. This means - if I understand correctly - that what we are saying is that there are hundreds of ovarian cancers with no mutations in the blue and red category. Am I understanding this correctly? |
Oh - wait - as I'm thinking about it - are these the ovarian cancer samples that were subject to whole genome amplification and thus where we think the calls may be problematic? I think there was a paper on this. Are the dropouts for other cancers as bad? @gwaygenomics : does this match the dataset used in the TP53 classifier paper? |
Yeah, I think this is part of the reason why they're filtered (quite stringently) here.
I will have to check exact numbers when I'm back at my desk, but I do think it impacted other cancer-types. Although I think OV will end up being the most drastic.
We dropped OV from training because of the TP53 status imbalance, but we were still able to make predictions on the full gene expression dataset. See Figure S6 of that paper. Our predictions align with the cBioPortal link posted previously in this thread! |
After thinking for a bit, I think it may be best for cognoma to use the binary matrix compiled by xena and get the intersection of datasets (as we had been doing previously). This is simpler, reduces processing requirements, and contains high confidence calls. We will also need to emphasize where the data is coming from and how it's processed on the cognoma homepage, and also return downloading scripts when the classifier is emailed back to the user. The alternative would be to include less confident calls as mutation events, which, if I am remembering correctly, we did in Figure S6. This is a legitimate option since it retains more samples, and there is some (although less confident) evidence the mutations are real in the sample. As @dhimmel pointed out, it would be better to throw these samples out (taking the intersection of datasets) than to assume they have zero mutations. |
For ovarian cancer and TP53, there are 11 positives and 3 negatives in the aligned dataset (gene expression and mutation data). However, in the complete data, there are 54 positives and 8 negatives. So I think the issue here is that many ovarian cancer samples are missing from the expression dataset. Note that we make complete data available, but it doesn't help with cognoma classifiers.
One issue IIRC with the binary matrix is that it requires us to map symbols to Entrez gene IDs without chromosome information, which reduces our ability to map. |
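The gap between the complete and aligned counts comes from intersecting sample IDs across datasets. A minimal sketch on toy data (all IDs and values are invented for illustration) of how a gene's positive and negative counts shrink once samples lacking expression data are dropped:

```python
import pandas as pd

# Toy TP53 status per sample from the complete mutation matrix (1 = mutated).
tp53_status = pd.Series(
    [1, 1, 1, 0, 0],
    index=["S1", "S2", "S3", "S4", "S5"],
    name="TP53",
)

# Toy set of samples that also have gene expression measurements.
expression_samples = pd.Index(["S1", "S4", "S6"])

# Aligned samples are those present in both datasets.
aligned = tp53_status.index.intersection(expression_samples)

# Positive/negative counts before and after alignment.
complete_counts = tp53_status.value_counts()
aligned_counts = tp53_status.loc[aligned].value_counts()
```

Comparing `complete_counts` with `aligned_counts` per disease would show which cancer types lose the most samples to missing expression data.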
The Xena matrix

```python
>>> import pandas
>>> url = 'https://pancanatlas.xenahubs.net/download/mc3.v0.2.8.PUBLIC.nonsilentGene.xena.gz'
>>> df = pandas.read_table(url, index_col='sample')
>>> df.sum(axis='rows').describe()
count    9104.000000
mean      171.219025
std       519.085213
min         0.000000
25%        26.000000
50%        55.000000
75%       126.000000
max      8354.000000
```

Hence, some samples have zero mutations in this dataset. According to |
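To make the `min 0.000000` row concrete, the same per-sample sum can be compared against zero to list the samples with no nonsilent mutations. A small sketch on a toy matrix (values and IDs are illustrative, not drawn from the Xena file):

```python
import pandas as pd

# Toy gene-by-sample binary matrix; columns are samples, as in the Xena file.
df = pd.DataFrame(
    {"S1": [1, 0, 1], "S2": [0, 0, 0], "S3": [0, 1, 0]},
    index=pd.Index(["TP53", "KRAS", "PTEN"], name="sample"),
)

# Sum down each column (axis=0) to get the number of mutated genes per sample.
mutations_per_sample = df.sum(axis=0)

# Samples whose column is all zeros.
zero_mutation_samples = mutations_per_sample[mutations_per_sample == 0].index.tolist()
```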
Merges #44
* Retain zero-mutation samples
  Refs #43 (comment)
  Increases the complete mutation matrix to 9104 samples from 9093 and the aligned mutation matrix to 8397 from 8388
* Create data/diseases.tsv with summary info
  Supersedes #45
Is this based on the previous expression dataset processing? If I remember correctly, we were removing samples with |
No on the current. See the latest |
Alright, so after thinking some more about this (and based on input from @cgreene ) we should decide to process mutation data for cognoma based on what we think is the right answer, plus what is maintainable. Our options, as far as I see them (from least to most conservative) are:
After chatting with @dhimmel - all are valid options (will depend on project hypotheses and input genes to be classified) but many will require additional maintenance overhead. We are not tied to Xena data, but removing this dependency will require substantial additional processing. Since it is certainly valid to retain our current mechanism of creating a high confidence true positive binary matrix, and it requires the least amount of maintenance, I think we agreed to keep cognoma data this way for now. @dhimmel, is this an accurate description? |
Thanks @gwaygenomics for breaking down the four filtering options. I think it's helpful to understand what processing steps have gone into our mutation matrix. While different use cases will prefer different levels of processing, I think our current implementation of high-confidence calls (pass filter) with probable effects (red or blue) is safe and versatile. By versatile, I mean suited to many downstream applications including Cognoma classifiers. By safe, I mean likely to avoid certain false conclusions, like associations with low-quality calls or silent mutations. These decisions can always be revisited, but without clear evidence and demand from a downstream analysis to do so, I think our time is better spent elsewhere. Thus, I'll close this issue, and we can discuss the potential inclusion of metastases in #46. |
Data is currently processed in https://github.com/cognoma/cancer-data/blob/master/2.TCGA-process.ipynb and the final matrices used in downstream analyses include samples that have mutation, expression, and clinical measurements and were not filtered for other reasons.
@kurtwheeler pointed out in cognoma/core-service#99 a potential issue that the current implementation is not finding samples it should. @dhimmel discovered that this was not an issue (at least not primarily an issue) of the backend, but of the data itself.
I outlined current problems with the data in cognoma/core-service#99 (comment) but we can continue this discussion here.