Filtering samples is (potentially) too strict #43

Closed
gwaybio opened this issue Apr 12, 2018 · 22 comments

Comments

@gwaybio
Member

gwaybio commented Apr 12, 2018

Data is currently processed in https://github.com/cognoma/cancer-data/blob/master/2.TCGA-process.ipynb and the final matrices used in downstream analyses include samples that have mutation, expression, and clinical measurements and were not filtered for other reasons.

@kurtwheeler pointed out in cognoma/core-service#99 a potential issue that the current implementation is not finding samples it should. @dhimmel discovered that this was not an issue (at least not primarily an issue) of the backend, but of the data itself.

I outlined current problems with the data in cognoma/core-service#99 (comment) but we can continue this discussion here.

@gwaybio changed the title from "Filtering samples is (potentially) to strict" to "Filtering samples is (potentially) too strict" Apr 12, 2018
@dhimmel
Member

dhimmel commented Apr 12, 2018

So the issue is that:

tumors were filtered because they didn't have observed mutations

My thought now is that we remove tumors without any "red" mutations. Relaxing this guideline a bit would be an easy win and could recover many samples

I think it's a mistake to remove samples because they have no mutations, as long as we know those samples are cancers (and not normal tissues... which I think we do). On the other hand, it was presumably me who implemented this filter. So why would I have done something like that (which seems now to be throwing away good data)?

Samples without any mutations can never be positives in the Cognoma ML framework... but they are important negatives. The fact that you can get the cancer without a mutation is of course something that we should model and not ignore.

@cgreene
Member

cgreene commented Apr 12, 2018

What is a "red" mutation?

@dhimmel
Member

dhimmel commented Apr 12, 2018

What is a "red" mutation?

A mutation Xena considers to be severe. #2 (comment)

From http://xena.ucsc.edu/how-we-characterize-mutations/

Red (=1) --> indicates that a non-silent somatic mutation (nonsense, missense, frame-shift indels, splice site mutations, stop codon readthroughs, change of start codon, inframe indels) was identified in the protein coding region of a gene, or any mutation identified in a non-coding gene

@cgreene
Member

cgreene commented Apr 12, 2018

Oh - that is extremely conservative. Point mutations don't make it in (basically all the activating Ras mutations are point mutations). Does cognoma actually work for Ras? We should at least include Red and Blue.

@gwaybio
Member Author

gwaybio commented Apr 12, 2018

We do include both Red and Blue mutations - my mistake

@cgreene
Member

cgreene commented Apr 12, 2018

Are we absolutely sure of that? I would find it quite implausible that there are no more than 14 HGSCs with at least one missense mutation. 95% of them are TP53 mutated, right?

@dhimmel
Member

dhimmel commented Apr 12, 2018

We do include both Red and Blue mutations - my mistake

The source code says:

# The next cell specifies which mutations to preserve as gene-affecting, which were chosen according to the red & blue [mutation effects in Xena](http://xena.ucsc.edu/how-we-characterize-mutations/).
# In[19]:
mutations = {
    'Frame_Shift_Del',
    'Frame_Shift_Ins',
    'In_Frame_Del',
    'In_Frame_Ins',
    'Missense_Mutation',
    'Nonsense_Mutation',
    'Nonstop_Mutation',
    'RNA',
    'Splice_Site',
    'Translation_Start_Site',
}
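
As a rough illustration of how a set like this acts as a filter, the sketch below applies it to a toy table of calls. The table, its column names (`sample_id`, `gene`, `effect`) and the sample IDs are assumptions made up for the example; they are not taken from the notebook.

```python
import pandas as pd

# Effect labels kept as gene-affecting (copied from the set quoted above).
mutations = {
    'Frame_Shift_Del', 'Frame_Shift_Ins', 'In_Frame_Del', 'In_Frame_Ins',
    'Missense_Mutation', 'Nonsense_Mutation', 'Nonstop_Mutation',
    'RNA', 'Splice_Site', 'Translation_Start_Site',
}

# Toy call table; column names and values are assumptions for illustration.
calls_df = pd.DataFrame({
    'sample_id': ['S1', 'S1', 'S2', 'S3'],
    'gene': ['TP53', 'KRAS', 'TP53', 'BRCA1'],
    'effect': ['Missense_Mutation', 'Silent', 'Intron', 'Nonsense_Mutation'],
})

# Keep only calls whose effect is in the gene-affecting set.
kept_df = calls_df[calls_df['effect'].isin(mutations)]

# Samples whose calls were all filtered out disappear from the table entirely.
dropped_samples = set(calls_df['sample_id']) - set(kept_df['sample_id'])
print(kept_df)
print(dropped_samples)
```

Note that `Missense_Mutation` is in the set, so missense calls are retained by this filter. In the toy data, sample `S2` has only an intronic call and therefore vanishes after filtering.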

@cgreene
Member

cgreene commented Apr 12, 2018

@dhimmel
Member

dhimmel commented Apr 12, 2018

Take a look at the source for constructing the mutation matrix:

# Create a sample (rows) by gene (columns) matrix of mutation status
gene_mutation_mat_df = (gene_mutation_df
    .pivot_table(index='sample_id',
                 columns='entrez_gene_id',
                 values='count',
                 fill_value=0)
    .astype(bool).astype(int)
)

So the reason we exclude samples with no mutations is that, unless a sample has at least one mutation, we don't actually know whether it has mutation calls. mc3.v0.2.8.PUBLIC.xena.tsv.gz contains only mutations and does not record sequenced samples that have zero mutations.

@gwaygenomics do you know a workaround?
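
To make the problem concrete, here is a minimal sketch on toy data. It shows that the pivot only produces rows for samples that appear in the long-format table, and that reindexing against a list of sequenced samples restores the missing ones as all-zero rows. The list `sequenced_samples` is hypothetical: the reindex is only justified if such a list can actually be obtained from somewhere other than the mutation file.

```python
import pandas as pd

# Toy long-format table of gene-affecting calls (one row per sample-gene pair).
gene_mutation_df = pd.DataFrame({
    'sample_id': ['S1', 'S1', 'S3'],
    'entrez_gene_id': [7157, 3845, 7157],
    'count': [2, 1, 1],
})

# Hypothetical list of every sequenced sample; S2 has no gene-affecting calls.
sequenced_samples = ['S1', 'S2', 'S3']

gene_mutation_mat_df = (
    gene_mutation_df
    .pivot_table(index='sample_id', columns='entrez_gene_id',
                 values='count', fill_value=0)
    .astype(bool).astype(int)
)

# S2 is absent from the pivot; reindexing adds it back as an all-zero row.
complete_mat_df = gene_mutation_mat_df.reindex(sequenced_samples, fill_value=0)
print(complete_mat_df)
```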

@dhimmel
Member

dhimmel commented Apr 12, 2018

According to 2.TCGA-process.ipynb the mutations that are excluded (exclusion by omission of inclusion) are the following types:

{"3'Flank", "3'UTR", "5'Flank", "5'UTR", 'Intron', 'Silent', 'large deletion'}

Will chat with @gwaygenomics about whether there's a cBioPortal discrepancy.

@gwaybio
Member Author

gwaybio commented Apr 12, 2018

just chatted with @dhimmel

Just looking at TP53 alone, it looks like there should be more than that:

We agree - definitely hinting at something being up. We also noticed a precompiled binary matrix file, which appears to be a new addition to Xena. Need to explore further, but this may save us from needing to process the data ourselves.

@gwaybio
Member Author

gwaybio commented Apr 12, 2018

Ok - in this binary matrix from Xena, they do rescue many OV samples.

import pandas as pd

xena_binary = pd.read_table('mc3.v0.2.8.PUBLIC.nonsilentGene.xena.gz', sep='\t', index_col=0)

# This clinical matrix as processed in https://github.com/cognoma/core-service/issues/99#issuecomment-380876551
ov_samples = clinmat_df.query("acronym == 'OV'").index.tolist()
ov_xena_df = xena_binary.loc[:, ov_samples].dropna(axis='columns')
ov_xena_df.shape

(40543, 62)

And, as a test, the TP53 counts look on target:

ov_xena_df.loc['TP53', :].value_counts()

1    54
0     8
Name: TP53, dtype: int64

I presume that this will rescue many other samples from other cancer-types as well.

@cgreene
Member

cgreene commented Apr 12, 2018

Still only 62 samples that make it through? That still seems incredibly low. This means - if I understand correctly - that what we are saying is that there are hundreds of ovarian cancers with no mutations in the blue and red category. Am I understanding this correctly?

@cgreene
Member

cgreene commented Apr 12, 2018

Oh - wait - as I'm thinking about it - are these the ovarian cancer samples that were subject to whole genome amplification and thus where we think the calls may be problematic? I think there was a paper on this. Are the dropouts for other cancers as bad?

@gwaygenomics : does this match the dataset used in the TP53 classifier paper?

@gwaybio
Member Author

gwaybio commented Apr 12, 2018

are these the ovarian cancer samples that were subject to whole genome amplification and thus where we think the calls may be problematic?

Yeah, I think this is part of the reason why they're filtered (quite stringently) here.

Are the dropouts for other cancers as bad?

I will have to check exact numbers when I'm back at my desk, but I do think it impacted other cancer-types. Although I think OV will end up being the most drastic.

@gwaygenomics : does this match the dataset used in the TP53 classifier paper?

We dropped OV from training because of the TP53 status imbalance, but we were still able to make predictions on the full gene expression dataset. See Figure S6 of that paper. Our predictions align with the cBioPortal link posted previously in this thread!

@gwaybio
Member Author

gwaybio commented Apr 12, 2018

After thinking for a bit, I think it may be best for cognoma to use the binary matrix compiled by Xena and get the intersection of datasets (as we had been doing previously). This is simpler, reduces processing requirements, and contains high confidence calls. We will also need to emphasize where the data is coming from and how it's processed on the cognoma homepage, and also return downloading scripts when the classifier is emailed back to the user.

The alternative would be to include less confident calls as mutation events, which, if I am remembering correctly, we did in Figure S6. This is a legitimate option since it retains more samples, and there is some (although less confident) evidence the mutations are real in the sample. As @dhimmel pointed out, it would be better to throw these samples out (taking the intersection of datasets) than to assume they have zero mutations.

@dhimmel
Member

dhimmel commented Apr 12, 2018

Just looking at TP53 alone, it looks like there should be more than that

For ovarian cancer and TP53, there are 11 positives and 3 negatives in the aligned dataset (gene expression and mutation data). However, in the complete data, there are 54 positives and 8 negatives. So I think the issue here is that many ovarian cancer samples are missing from the expression dataset.

Note that we make complete data available, but it doesn't help with cognoma classifiers.
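
The difference between "complete" and "aligned" counts can be sketched as an index intersection. The example below uses made-up sample IDs and values, and the variable names are illustrative rather than the notebook's own.

```python
import pandas as pd

# Toy matrices indexed by sample; all values and sample IDs are made up.
mutation_df = pd.DataFrame({'TP53': [1, 1, 0, 1]},
                           index=['S1', 'S2', 'S3', 'S4'])
expression_df = pd.DataFrame({'GENE_A': [5.1, 3.2]}, index=['S2', 'S3'])

# "Aligned" means restricted to samples present in both datasets.
shared = mutation_df.index.intersection(expression_df.index)
aligned_mutation_df = mutation_df.loc[shared]

complete_counts = mutation_df['TP53'].value_counts()
aligned_counts = aligned_mutation_df['TP53'].value_counts()
print(complete_counts)
print(aligned_counts)
```

Here the complete mutation data has three positives and one negative, but only one of each survives alignment because the expression matrix covers just two samples.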

I think it may be best for cognoma to use the binary matrix compiled by xena and get the intersection of datasets

One issue IIRC with the binary matrix is that it requires us to map symbols to Entrez gene IDs without chromosome information, which reduces our ability to map.
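
A small sketch of why symbol-only mapping loses genes: without extra information to disambiguate, a symbol that maps to more than one ID (or to none) has to be dropped. The mapping table below is hypothetical; `AMBIG1`, `UNKNOWN1` and the IDs 111 and 222 are invented placeholders, and dropping ambiguous symbols is one possible policy, not necessarily what the project does.

```python
import pandas as pd

# Hypothetical symbol-to-Entrez table; AMBIG1 maps to two invented IDs.
symbol_map_df = pd.DataFrame({
    'symbol': ['TP53', 'KRAS', 'AMBIG1', 'AMBIG1'],
    'entrez_gene_id': [7157, 3845, 111, 222],
})

# Without chromosome information, keep only symbols with exactly one ID.
unambiguous = symbol_map_df.drop_duplicates('symbol', keep=False)
symbol_to_entrez = unambiguous.set_index('symbol')['entrez_gene_id']

# Toy gene-by-sample binary matrix keyed by symbol.
binary_df = pd.DataFrame(
    {'S1': [1, 0, 1, 1], 'S2': [0, 0, 1, 0]},
    index=['TP53', 'KRAS', 'AMBIG1', 'UNKNOWN1'],
)

mappable = binary_df.index.intersection(symbol_to_entrez.index)
mapped_df = binary_df.loc[mappable].rename(index=symbol_to_entrez.to_dict())
lost_symbols = sorted(set(binary_df.index) - set(mappable))
print(mapped_df)
print(lost_symbols)
```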

@dhimmel
Member

dhimmel commented Apr 12, 2018

I think it's a mistake to remove samples because they have no mutations

The Xena matrix mc3.v0.2.8.PUBLIC.nonsilentGene.xena.gz contains 9,104 samples with mutation calls. A quick summary of the number of mutations per sample is below:

>>> import pandas
>>> url = 'https://pancanatlas.xenahubs.net/download/mc3.v0.2.8.PUBLIC.nonsilentGene.xena.gz'
>>> df = pandas.read_table(url, index_col='sample')
>>> df.sum(axis='rows').describe()
count    9104.000000
mean      171.219025
std       519.085213
min         0.000000
25%        26.000000
50%        55.000000
75%       126.000000
max      8354.000000

Hence, some samples have zero mutations in this dataset.

According to 2.TCGA-process.ipynb, we identify 9104 samples in mc3.v0.2.8.PUBLIC.xena.tsv.gz. I'm preparing a pull request to keep samples with no blue or red mutations in our mutation dataset. This will increase the complete mutation matrix to 9104 samples from 9093 previously. Thus the impact is small, but worth fixing I think.
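
For reference, the zero-mutation samples can be listed directly from a matrix in this orientation (genes as rows, samples as columns). The sketch below uses a toy matrix rather than the real download, so the sample IDs and values are made up.

```python
import pandas as pd

# Toy gene-by-sample binary matrix in the same orientation as the Xena file.
df = pd.DataFrame(
    {'S1': [1, 1, 0], 'S2': [0, 0, 0], 'S3': [0, 1, 0]},
    index=pd.Index(['TP53', 'KRAS', 'BRCA1'], name='sample'),
)

# Summing down the rows gives the number of mutated genes per sample.
per_sample = df.sum(axis='rows')
zero_mutation_samples = per_sample[per_sample == 0].index.tolist()
print(zero_mutation_samples)
```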

dhimmel added a commit to dhimmel/cancer-data that referenced this issue Apr 12, 2018
dhimmel added a commit that referenced this issue Apr 16, 2018
Merges #44

* Retain zero-mutation samples
Refs #43 (comment)
Increases the complete mutation matrix to 9104 samples from 9093 and the aligned mutation matrix to 8397 from 8388

* Create data/diseases.tsv with summary info
Supercedes #45
@gwaybio
Member Author

gwaybio commented Apr 16, 2018

For ovarian cancer and TP53, there are 11 positives and 3 negatives in the aligned dataset (gene expression and mutation data). However, in the complete data, there are 54 positives and 8 negatives. So I think the issue here is that many ovarian cancer samples are missing from the expression dataset.

Is this based on the previous expression dataset processing? If I remember correctly, we were removing samples with NA values and many ovarian samples had some.

@dhimmel
Member

dhimmel commented Apr 16, 2018

Is this based on the previous expression dataset processing?

No, on the current one. See the latest diseases.tsv: n_samples = 14 and n_mutation_samples = 62 for ovarian serous cystadenocarcinoma.

@gwaybio
Member Author

gwaybio commented Apr 16, 2018

Alright, so after thinking some more about this (and based on input from @cgreene ) we should decide to process mutation data for cognoma based on what we think is the right answer, plus what is maintainable. Our options, as far as I see them (from least to most conservative) are:

  1. All public (non germline) MC3 mutation calls without filtration. The data is posted in the GDC.
  2. All public (non germline) MC3 mutation calls with Xena "red" and "blue" filtration. When creating the binary matrix, a sample x gene pair is considered mutated if there is any evidence of a "deleterious mutation".
    • This permits some estimation of OV and LAML mutation calls. Without this step, nearly all OV and all LAML samples are removed.
  3. All public (non germline) MC3 mutation calls with the Xena-applied "PASS" filter. This data is posted as a binary expression matrix in Xena here.
  4. All public (non germline) MC3 mutation calls with the "PASS" filter and the "red" and "blue" filter. This is how cognoma is currently filtering.
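
The four options above can be thought of as two independent switches: whether to require a passing quality filter, and whether to require a gene-affecting effect. The sketch below is one way to express that on a toy call table. The helper `binarize`, the column names (`filter`, `effect`), and the non-passing label `wga` are assumptions for illustration and are not taken from the MC3 or Xena files.

```python
import pandas as pd

# Subset of gene-affecting effect labels, enough for this toy example.
GENE_AFFECTING = {'Missense_Mutation', 'Nonsense_Mutation', 'Splice_Site'}

# Toy call table; column names and the 'wga' label are hypothetical.
calls_df = pd.DataFrame({
    'sample_id': ['S1', 'S2', 'S3', 'S4'],
    'gene': ['TP53', 'TP53', 'KRAS', 'TP53'],
    'effect': ['Missense_Mutation', 'Missense_Mutation', 'Silent', 'Silent'],
    'filter': ['PASS', 'wga', 'PASS', 'wga'],
})

def binarize(calls, pass_only, gene_affecting_only):
    """Return a sample-by-gene 0/1 matrix under the chosen filters."""
    df = calls
    if pass_only:
        df = df[df['filter'] == 'PASS']
    if gene_affecting_only:
        df = df[df['effect'].isin(GENE_AFFECTING)]
    return pd.crosstab(df['sample_id'], df['gene']).astype(bool).astype(int)

# Options 1 to 4, from least to most conservative.
options = {
    1: binarize(calls_df, pass_only=False, gene_affecting_only=False),
    2: binarize(calls_df, pass_only=False, gene_affecting_only=True),
    3: binarize(calls_df, pass_only=True, gene_affecting_only=False),
    4: binarize(calls_df, pass_only=True, gene_affecting_only=True),
}
samples_kept = {k: sorted(m.index) for k, m in options.items()}
print(samples_kept)
```

In this toy data each added filter removes samples outright, which mirrors the trade-off discussed in the thread: stricter filtering gives higher-confidence positives but fewer samples with any recorded call.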

After chatting with @dhimmel - all are valid options (will depend on project hypotheses and input genes to be classified) but many will require additional maintenance overhead. We are not tied to Xena data, but removing this dependency will require substantial additional processing.

Since it is certainly valid to retain our current mechanism of creating a high confidence true positive binary matrix, and it requires the least amount of maintenance, I think we agreed to keep cognoma data this way for now. @dhimmel, is this an accurate description?

@dhimmel
Member

dhimmel commented Apr 16, 2018

Thanks @gwaygenomics for breaking down the four filtering options. I think it's helpful to understand what processing steps have gone into our mutation matrix. While different use cases will prefer different levels of processing, I think our current implementation of high-confidence calls (PASS filter) with probable effects (red or blue) is safe and versatile. By versatile, I mean suited to many downstream applications including Cognoma classifiers. By safe, I mean likely to avoid certain false conclusions, like associations with low-quality calls or silent mutations.

These decisions can always be revisited, but without clear evidence and demand from a downstream analysis to do so, I think our time is better spent elsewhere. Thus, I'll close this issue, and we can discuss the potential inclusion of metastases in #46.

@dhimmel dhimmel closed this as completed Apr 16, 2018