Filtering samples is (potentially) too strict #43

gwaybio · 2018-04-12T17:09:52Z

Data is currently processed in https://github.com/cognoma/cancer-data/blob/master/2.TCGA-process.ipynb and the final matrices used in downstream analyses include samples that have mutation, expression, and clinical measurements and were not filtered for other reasons.

@kurtwheeler pointed out in cognoma/core-service#99 a potential issue that the current implementation is not finding samples it should. @dhimmel discovered that this was not an issue (at least not primarily an issue) of the backend, but of the data itself.

I outlined current problems with the data in cognoma/core-service#99 (comment) but we can continue this discussion here.

dhimmel · 2018-04-12T17:51:37Z

So the issue is that:

tumors were filtered because they didn't have observed mutations

My thought now is that we remove tumors without any "red" mutations. Relaxing this guideline a bit would be an easy win and could recover many samples

I think it's a mistake to remove samples because they have no mutations, as long as we know those samples are cancers (and not normal tissues... which I think we do). On the other hand, it was presumably me who implemented this filter. So why would I have done something like that (which seems now to be throwing away good data)?

Samples without any mutations can never be positives in the Cognoma ML framework... but they are important negatives. The fact that you can get the cancer without a mutation is of course something that we should model and not ignore.

cgreene · 2018-04-12T17:52:05Z

What is a "red" mutation?

dhimmel · 2018-04-12T17:53:00Z

What is a "red" mutation?

A mutation Xena considers to be severe. #2 (comment)

From http://xena.ucsc.edu/how-we-characterize-mutations/

Red (=1) --> indicates that a non-silent somatic mutation (nonsense, missense, frame-shift indels, splice site mutations, stop codon readthroughs, change of start codon, inframe indels) was identified in the protein coding region of a gene, or any mutation identified in a non-coding gene

cgreene · 2018-04-12T18:04:08Z

Oh - that is extremely conservative. Point mutations don't make it in (basically all the activating Ras mutations are point mutations). Does cognoma actually work for Ras? We should at least include Red and Blue.

gwaybio · 2018-04-12T18:07:44Z

We do include both Red and Blue mutations - my mistake

cgreene · 2018-04-12T18:11:18Z

Are we absolutely sure of that? I would find it quite implausible that there are no more than 14 HGSCs with at least one missense mutation. 95% of them are TP53 mutated, right?

dhimmel · 2018-04-12T18:11:36Z

We do include both Red and Blue mutations - my mistake

The source code says:

cancer-data/scripts/2.TCGA-process.py

Lines 245 to 261 in 383668e

    
    # The next cell specifies which mutations to preserve as gene-affecting, which were chosen according to the red & blue [mutation effects in Xena](http://xena.ucsc.edu/how-we-characterize-mutations/).

    # In[19]:

    mutations = {
        'Frame_Shift_Del',
        'Frame_Shift_Ins',
        'In_Frame_Del',
        'In_Frame_Ins',
        'Missense_Mutation',
        'Nonsense_Mutation',
        'Nonstop_Mutation',
        'RNA',
        'Splice_Site',
        'Translation_Start_Site',
    }
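To make the effect of this set concrete, here is a minimal sketch of how such a set might be applied to a long-format mutation table. The toy data and the column names (`sample_id`, `gene`, `effect`) are assumptions for illustration, not taken from 2.TCGA-process.py.

```python
import pandas as pd

# Effect types treated as gene-affecting (copied from the set above).
mutations = {
    'Frame_Shift_Del', 'Frame_Shift_Ins', 'In_Frame_Del', 'In_Frame_Ins',
    'Missense_Mutation', 'Nonsense_Mutation', 'Nonstop_Mutation', 'RNA',
    'Splice_Site', 'Translation_Start_Site',
}

# Toy stand-in for the mutation table; column names are assumptions.
toy_mutation_df = pd.DataFrame({
    'sample_id': ['S1', 'S1', 'S2', 'S3'],
    'gene': ['TP53', 'KRAS', 'TP53', 'BRCA1'],
    'effect': ['Missense_Mutation', 'Silent', 'Nonsense_Mutation', "3'UTR"],
})

# Keep only rows whose effect is in the preserved set.
kept_df = toy_mutation_df[toy_mutation_df['effect'].isin(mutations)]
print(kept_df)
```

In this toy table, sample S3 only has a 3'UTR call, so it no longer appears anywhere in the filtered table.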

cgreene · 2018-04-12T18:14:14Z

I think it's likely that someone's going to have to walk through this to verify that mutations is being used as intended. Just looking at TP53 alone, it looks like there should be more than that:
http://www.cbioportal.org/index.do?cancer_study_id=ov_tcga_pub&Z_SCORE_THRESHOLD=2.0&RPPA_SCORE_THRESHOLD=2.0&data_priority=0&case_set_id=ov_tcga_pub_cna_seq&gene_list=TP53&geneset_list=+&tab_index=tab_visualize&Action=Submit&genetic_profile_ids_PROFILE_MUTATION_EXTENDED=ov_tcga_pub_mutations&genetic_profile_ids_PROFILE_COPY_NUMBER_ALTERATION=ov_tcga_pub_gistic

dhimmel · 2018-04-12T18:14:14Z

Take a look at the source for constructing the mutation matrix:

cancer-data/scripts/2.TCGA-process.py

Lines 288 to 295 in 383668e

    
    # Create a sample (rows) by gene (columns) matrix of mutation status
    gene_mutation_mat_df = (gene_mutation_df
        .pivot_table(index='sample_id',
                     columns='entrez_gene_id',
                     values='count',
                     fill_value=0)
        .astype(bool).astype(int)
    )

So the reason we exclude samples with no mutations is that, unless a sample has at least one mutation, we don't actually know whether mutation calling was performed on it. mc3.v0.2.8.PUBLIC.xena.tsv.gz only contains mutations and does not record sequenced samples that have zero mutations.

@gwaygenomics do you know a workaround?
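One possible shape for a workaround, sketched on toy data: if a list of sequenced samples were available from some other source, the pivoted matrix could be reindexed against it so that samples without calls become all-zero rows. The `sequenced_samples` list here is hypothetical; as noted above, the MC3 file itself does not provide it.

```python
import pandas as pd

# Toy long-format table: one row per (sample, gene) with a mutation count.
toy_gene_mutation_df = pd.DataFrame({
    'sample_id': ['S1', 'S1', 'S2'],
    'entrez_gene_id': [7157, 3845, 7157],
    'count': [2, 1, 1],
})

# Hypothetical list of all sequenced samples; S3 has no rows in the table.
sequenced_samples = ['S1', 'S2', 'S3']

# Same pivot as above: pivot_table only produces rows for samples it sees.
mat_df = (toy_gene_mutation_df
    .pivot_table(index='sample_id', columns='entrez_gene_id',
                 values='count', fill_value=0)
    .astype(bool).astype(int)
)

# Reindexing against the known sample list adds S3 as an all-zero row.
full_mat_df = mat_df.reindex(sequenced_samples, fill_value=0)
print(full_mat_df)
```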

dhimmel · 2018-04-12T18:17:19Z

According to 2.TCGA-process.ipynb the mutations that are excluded (exclusion by omission of inclusion) are the following types:

{"3'Flank", "3'UTR", "5'Flank", "5'UTR", 'Intron', 'Silent', 'large deletion'}
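"Exclusion by omission" amounts to a set difference between the effect types present in the data and the preserved set. A small sketch, where `observed_effects` is a hypothetical stand-in for the distinct values of the effect column:

```python
# Effect types kept as gene-affecting (the set quoted earlier in this thread).
kept_effects = {
    'Frame_Shift_Del', 'Frame_Shift_Ins', 'In_Frame_Del', 'In_Frame_Ins',
    'Missense_Mutation', 'Nonsense_Mutation', 'Nonstop_Mutation', 'RNA',
    'Splice_Site', 'Translation_Start_Site',
}

# Effect types reported as excluded in the comment above.
reported_excluded = {
    "3'Flank", "3'UTR", "5'Flank", "5'UTR", 'Intron', 'Silent', 'large deletion',
}

# Hypothetical stand-in for the distinct effect values found in the data.
observed_effects = kept_effects | reported_excluded

# Anything observed but not explicitly kept is dropped.
excluded_effects = observed_effects - kept_effects
print(sorted(excluded_effects))
```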

Will chat with @gwaygenomics re whether there's a cBioPortal discrepancy.

gwaybio · 2018-04-12T18:51:53Z

just chatted with @dhimmel

Just looking at TP53 alone, it looks like there should be more than that:

We agree - definitely hinting at something being up. We also noticed the addition of a precompiled binary matrix file. This appears to be a new addition to xena. Need to explore further, but this may save us from needing to process ourselves

gwaybio · 2018-04-12T19:19:29Z

Ok - in this binary matrix from Xena, they do rescue many OV samples.

import pandas as pd

xena_binary = pd.read_table('mc3.v0.2.8.PUBLIC.nonsilentGene.xena.gz', sep='\t', index_col=0)

# This clinical matrix as processed in https://github.com/cognoma/core-service/issues/99#issuecomment-380876551
ov_samples = clinmat_df.query("acronym == 'OV'").index.tolist()
ov_xena_df = xena_binary.loc[:, ov_samples].dropna(axis='columns')
ov_xena_df.shape

(40543, 62)

And, as a test, the TP53 counts look on target:

ov_xena_df.loc['TP53', :].value_counts()

1    54
0     8
Name: TP53, dtype: int64

I presume that this will rescue many other samples from other cancer-types as well.

cgreene · 2018-04-12T20:00:29Z

Still only 62 samples that make it through? That still seems incredibly low. This means - if I understand correctly - that what we are saying is that there are hundreds of ovarian cancers with no mutations in the blue and red category. Am I understanding this correctly?

cgreene · 2018-04-12T20:01:53Z

Oh - wait - as I'm thinking about it - are these the ovarian cancer samples that were subject to whole genome amplification and thus where we think the calls may be problematic? I think there was a paper on this. Are the dropouts for other cancers as bad?

@gwaygenomics : does this match the dataset used in the TP53 classifier paper?

gwaybio · 2018-04-12T20:24:59Z

are these the ovarian cancer samples that were subject to whole genome amplification and thus where we think the calls may be problematic?

Yeah, I think this is part of the reason why they're filtered (quite stringently) here.

Are the dropouts for other cancers as bad?

I will have to check exact numbers when I'm back at my desk, but I do think it impacted other cancer-types. Although I think OV will end up being the most drastic.

@gwaygenomics : does this match the dataset used in the TP53 classifier paper?

We dropped OV from training because of the TP53 status imbalance, but we were still able to make predictions on the full gene expression dataset. See Figure S6 of that paper. Our predictions align with the cBioPortal link posted previously in this thread!

gwaybio · 2018-04-12T21:05:02Z

After thinking for a bit, I think it may be best for cognoma to use the binary matrix compiled by xena and get the intersection of datasets (as we had been doing previously). This is simpler, reduces processing requirements, and contains high confidence calls. We will also need to emphasize where the data is coming from and how it's processed on the cognoma homepage, and also return downloading scripts when the classifier is emailed back to the user.

The alternative would be to include less confident calls as mutation events, which, if I am remembering correctly, we did in Figure S6. This is a legitimate option since it retains more samples, and there is some (although less confident) evidence the mutations are real in the sample. As @dhimmel pointed out, it would be better to throw these samples out (taking the intersection of datasets) than to assume they have zero mutations.

dhimmel · 2018-04-12T21:23:25Z

Just looking at TP53 alone, it looks like there should be more than that

For ovarian cancer and TP53, there are 11 positives and 3 negatives in the aligned dataset (gene expression and mutation data). However, in the complete data, there are 54 positives and 8 negatives. So I think the issue here is that many ovarian cancer samples are missing from the expression dataset.

Note that we make complete data available, but it doesn't help with cognoma classifiers.
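The difference between the complete and aligned counts comes from intersecting sample sets. A toy sketch of that comparison follows; the sample IDs and values are made up and are not the real OV data.

```python
import pandas as pd

# Toy mutation status for one gene across all samples with mutation calls.
mutation_status = pd.Series(
    [1, 1, 0, 1, 0],
    index=['S1', 'S2', 'S3', 'S4', 'S5'],
    name='TP53',
)

# Toy set of samples that also have expression data.
expression_samples = pd.Index(['S2', 'S3', 'S6'])

# "Aligned" means present in both datasets.
aligned_samples = mutation_status.index.intersection(expression_samples)
aligned_status = mutation_status.loc[aligned_samples]

# Positives (1) and negatives (0) before and after alignment.
complete_counts = mutation_status.value_counts()
aligned_counts = aligned_status.value_counts()
print(complete_counts)
print(aligned_counts)
```

Samples missing from the expression data drop out of the aligned counts regardless of their mutation status.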

I think it may be best for cognoma to use the binary matrix compiled by xena and get the intersection of datasets

One issue IIRC with the binary matrix is that it requires us to map symbols to Entrez gene IDs without chromosome information, which reduces our ability to map.

dhimmel · 2018-04-12T21:59:30Z

I think it's a mistake to remove samples because they have no mutations

The Xena matrix mc3.v0.2.8.PUBLIC.nonsilentGene.xena.gz contains 9,104 samples with mutation calls. A quick summary of the number of mutations per sample is below:

>>> import pandas
>>> url = 'https://pancanatlas.xenahubs.net/download/mc3.v0.2.8.PUBLIC.nonsilentGene.xena.gz'
>>> df = pandas.read_table(url, index_col='sample')
>>> df.sum(axis='rows').describe()
count    9104.000000
mean      171.219025
std       519.085213
min         0.000000
25%        26.000000
50%        55.000000
75%       126.000000
max      8354.000000

Hence, some samples have zero mutations in this dataset.

According to 2.TCGA-process.ipynb, we identify 9104 samples in mc3.v0.2.8.PUBLIC.xena.tsv.gz. I'm preparing a pull request to keep samples with no blue or red mutations in our mutation dataset. This will increase the complete mutation matrix to 9104 samples from 9093 previously. Thus the impact is small, but worth fixing I think.
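For reference, a small sketch of how zero-mutation samples could be listed from a gene-by-sample binary matrix laid out like the Xena file (genes as rows, samples as columns). The toy matrix below is an assumption for illustration only.

```python
import pandas as pd

# Toy gene (rows) by sample (columns) binary matrix.
toy_df = pd.DataFrame(
    {'S1': [1, 0, 1], 'S2': [0, 0, 0], 'S3': [0, 1, 0]},
    index=pd.Index(['TP53', 'KRAS', 'BRCA1'], name='sample'),
)

# Sum down the rows to get the number of mutated genes per sample
# (axis=0 is equivalent to axis='rows' used above).
mutations_per_sample = toy_df.sum(axis=0)

# Samples whose column is all zeros.
zero_mutation_samples = (
    mutations_per_sample[mutations_per_sample == 0].index.tolist()
)
print(zero_mutation_samples)
```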

Refs cognoma#43 (comment)

Merges #44
* Retain zero-mutation samples. Refs #43 (comment). Increases the complete mutation matrix to 9104 samples from 9093 and the aligned mutation matrix to 8397 from 8388
* Create data/diseases.tsv with summary info. Supersedes #45

gwaybio · 2018-04-16T15:50:42Z

For ovarian cancer and TP53, there are 11 positives and 3 negatives in the aligned dataset (gene expression and mutation data). However, in the complete data, there are 54 positives and 8 negatives. So I think the issue here is that many ovarian cancer samples are missing from the expression dataset.

Is this based on the previous expression dataset processing? If I remember correctly, we were removing samples with NA values and many ovarian samples had some.

dhimmel · 2018-04-16T16:03:59Z

Is this based on the previous expression dataset processing?

No, on the current one. See the latest diseases.tsv: n_samples = 14 and n_mutation_samples = 62 for ovarian serous cystadenocarcinoma.

gwaybio · 2018-04-16T16:41:54Z

Alright, so after thinking some more about this (and based on input from @cgreene ) we should decide to process mutation data for cognoma based on what we think is the right answer, plus what is maintainable. Our options, as far as I see them (from least to most conservative) are:

1. All public (non-germline) MC3 mutation calls without filtration. The data is posted in the GDC.
2. All public (non-germline) MC3 mutation calls with Xena "red" and "blue" filtration. When creating the binary matrix, a sample x gene pair is considered mutated if there is any evidence that there is a "deleterious mutation".
   - This permits some estimation of OV and LAML mutation calls. Without this step, nearly all OV and all LAML samples are removed.
3. All public (non-germline) MC3 mutation calls with the Xena-applied "PASS" filter. This data is posted as a binary expression matrix in Xena here.
4. All public (non-germline) MC3 mutation calls with the PASS filter and the "red" and "blue" filter. This is how cognoma is currently filtering.
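The four options differ only in which of two filters are switched on, which can be sketched as a single function with two flags. This is an illustrative sketch, not the cancer-data implementation: the column names (`effect`, `filter`), the `wga` flag value, and the toy calls are assumptions.

```python
import pandas as pd

# Effect types corresponding to the "red" and "blue" set quoted earlier.
EFFECTS = {
    'Frame_Shift_Del', 'Frame_Shift_Ins', 'In_Frame_Del', 'In_Frame_Ins',
    'Missense_Mutation', 'Nonsense_Mutation', 'Nonstop_Mutation', 'RNA',
    'Splice_Site', 'Translation_Start_Site',
}


def filter_calls(calls_df, require_pass=False, require_effect=False):
    """Return the calls that survive the selected filters."""
    keep = pd.Series(True, index=calls_df.index)
    if require_pass:
        # Hypothetical column holding the call-quality flag.
        keep &= calls_df['filter'] == 'PASS'
    if require_effect:
        keep &= calls_df['effect'].isin(EFFECTS)
    return calls_df[keep]


# Toy calls; S3 carries a non-PASS flag and S2 is a silent mutation.
toy_calls_df = pd.DataFrame({
    'sample_id': ['S1', 'S2', 'S3', 'S4'],
    'effect': ['Missense_Mutation', 'Silent',
               'Missense_Mutation', 'Nonsense_Mutation'],
    'filter': ['PASS', 'PASS', 'wga', 'PASS'],
})

# Options 1 to 4 from the list above, least to most conservative.
options = {
    1: dict(require_pass=False, require_effect=False),
    2: dict(require_pass=False, require_effect=True),
    3: dict(require_pass=True, require_effect=False),
    4: dict(require_pass=True, require_effect=True),
}
for number, flags in options.items():
    print(number, len(filter_calls(toy_calls_df, **flags)))
```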

After chatting with @dhimmel - all are valid options (will depend on project hypotheses and input genes to be classified) but many will require additional maintenance overhead. We are not tied to Xena data, but removing this dependency will require substantial additional processing.

Since it is certainly valid to retain our current mechanism of creating a high confidence true positive binary matrix, and it requires the least amount of maintenance, I think we agreed to keep cognoma data this way for now. @dhimmel, is this an accurate description?

dhimmel · 2018-04-16T17:08:43Z

Thanks @gwaygenomics for breaking down the four filtering options. I think it's helpful to understand what processing steps have gone into our mutation matrix. While different use cases will prefer different levels of processing, I think our current implementation of high-confidence calls (pass filter) with probable effects (red or blue) is safe and versatile. By versatile, I mean suited to many downstream applications including Cognoma classifiers. By safe, I mean likely to avoid certain false conclusions, like associations with low-quality calls or silent mutations.

These decisions can always be revisited, but without clear evidence and demand from a downstream analysis to do so, I think our time is better spent elsewhere. Thus, I'll close this issue, and we can discuss the potential inclusion of metastases in #46.

gwaybio changed the title ~~Filtering samples is (potentially) to strict~~ Filtering samples is (potentially) too strict Apr 12, 2018

dhimmel added a commit to dhimmel/cancer-data that referenced this issue Apr 12, 2018

Retain zero-mutation samples

77d8b79

Refs cognoma#43 (comment)

dhimmel mentioned this issue Apr 12, 2018

Retain zero-mutation samples #44

Merged

dhimmel closed this as completed Apr 16, 2018
