Current dataset seems to have an issue #99

kurtwheeler · 2018-04-11T19:10:27Z

While testing #98 I ran into the following issue while it was trying to load the Mutations table:

django.db.utils.IntegrityError: insert or update on table "mutations" violates foreign key constraint "mutations_sample_id_6a37562c_fk_samples_sample_id"
DETAIL:  Key (sample_id)=(TCGA-04-1348-01) is not present in table "samples".

Checking https://raw.githubusercontent.com/cognoma/cancer-data/master/data/samples.tsv for the string TCGA-04 turned up nothing, so either the samples table is missing data or the mutations table has rows it shouldn't.

The text was updated successfully, but these errors were encountered:

dhimmel · 2018-04-12T15:23:36Z

The following returns True. In the latest GitHub data, samples.tsv and mutation-matrix.tsv.bz2 are sample-aligned. This code requires a recent version of pandas that can read bz2 compressed files from URL.

import pandas

url = 'https://github.com/cognoma/cancer-data/raw/383668e12a80ccbcc75a4930023aed16afbd208b/data/samples.tsv'
sample_df = pandas.read_table(url)

url = 'https://github.com/cognoma/cancer-data/raw/383668e12a80ccbcc75a4930023aed16afbd208b/data/mutation-matrix.tsv.bz2'
mutation_df = pandas.read_table(url)

all(mutation_df.sample_id == sample_df.sample_id)

Let me check the figshare: https://doi.org/10.6084/m9.figshare.3487685.v7.

dhimmel · 2018-04-12T15:32:15Z

Checksums I compute locally for cognoma/cancer-data@383668e:

$ md5sum data/*.tsv.bz2 data/samples.tsv 
32d561ba90aa6efd2cf3fc29125118fc  data/expression-matrix.tsv.bz2
acd20fbc57ef4f61a515a89c09fb86a3  data/mutation-matrix.tsv.bz2
ef0b6dc40658f66934937e29c5749cca  data/samples.tsv

According to the figshare API, these checksums match the v7 figshare deposits. See https://api.figshare.com/v2/articles/3487685/versions/7

So I think the sample_ids in the raw data are aligned.

dhimmel · 2018-04-12T15:35:22Z

No matches from the following:

grep TCGA-04-1348-01 data/samples.tsv
grep TCGA-04-1348-01 data/mutation-matrix.tsv.bz2

So I guess the question is why is core-service requesting an ID that doesn't exist? Perhaps TCGA-04-1348-01 used to exist but no longer exists in v7?

gwaybio · 2018-04-12T17:03:08Z

hmm, this is interesting. I am glad that core-services caught this, even if it wasn't supposed to. I did some digging into why.

It looks like tumors with 04 acronyms (e.g. TCGA-04-1348-01) are high grade serous ovarian (HGSC) cancer tumors. HGSC has been of primary focus for me in past analyses.

Clinical Matrix all Primary Samples (that weren't redacted)

path = os.path.join('download', 'Survival_SupplementalTable_S1_20171025_xena_sp.tsv.gz')

# Mapping to rename and filter columns
renamer = collections.OrderedDict([
    ('sample', 'sample_id'),
    ('_PATIENT', 'patient_id'),
    ('cancer type abbreviation', 'acronym'),
    ('__placeholder__', 'disease'),
    ('age_at_initial_pathologic_diagnosis', 'age_diagnosed'),
    ('gender', 'gender'),
    ('race', 'race'),
    ('ajcc_pathologic_tumor_stage', 'ajcc_stage'),
    ('clinical_stage', 'clinical_stage'),
    ('histological_type', 'histological_type'),
    ('histological_grade', 'histological_grade'),
    ('initial_pathologic_dx_year', 'initial_pathologic_dx_year'),
    ('menopause_status', 'menopause_status'),
    ('birth_days_to', 'birth_days_to'),
    ('vital_status', 'vital_status'),
    ('tumor_status', 'tumor_status'),
    ('last_contact_days_to', 'last_contact_days_to'),
    ('death_days_to', 'death_days_to'),
    ('cause_of_death', 'cause_of_death'),
    ('new_tumor_event_type', 'new_tumor_event_type'),
    ('new_tumor_event_site', 'new_tumor_event_site'),
    ('new_tumor_event_site_other', 'new_tumor_event_site_other'),
    ('new_tumor_event_dx_days_to', 'days_recurrence_free'),
    ('treatment_outcome_first_course', 'treatment_outcome_first_course'),
    ('margin_status', 'margin_status'),
    ('residual_tumor', 'residual_tumor'),
    ('_EVENT', 'event_status'),
    ('_TIME_TO_EVENT', 'event_days'),
    ('OS', 'dead'),
    ('OS.time', 'days_survived'),
    ('DSS', 'disease_specific_survival_status'),
    ('DSS.time', 'disease_specific_survival_days'),
    ('DFI', 'disease_free_interval_status'),
    ('DFI.time', 'disease_free_interval_days'),
    ('PFI', 'progression_free_interval_status'),
    ('PFI.time', 'progression_free_interval_days'),
    ('Redaction', 'Redaction')
])

clinmat_df = (
    pandas.read_table(path)
    .rename(columns=renamer)
    .merge(disease_df, how='left')
    [list(renamer.values())]
    .set_index('sample_id', drop=False)
    .sort_values('sample_id')
)

# Fix capitalization of gender and race
clinmat_df.gender = clinmat_df.gender.str.title()
clinmat_df.race = clinmat_df.race.str.title()

# Extract sample-type with the code dictionary
clinmat_df = clinmat_df.assign(sample_type = clinmat_df.sample_id.str[-2:])
clinmat_df.sample_type = clinmat_df.sample_type.replace(sampletype_codes_dict)

# Remove "Redacted" tumors
# These patients either withdrew consent or had genome data mismatch errors
clinmat_df = clinmat_df.query('Redaction != "Redacted"').drop("Redaction", axis=1)

# Keep only these sample types
# filters duplicate samples per patient
sample_types = {
    'Primary Solid Tumor',
    'Primary Blood Derived Cancer - Peripheral Blood',
}
clinmat_df.query("sample_type in @sample_types", inplace=True)

Results in:

# Get acronym value counts
clinmat_df.acronym.value_counts()

BRCA    1092
GBM      588
UCEC     547
OV       537
KIRC     536
HNSC     528
LUAD     519
LGG      515
THCA     507
LUSC     504
PRAD     498
COAD     457
STAD     443
BLCA     409
LIHC     376
CESC     307
KIRP     291
SARC     261
LAML     200
ESCA     185
PAAD     185
PCPG     179
READ     166
TGCT     134
THYM     124
SKCM     108
ACC       92
MESO      87
UVM       80
KICH      66
UCS       57
DLBC      48
CHOL      36
Name: acronym, dtype: int64

Filtered Clinical Matrix (samples with mutation, RNAseq, and Clinical attributes)

sample_ids = sorted(clinmat_df.index & gene_mutation_mat_df.index & expr_df.index)

# Filter expression (x) and mutation (y) matrices for common samples
sample_df = clinmat_df.loc[sample_ids, :]

Results in:

sample_df.acronym.value_counts()

BRCA    787
LGG     510
LUAD    508
HNSC    499
PRAD    492
THCA    489
LUSC    477
UCEC    436
STAD    411
BLCA    403
KIRC    366
LIHC    357
COAD    287
CESC    286
KIRP    280
SARC    234
ESCA    183
PCPG    178
PAAD    168
GBM     149
TGCT    128
THYM    118
SKCM    103
READ     89
MESO     81
UVM      80
ACC      79
KICH     66
UCS      57
DLBC     37
CHOL     36
OV       14
Name: acronym, dtype: int64

OV (aka HGSC) goes from 537 to 14!

Why does OV drop so many samples?

My thought was that many of these OV tumors were filtered because they didn't have observed mutations. Indeed, here are all of the samples that were filtered from the mutation set.

sample_ids = set(clinmat_df.index) - set(gene_mutation_mat_df.index)

# Filter expression (x) and mutation (y) matrices for common samples
sample_df = clinmat_df.loc[sample_ids, :]
sample_df.acronym.value_counts()

OV      475
BRCA    303
GBM     279
LAML    200
COAD    169
KIRC    168
UCEC    100
READ     77
SARC     25
LUSC     24
HNSC     21
CESC     18
THCA     16
LIHC     14
DLBC     11
KIRP     10
PAAD     10
LUAD      7
MESO      6
TGCT      6
LGG       5
STAD      5
PRAD      5
SKCM      4
THYM      2
BLCA      2
PCPG      1
ESCA      1
Name: acronym, dtype: int64

Besides decimating the ovarian cancer sample set, this step also removes many additional tumors. My thought now is that we remove tumors without any "red" mutations. Relaxing this guideline a bit would be an easy win and could recover many samples, but there are a bit more things to consider here. I intend on following up with this search in cognoma/cancer-data

cgreene · 2018-04-12T17:39:32Z

I was wondering why our HGSC numbers were so low. Can you use the PanCancerAtlas criteria?

kurtwheeler · 2018-04-12T17:52:47Z

Ah well this explains it. I just reran the scripts without changing the locations of the data because I thought it had been updated in place.

@dhimmel can you confirm https://github.com/cognoma/cancer-data/raw/383668e12a80ccbcc75a4930023aed16afbd208b/data/mutation-matrix.tsv.bz2 is the location we should be pulling mutation data? Actually while you're at it can you check the rest of the urls in the acquiredata script?

Please remember that I have no context for why that script is the way it is nor why some of them are coming from raw.githubusercontent.com and some are coming from figshare. Likewise I have no idea what the relation between the figshare URLs you have posted and the raw.githubusercontent.com links you used in your python code above.

dhimmel · 2018-04-12T17:59:15Z

I just reran the scripts without changing the locations of the data because I thought it had been updated in place.

I support always using versioning such that the raw data URLs we use never point to changed versions of the data.

why some of them are coming from raw.githubusercontent.com and some are coming from figshare

In the past, we used fighsare to store large files, which did not fit in GitHub. However, we recently updated cancer-data to use Git LFS. As such, we can now download all files from GitHub. There is no need for the figshare. Therefore, I'd suggest switching everything to versioned GitHub URLs. All the files will be in this directory.

So in other words, we should either be using ALL figshare or ALL github URLs at this point. Given that uploading to figshare requires an extra step, we may want to switch to all github URLs, which will be easiest going forward.

You can add a single commit variable which we can change to control the source data version.

kurtwheeler · 2018-04-12T18:30:55Z

Cool, sounds good to me. I can point to that directory.

kurtwheeler · 2018-04-12T20:14:10Z

One issue with trying to use that directory: there's no diseases.tsv. Has this file been renamed?

dhimmel · 2018-04-12T20:49:33Z

One issue with trying to use that directory

diseases.tsv lives in the download directory. You can use the same commit hash, just change the path to download/diseases.tsv.

kurtwheeler · 2018-04-13T15:20:23Z

Is there a reason that's not consistent? I can make this Just Work with that being it's location, but it involves a tad bit extra code to correct for something that seems like it may not have been intentional.

dhimmel · 2018-04-13T16:11:10Z

Is there a reason that's not consistent

download is for inputs that we incorporate from external locations. data is for datasets we produce. Honestly, diseases.tsv fits in either location. Happy to move it if that will be helpful (after cognoma/cancer-data#44 is merged).

kurtwheeler · 2018-04-13T16:25:59Z

Ah okay, I don't care much one way or another, just didn't want to write code to workaround a bug. However while testing my code for the diseases.tsv I noticed that genes.tsv is also missing. Has that been renamed to expression-genes.tsv?

Refs cognoma/core-service#99 (comment)

dhimmel · 2018-04-13T17:45:29Z

Currently, genes.tsv is being retrieved from the cognoma/genes repository:

core-service/api/management/commands/acquiredata.py

Lines 31 to 34 in b9b2e4f

    
           gene_path = os.path.join(options['path'], 'genes.tsv') 
        
           if not os.path.exists(gene_path): 
        
               gene_url = 'https://raw.githubusercontent.com/cognoma/genes/master/data/genes.tsv' 
        
               urlretrieve(gene_url, gene_path)

If you continue to retrieve it from cognoma/genes, you should switch to using the version used by the version of cancer-data you consume. This commit is hardcoded in 0.genes-download.ipynb.

However, we'll need to look into how the genes table gets used. It may be possible that mutation-genes.tsv from cancer-data is more appropriate if we only care about genes in the context of mutations. mutation-genes.tsv contains one row per gene, but only for genes with mutation data.

dhimmel · 2018-04-16T14:12:35Z

New commit for cancer-data datasets: cognoma/cancer-data@93e4c53

kurtwheeler · 2018-04-25T19:58:28Z

Note to self, use this for diseases: https://github.com/cognoma/cancer-data/blob/master/mapping/diseases.tsv

gwaybio mentioned this issue Apr 12, 2018

Filtering samples is (potentially) too strict cognoma/cancer-data#43

Closed

dhimmel added a commit to dhimmel/cancer-data that referenced this issue Apr 13, 2018

Move diseases.tsv to data

5a87e90

Refs cognoma/core-service#99 (comment)

cgreene added the backlog label May 20, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Current dataset seems to have an issue #99

Current dataset seems to have an issue #99

kurtwheeler commented Apr 11, 2018

dhimmel commented Apr 12, 2018 •

edited

Loading

dhimmel commented Apr 12, 2018

dhimmel commented Apr 12, 2018

gwaybio commented Apr 12, 2018

cgreene commented Apr 12, 2018

kurtwheeler commented Apr 12, 2018

dhimmel commented Apr 12, 2018

kurtwheeler commented Apr 12, 2018

kurtwheeler commented Apr 12, 2018

dhimmel commented Apr 12, 2018

kurtwheeler commented Apr 13, 2018

dhimmel commented Apr 13, 2018 •

edited

Loading

kurtwheeler commented Apr 13, 2018

dhimmel commented Apr 13, 2018

dhimmel commented Apr 16, 2018

kurtwheeler commented Apr 25, 2018

Current dataset seems to have an issue #99

Current dataset seems to have an issue #99

Comments

kurtwheeler commented Apr 11, 2018

dhimmel commented Apr 12, 2018 • edited Loading

dhimmel commented Apr 12, 2018

dhimmel commented Apr 12, 2018

gwaybio commented Apr 12, 2018

Clinical Matrix all Primary Samples (that weren't redacted)

Filtered Clinical Matrix (samples with mutation, RNAseq, and Clinical attributes)

Why does OV drop so many samples?

cgreene commented Apr 12, 2018

kurtwheeler commented Apr 12, 2018

dhimmel commented Apr 12, 2018

kurtwheeler commented Apr 12, 2018

kurtwheeler commented Apr 12, 2018

dhimmel commented Apr 12, 2018

kurtwheeler commented Apr 13, 2018

dhimmel commented Apr 13, 2018 • edited Loading

kurtwheeler commented Apr 13, 2018

dhimmel commented Apr 13, 2018

dhimmel commented Apr 16, 2018

kurtwheeler commented Apr 25, 2018

dhimmel commented Apr 12, 2018 •

edited

Loading

dhimmel commented Apr 13, 2018 •

edited

Loading