ROSMAP #4

Open
Aryllen opened this issue Sep 29, 2020 · 27 comments
Labels
curation issue related to curation or cleaning of AD portal data

Comments

@Aryllen
Collaborator

Aryllen commented Sep 29, 2020

Study folder: syn3219045

We can expand these checks to be more specific, or mark them off/remove them if they are not relevant.

Folder Structure

  • Top 3 folders are Analysis, Data, Staging
  • Top level folders within Data are based on DataType with the exception of Metadata
  • Metadata folder is within Data
  • Staging folder is clean (old data in a subfolder -- Archived)

Metadata (within file)
Checks for each metadata file:

file exists
file name follows schema
contents follow current template - deprecate old versions, if needed
no duplicate individualID/specimenID as appropriate
follows data dictionary

  • ChIPseq
    • metadata file exists and follows naming schema
    • column names follow current template
    • no duplicate specimenID
    • all specimenIDs in biospecimen metadata
    • values follow data dictionary guidelines
  • methylation
    • metadata file exists and follows naming schema
    • column names follow current template
    • no duplicate specimenID
    • all specimenIDs in biospecimen metadata
    • values follow data dictionary guidelines
  • single cell RNAseq
    • metadata file exists and follows naming schema
    • column names follow current template
    • no duplicate specimenID
    • all specimenIDs in biospecimen metadata
    • values follow data dictionary guidelines
  • RNAseq
    • metadata file exists and follows naming schema
    • column names follow current template
    • no duplicate specimenID
    • all specimenIDs in biospecimen metadata
    • values follow data dictionary guidelines
  • rnaArray
    • metadata file exists and follows naming schema
    • column names follow current template
    • no duplicate specimenID
    • all specimenIDs in biospecimen metadata
    • values follow data dictionary guidelines
  • nanostring (miRNAcounts)
    • metadata file exists and follows naming schema
    • column names follow current template
    • no duplicate specimenID
    • all specimenIDs in biospecimen metadata
    • values follow data dictionary guidelines
  • snpArray (GWAS)
    • metadata file exists and follows naming schema
    • column names follow current template
    • no duplicate specimenID
    • all specimenIDs in biospecimen metadata
    • values follow data dictionary guidelines
  • WGS
    • metadata file exists and follows naming schema
    • column names follow current template
    • no duplicate specimenID
    • all specimenIDs in biospecimen metadata
    • values follow data dictionary guidelines
  • TMT quantitation
    • metadata file exists and follows naming schema
    • column names follow current template
    • no duplicate specimenID
    • all specimenIDs in biospecimen metadata
    • values follow data dictionary guidelines
  • proteomics (SRM)
    • metadata file exists and follows naming schema
    • column names follow current template
    • no duplicate specimenID
    • all specimenIDs in biospecimen metadata
    • values follow data dictionary guidelines
  • individual
    • no duplicate individualID
    • all individualIDs in biospecimen metadata
  • biospecimen
    • metadata file exists and follows naming schema
    • column names follow current template
    • all specimenIDs mapped to assays
    • all individualIDs in individual metadata
    • values follow data dictionary guidelines

Metadata (across files)

  • There are duplicate individualID/specimenID for different individuals/specimens, but they are unique within assay (for the moment). Due to the duplicates, need to filter biospecimen by assay or join assay and biospecimen by both specimenID and assay (if the assay has this column).
  • release approved metadata. For new files, move to metadata folder. For updates to existing files, upload as new version.
  • deprecate old covariate files that are being replaced with metadata files.
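A minimal sketch of the join described in the first bullet above, in plain Python for illustration. The function name and the toy rows are hypothetical; only the column names (specimenID, assay, individualID) come from this issue. It assumes the biospecimen file carries an assay column and falls back to the assay being joined when the assay file has none.

```python
def join_assay_to_biospecimen(assay_rows, biospecimen_rows, assay_name):
    """Left-join assay metadata to biospecimen rows on (specimenID, assay)."""
    # Index biospecimen rows by the composite key so a specimenID that is
    # reused by another assay cannot be matched by accident.
    by_key = {(row["specimenID"], row["assay"]): row for row in biospecimen_rows}
    joined = []
    for row in assay_rows:
        # Use the assay file's own 'assay' column when it has one.
        key = (row["specimenID"], row.get("assay", assay_name))
        match = by_key.get(key, {})
        joined.append({**match, **row})
    return joined


# Toy data (made up): the same specimenID is used by two assays.
biospecimen = [
    {"specimenID": "S1", "assay": "rnaSeq", "individualID": "R001"},
    {"specimenID": "S1", "assay": "snpArray", "individualID": "R002"},
]
rna_seq = [{"specimenID": "S1", "platform": "HiSeq"}]

result = join_assay_to_biospecimen(rna_seq, biospecimen, "rnaSeq")
print(result[0]["individualID"])  # R001
```

Joining on specimenID alone would keep whichever biospecimen row was indexed last, which is the failure mode the composite key avoids.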

Annotations

  • WGS
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • snpArray (GWAS)
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • RNAseq
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • single cell RNAseq
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • proteomics (SRM)
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • TMT quantitation
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • confocal imaging
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • methylation
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • nanostring (miRNAcounts)
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • rnaArray
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • ChIPseq
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary

Multispecimen Files
Check that specimenIDs in files match IDs in metadata.

  • Assume multispecimen files can be matched up to either specimenID, individualID, or projid.
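One way to run this check, sketched in Python. It assumes the IDs have already been pulled out of the multispecimen file (for example from its header row) and that the metadata rows are dicts; the function name and sample values are hypothetical.

```python
def best_id_column(file_ids, metadata_rows,
                   candidates=("specimenID", "individualID", "projid")):
    """Return (column, unmatched IDs) for the column matching the most file IDs."""
    file_ids = set(file_ids)
    best = None
    for column in candidates:
        known = {str(row[column]) for row in metadata_rows if row.get(column)}
        unmatched = sorted(file_ids - known)
        # Keep the candidate column that leaves the fewest IDs unmatched.
        if best is None or len(unmatched) < len(best[1]):
            best = (column, unmatched)
    return best


# Toy metadata (made up).
metadata = [
    {"specimenID": "S1", "individualID": "R1", "projid": "00012345"},
    {"specimenID": "S2", "individualID": "R2", "projid": "00067890"},
]
header_ids = ["00012345", "00067890", "99999999"]

print(best_id_column(header_ids, metadata))  # ('projid', ['99999999'])
```

The unmatched list is the part worth reviewing by hand: an empty list means every ID in the file maps to metadata under that column.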

Wikis

  • appear up to date
  • remove references to deprecated files (after deprecating)
  • are in correct location (on dataType folder)
  • are referenced in portal Study table

Clinical data

  • Braak and CERAD are available for donors with postmortem tissue
    • When Yan gives us the latest version of the file, that will be the most up to date that we can get for those patients.
  • Permission to use Braak and CERAD to generate Dx (AD, NCI, Other) for data contributor

Access (Human)

  • Add any special access needs/fixes/checks here

Portal

  • Review content on the study card for accuracy
  • Review text formatting and 'Show More' section: ### for header, bold for sub-headers, Show More section broken up in a consistent manner on the card
  • Related studies are linked
  • Study has an acknowledgement statement (wikis here) #29
@Aryllen
Collaborator Author

Aryllen commented Nov 17, 2020

Notes:

  • micro RNA array will use nanostring template. Need new assay 'miRNAcounts'.
  • RNA array needs new template. Base uploaded here.

@Aryllen
Collaborator Author

Aryllen commented Nov 18, 2020

Question: Is this metabolomics folder supposed to be empty? Can it be deleted?

Question: Imaging metadata? Answer: Don't add.

Updates:

  • Started preparing for going through the existing metadata and creating new metadata.
  • Cleaned Staging folder:
    • Moved Vilas' data into a 2020_11 folder to keep organized better.
    • Moved all folders with data into Archived folder. Concern: I know why the data in one of the folders is still in staging (metadata that was uploaded as a new version of the existing metadata). However, there is other data in the staging area that may need to be reviewed for whether it should have been/needs to be released.

rnaSeq

  • all specimenIDs in annotations and biospecimen metadata
  • all required annotations exist, but were/are not necessarily correct
    • updated isMultiSpecimen, analysisType, isModelSystem, species.
    • grant number is missing on 1574 files. I have tried querying for them, but I believe I ran into a limit on the number of 'IN' values or on the query's character length. Here is a partial query, and the list of synIDs is uploaded as a text file here.
    • diagnosis missing on some files. Will update this one later.
    • sex and individualID missing on two files: "syn4213070", "syn4213093". Will update this later.
    • notes on annots to check that I haven't yet: platform, readLength, runType, libraryPrep. These have missing values. Need to see if they are in metadata or can be pulled into meta/annots via descriptions.
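A possible workaround for the 'IN' limit mentioned above is to split the synID list into batches and run one query per batch. This is only a sketch: the view ID, the batch size of 200, and the placeholder synIDs are assumptions, and the SQL-like string is an illustration of the idea, not a confirmed query syntax for the file view.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_queries(view_id, syn_ids, batch_size=200):
    """Build one 'IN' query string per batch of synIDs."""
    queries = []
    for batch in chunked(syn_ids, batch_size):
        id_list = ", ".join(f"'{syn_id}'" for syn_id in batch)
        queries.append(f"SELECT id, grant FROM {view_id} WHERE id IN ({id_list})")
    return queries


# Placeholder IDs standing in for the 1574 files with a missing grant.
placeholder_ids = [f"syn{n}" for n in range(1574)]
queries = build_queries("syn00000000", placeholder_ids, batch_size=200)
print(len(queries))  # 8
```

Each query string can then be run separately and the results concatenated.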

@Aryllen
Collaborator Author

Aryllen commented Nov 19, 2020

Removed tasks related to metabolomics. This assay will be moved out of the main ROSMAP study and will not count toward 'clean' completion.

@Aryllen
Collaborator Author

Aryllen commented Nov 21, 2020

rnaSeq
Problem specimen in RNAseq: 492_120515 from batch 1

  • RIN: 9.1, 9.140738 -- most likely a sig-fig issue. Should both be kept (9.1 seems more reasonable)?
  • libraryBatch: 0, 6, 7 -- which batch is this one in??
  • sequencingBatch: 0, 6, 7 -- same question ^

Fixed duplicates in rnaSeq metadata. The problem specimen above has all of its values in each column for now and is in the first row of the file. Uploaded to cleaning folder here.

  • Note: We have requested missing info multiple times. Currently assuming this is the extent of the information we will get. However, we should still ask, again. It might help to split off specific questions (e.g. for these ids, what is the libraryPreparationMethod and platform?, etc).

ChIPseq

Problem specimen: 11464261

  • sequencingBatch: 16, 59 -- all other information identical; which batch? Currently have it as "16, 59" in new metadata file.

Updates:

  • Covariate file only had specimenID and ChIPBatch. Chose to rename ChIPBatch to sequencingBatch and added to metadata. Can change later if this isn't the correct term.
  • specimenIDs in view were "specimenID.bam"
  • specimenIDs were projids. Both the filenames and the specimenIDs have the projid in them. Mapped specimenID to individualID. Updated annotations, but not filenames. Made mapped individualIDs the specimenID in the assay metadata. Assay metadata will probably conflict with the biospecimen file since it's RIDs, not projids.
  • Uploaded new version of ChIPseq metadata to cleaning folder here.

Multispecimen files (will most likely not be using specimenIDs since these would be RIDs when the 'clean' metadata is uploaded):

  • syn17016846
  • syn17016845
  • syn17016844

@Aryllen
Collaborator Author

Aryllen commented Nov 21, 2020

TMT proteomics

  • all specimenIDs in biospecimen metadata
  • metadata file looks complete.
  • fixed column name specimenID (had capital S). Uploaded new version to current version since change was minor.
  • all files are multispecimen. Unsure what the naming scheme is inside them.
  • Question: metadata file has SampleID, which appears to be the batchChannel except for format. batchChannel has a period "b#.#", while SampleID has an underscore "b#_#". Should we just leave this as is?
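To confirm that SampleID really is batchChannel with a different separator, a quick row-by-row comparison would do. A sketch in Python, with made-up values in the "b#.#" and "b#_#" formats; the function name is hypothetical.

```python
def channel_mismatches(rows):
    """Return rows where SampleID is not batchChannel with '.' swapped for '_'."""
    return [
        row for row in rows
        if row["batchChannel"].replace(".", "_") != row["SampleID"]
    ]


# Toy rows (made up): the last one disagrees on purpose.
rows = [
    {"batchChannel": "b1.1", "SampleID": "b1_1"},
    {"batchChannel": "b1.2", "SampleID": "b1_2"},
    {"batchChannel": "b2.3", "SampleID": "b2_4"},
]

print(channel_mismatches(rows))
```

If the list comes back empty for the real file, SampleID carries no extra information and could be dropped or left in place without risk.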

scrnaSeq

  • all specimenIDs in biospecimen metadata and annotations
  • fixed dictionary terms.
  • removed duplicated specimenIDs. It appeared to be a mistake. Almost all had identical information with the exception of libraryBatch of 1 and some columns being empty.
  • uploaded new version here.
  • multispecimen files follow same naming scheme for specimenIDs.

snpArray

  • all files are multispecimen
  • all specimenIDs in biospecimen metadata
  • removed duplicate IDs
  • added missing keys
  • batch 2 is missing platform. It should be Illumina HumanOmniExpress. I put "Illumina_HumanOmniExpress", but will still need to add the term to the dictionary. Source?: https://www.illumina.com/documents/products/product_information_sheets/product_info_humanomniexpress.pdf
  • Uploaded new version here.
  • Question: There is a discrepancy between assay description and assay metadata. We have 1708 specimenIDs for the first batch, but it says there should be 1709.

WGS

  • Question: description says libraryPreparationMethod is KAPA Hyper Library Preparation Kit. Need source and to add to dictionary and metadata.
  • Added missing keys
  • Added runType and readLength from description
  • Should all be multispecimen files
  • All specimenIDs in biospecimen
  • NOTE: specimenIDs that begin with ROS or MAP are followed by the projid. Since these are not multispecimen files, this should be okay. However, should double check where this could be a problem.
  • Uploaded new version of file here.

methylationArray

  • no duplicated specimenIDs
  • updated column names from covariate file to match template
  • not all specimenIDs in biospecimen metadata
    • missing: PT-M5QO, TBI-AUTO73128-PT-2ZXU, PT-M5QS, TBI-AUTO73300-PT-317T, TBI-AUTO72965-PT-318P, PT-BZBS, PT-BZRQ, PT-BZBH
  • all specimenIDs in annotations are in assay metadata
    • all missing specimenIDs above are not in the annotations, either. What are they for?
  • added platform from description
  • Question: There are columns batch, Sample_Well, Sample_Plate that are in the file. I am not sure which 'batch' this is, either arrayBatch or dnaBatch. Should we keep the other two columns, as well?
  • uploaded new version here.
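The "all specimenIDs in biospecimen metadata" check used throughout these notes is a set difference. A small sketch in Python; the function name is hypothetical, and the example reuses two of the missing IDs listed above alongside a made-up one.

```python
def missing_specimens(assay_rows, biospecimen_rows):
    """Return the assay specimenIDs that are absent from biospecimen metadata."""
    known = {row["specimenID"] for row in biospecimen_rows}
    return sorted({row["specimenID"] for row in assay_rows} - known)


# 'SPEC-A' is a made-up ID that exists in both files.
assay = [
    {"specimenID": "SPEC-A"},
    {"specimenID": "PT-M5QO"},
    {"specimenID": "PT-BZBS"},
]
biospecimen = [{"specimenID": "SPEC-A"}]

print(missing_specimens(assay, biospecimen))  # ['PT-BZBS', 'PT-M5QO']
```

Swapping the arguments gives the reverse check: biospecimen rows that no assay file refers to.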

miRNAcounts

  • need platform source for NanoString nCounter miRNA expression assay to add to dictionary and metadata.
  • not sure what 'plate' would map to in these keys; left in.
  • not all specimenIDs in biospecimen metadata
    • 173 missing, assuming the ones not missing are not duplicated across assays
    • I also checked out the gct file in the past and was able to map the specimenIDs in the header to those in the biospecimen file. So will need to figure out what's up with the IDs.
  • added column 'assay', even though it's not in the template. Question: mRNAcounts is an assay we have. It's Sage defined. Should this be adequate? If not, will need to change in metadata and add new term.
  • all files are multispecimen.
  • metadata file uploaded here.
  • no duplicated specimenIDs (anymore).
  • one specimenID had 10 rows with 8 different plate numbers. This has been collapsed into a single row with all plate numbers. Will need to figure out which is correct.
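For reference, collapsing duplicate rows into a single row that keeps every distinct value could look like the following. This is a sketch, not the code that was actually used; the IDs, plate numbers, and the function name are made up.

```python
def collapse_duplicates(rows, key="specimenID"):
    """Merge rows sharing `key`, joining distinct non-empty values with ', '."""
    collapsed = {}
    for row in rows:
        merged = collapsed.setdefault(row[key], {})
        for column, value in row.items():
            seen = merged.setdefault(column, [])
            text = "" if value is None else str(value)
            # Skip blanks and values already recorded for this column.
            if text and text not in seen:
                seen.append(text)
    return [
        {column: ", ".join(seen) for column, seen in merged.items()}
        for merged in collapsed.values()
    ]


rows = [
    {"specimenID": "X1", "plate": "1", "tissue": "brain"},
    {"specimenID": "X1", "plate": "2", "tissue": "brain"},
    {"specimenID": "X1", "plate": "1", "tissue": ""},
    {"specimenID": "X2", "plate": "3", "tissue": "brain"},
]

collapsed = collapse_duplicates(rows)
print(collapsed[0])  # {'specimenID': 'X1', 'plate': '1, 2', 'tissue': 'brain'}
```

Any column that ends up with a comma in it is one where the duplicates disagreed, so those are the values that still need a decision.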

rnaArray

  • all files are multispecimen
  • got the specimenIDs from this file.
  • all specimenIDs use projids with either ROS_ or MAP_ prepended. Question: This is what is in the multispecimen files. We would have to change both the metadata specimenIDs and multispecimen file IDs.
  • Question: Which 'platform' should be in the metadata based on description? Illumina one looks more like libraryPreparationMethod and Little Dipper looks more like the platform. We don't have terms in this metadata file for libraryPreparationMethod, but could add. Once we figure out, need to add to metadata.
  • all specimenIDs are in biospecimen metadata, but will need to be checked for duplicates across assays.
  • uploaded metadata file here.

@Aryllen
Collaborator Author

Aryllen commented Nov 21, 2020

label free proteomics

  • Question: why is there a staging folder inaccessible to the public in this folder?
  • Pulled specimenIDs from these two files: 1, batch2
    • NOTE all specimenIDs are projids with the exception of 1 control. There is only 1 specimenID that is not a projid: "33416479". Will need to see if this is a missing individual. Again, will also need to discuss mapping to individualIDs, but the multispecimen files will still have projids.
  • Question: What is the platform ("LC-SRM experiments were performed on a nano ACQUITY UPLC coupled to TSQ Vantage MS instrument")? Need source and to add to dictionary/metadata.
  • Question: We have controlType (value: GIS), but not sure how to use that with their control information described in the assay wiki.
  • uploaded metadata file here.

@Aryllen
Collaborator Author

Aryllen commented Nov 23, 2020

General notes:

  • removed tasks related to confocal imaging metadata. According to Mette, this is only for one protein and making metadata for this is unnecessary. The 'metadata' portion of this is considered done.
  • updated individual metadata tasks to remove items that do not fit for ROSMAP. We are not making the individual metadata conform to the Sage dictionary, templates, or naming scheme.
  • the ChIPseq data will need to be renamed and the old versions removed. There is a chance that removing versions could break provenance somewhere, but given that a Sage team has not worked on this data, the chances are low.
  • for data that uses the projid, the assay and biospecimen metadata will have the R-ID as the individualID and, in cases where the specimenID is the projid, the R-ID will be the specimenID as well. The biospecimen metadata should have a column for the assay due to the large number of specimenIDs that will be shared/duplicated across assays. There should be notes in the metadata wiki about how to join the metadata files. However, there will also need to be a note as to how to find the correct subject referred to in the data since we are not updating multispecimen files to match the R-ID specimenIDs. Instead, users will need to join the metadata files and reference the projid.

@Aryllen
Collaborator Author

Aryllen commented Nov 24, 2020

Mapping IDs to projids

  • proteomics
    • Missing clinical information for a subject (found in proteomics metadata). Cannot map to RID.
    • No duplicate ids.
    • Uploaded newly mapped (except for the one mentioned above) file.
  • rnaArray
    • No problems or duplicate ids.
    • Uploaded newly mapped file.
  • snp
    • Missing clinical information for a subject (found in snpArray metadata). Cannot map to RID.
    • Noticed that column names 260/280 and 260/230 were being 'corrected'. Fixed these.
    • No duplicate specimenIDs.
    • Uploaded newly mapped file with the exception of the one individual mentioned above.
  • WGS
    • No problems or duplicate specimenIDs.
    • Column names were incorrect for the same ones as in snp. Fixed.
    • Uploaded newly mapped file.
  • rnaSeq
    • No problems or duplicate specimenIDs.
    • Uploaded newly mapped file.

@Aryllen
Collaborator Author

Aryllen commented Nov 25, 2020

biospecimen

This is a mess... There are more specimens in the ROSMAP biospecimen file than there are specimens in all of the assay metadata files (even without having all assay specimens), meaning there are either too many duplicates in the biospecimen file OR there are missing specimens in the assay metadata files.

  • Added column 'notes' for the assay type associated with specimen.
  • TMT controls have GISpool as the individualID. Can remove these, if desired.
  • Added 8 methylationArray specimens. Not sure if these were some type of control or actual specimens. individualID = NA. Note that only 708 individuals are mentioned in assay description, but there are at least 740 individuals, with 8 more that are unknown.
  • Overlap between 575 mRNAcounts and rnaSeq specimens. Treated rnaSeq specimens as the 'real' ones. The mRNAcounts were also from the same tissue so should have been identical to the ones used for rnaSeq. Made duplicate rows and labeled second set as mRNAcounts. However, this may not be true!
  • Added 173 non-overlapping mRNAcount specimens. Assumed all were specimens. Found tissue in the assay description. individualID = NA.
  • Proteomics, snpArray, WGS, ChIPseq, and rnaArray, have specimenID == individualID for at least some specimens. Honestly not sure what 'other' data to put for all of the ones that have individualID = specimenID. The assay descriptions are rather sparse. Instead, I just included them as rows and put a note saying what assay they were for.

ROSMAP_biospecimen_metadata_combined has all assay specimens. The number of specimens matches the total number of specimens in these assay files, with one exception for the single 'control' in proteomics.

@Aryllen
Collaborator Author

Aryllen commented Dec 4, 2020

  • Updated rnaArray, ChIPSeq, and proteomics files based on feedback from Mette.
  • Meeting set up with Jake to discuss ROSMAP metadata questions:
    • sample swaps -- Jake's QC work
    • Joining ROSMAP metadata -- is it logical? Would notes in biospecimen be better labeled as assay to match specimens better?

@Aryllen
Collaborator Author

Aryllen commented Dec 8, 2020

Stuff I did yesterday, but didn't click the 'comment' button on:

  • Updated ROSMAP_biospecimen_metadata_combined.csv with the sample swaps by:
    • removing individualID
    • adding column 'excludeReason' with 'sample swap'
  • Added microglia scrnaSeq to scrnaSeq metadata and biospecimen_combined metadata
    • note that the specimenIDs are projids with '_G' added. Need to update in annotations.
    • all specimens were run twice. There was no good way to distinguish this in the annotations. I appended a '_1' or '_2' to the specimenID. The lanes are specified in the filenames so can be linked for updating annotations.
    • There are some 'cellType' microglia in biospecimen metadata already.

Some of this needs to be fixed. Namely, I confused the microglia scrnaSeq with the bulk microglia rnaSeq. This data needs to be moved to the rnaSeq metadata and the biospecimen assay column updated with the correct term.

  • Fixed

Other update done today:

  • Removed the '2x' in front of readLength in rnaSeq metadata.

@Aryllen
Collaborator Author

Aryllen commented Dec 9, 2020

Question: I have 'excludeReason' in biospecimen. Should I also add the boolean 'exclude'?

yes and done

Question: This is related to Jake's concern. I was thinking this would be a big problem, but it is somewhat less so. The idea is that it could be hard to get the exact metadata set desired. For example, we have multiple sets of rnaSeq assays. We can join the metadata files by filtering biospecimen to just rnaSeq assay rows. However, that gives a bigger dataset than what was used in just one of those subsets (microglia, for example). According to Jake, bioinformatics professionals may not be great at joins or cleaning. It would potentially also help with reproducibility/transparency to be able to filter to the exact subset of data. My question is how much work do we want to do for the data users?

Probably best handled with an R package.

Question: Mette mentioned that there appears to be a duplication issue with dlpfc scRNAseq. I'm not following. These seem unique to me.

Misunderstanding. This is fine.

Question: For the FACS sorted bulk cell rnaSeq, I added _1 and _2 to the specimenID. The reasoning is in a comment above. This needs to be approved or improved before I change the annotations on these files.

Unnecessary. Just remove 1 and 2 and have the 10 unique specimenIDs. The users can determine lane by 1 and 2 in filenames.

Question: How should I be reading the WGS sample swap file? Path forward? Same question regarding "duplicate" file.

Pull in reasons for excluding, GQN, tissue and organ. Check that there are 17 individuals in our dataset with 2 samples each (there are).
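The "2 samples each" check can be done by counting rows per individualID. A minimal sketch in Python with made-up IDs; the function name is hypothetical.

```python
from collections import Counter


def individuals_with_n_samples(rows, n=2):
    """Return the individualIDs that appear on exactly `n` rows."""
    counts = Counter(row["individualID"] for row in rows if row.get("individualID"))
    return sorted(ind for ind, count in counts.items() if count == n)


# Toy WGS rows (made up): R1 has two samples, R2 has one.
rows = [
    {"specimenID": "A", "individualID": "R1"},
    {"specimenID": "B", "individualID": "R1"},
    {"specimenID": "C", "individualID": "R2"},
    {"specimenID": "D", "individualID": ""},
]

print(individuals_with_n_samples(rows))  # ['R1']
```

On the real file, the length of the returned list is the number to compare against the expected 17.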

Question: There was a question asked about the one specimen with multiple values in the rnaSeq metadata. This was mentioned in a previous comment above, but this will need to be cleared up with whoever is responsible for that data. It is unknown if that sample was run 3 times or if it was accidentally entered 3 times with different values.

Need to ask Yan Li about this.

Question: Do we even want to mess with multispecimen files at all? Many that I have seen use the projid, which can be found via metadata. The 'annoyance' with these is leading 0's, which is something Jake also mentioned. But overall, there seems to be hesitation with changing ROSMAP data at all so should we consider multispecimen files 'clean'?

Nope. Leave them be.

Question: Can I delete this empty folder? Was there supposed to have been data here?

Deleted.

Question: What's with this Staging folder in Proteomics (SRM)?

Move to deprecated.

Question: May I update wiki links for the portal as I finish updating wikis (same question for Mayo) or does there need to be an approval process? The updates are only formatting and merging wikis that should be together.

Yes. Follow guide in our 1:1 notes for merging wikis and order in Portal.

@Aryllen
Collaborator Author

Aryllen commented Dec 11, 2020

  • Updated biospecimen and WGS metadata to add in information from WGS_sample_qc file (to be deprecated).
  • Padded projids in clinical file to 8 characters with leading 0s.
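For anyone repeating the padding step, a sketch in Python. It assumes projid may arrive as either a number or a string; the function name and the example value are made up.

```python
def pad_projid(projid, width=8):
    """Left-pad a projid with zeros to a fixed width, returning a string."""
    return str(projid).strip().zfill(width)


print(pad_projid(1234567))     # 01234567
print(pad_projid("00012345"))  # 00012345
```

The padded values only survive if the column is read and written as text; a tool that parses projid as a number will drop the leading zeros again.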

@Aryllen
Collaborator Author

Aryllen commented Dec 11, 2020

  • Updated biospecimen metadata for two specimens with missing individualIDs to have exclude = true and excludeReason = did not finish assessments before death.
  • Question: Some deceased individuals are missing Braak/CERAD. Should we ask for this information? Anything else?

Contacted John Gibbons to request missing info.

  • Question: WGS duplicates file is annotated with DrocNseq study name? Is this a problem if I deprecate it? It's in ROSMAP...
  • Fixed specimenIDs on microglia rnaSeq data.

Annotations
NOTE: I realized that this would be simpler to do once we had all the metadata approved versus before. Stopped after checking a couple folders in WGS. Will need to come back to this.
WGS

  • Need to add platform, readLength, runType when approved
  • Do not have data for libraryPrep
  • This folder says it's all organ = brain, but there are both brain and blood tissues/cells in the WGS metadata. Question: What should we put?

@Aryllen
Collaborator Author

Aryllen commented Dec 17, 2020

  • Updated wiki formatting and locations
  • Linked methods in portal with specific order laid out in Mette/Nicole 1:1 notes.
  • Updated related studies in portal.
  • Opened Jira ticket to change AR on folders (put on Data/Staging, remove from main study folder). Analysis folder will need to be moved into ROSMAP after AR is changed.
    • Update: AR resolved. Moved analysis folder to the correct location.

While the checkboxes in the main issue are items that should be completed once we get metadata confirmation, I am adding general reminders here.

  • Annotation audit + fix
  • Deprecation of unnecessary files and removing of references in wikis to deprecated files

Notes:

  • Metabolomics data has not been touched except for minor wiki formatting and the below folder renaming. This will need to be a new issue for moving into separate project and cleaning up.
    • I noticed that we have Metabolomics (bile acids - brain). This should be the assay type, not metabolite type. I believe the assay should be UPLC-MS and updated.

@Aryllen
Collaborator Author

Aryllen commented Feb 17, 2021

Had a meeting with Mette, Abby, and Yan. Yan said there was probably no one there that could check out the metadata and verify that it was good. He mentioned that our metadata was probably better than what they could provide anyway. With this information, we are going to release the new metadata files.

There is one outstanding issue in the rnaSeq metadata where one specimen has 3 batches. Still need to determine which batch they should be in.

Released:

  • ChIPSeq (new version)
  • methylationArray (moved)
  • nanostring (moved)
  • proteomics (moved)
  • rnaArray (moved)
  • rnaSeq (new version)
  • scrnaSeq (new version + updated name)
  • snpArray (new version)
  • wholeGenomeSeq (new version)
  • biospecimen (new version)

Updated naming on TMT quantitation.

Covariates to deprecate (?):

  • ChIPseq
  • arrayMethylation
  • arraymiRNA

@avanlinden
Collaborator

@Aryllen when you get a minute can you give me edit privileges on this repo? That way I can edit your original comment to check off boxes and such. Thanks!

@avanlinden
Collaborator

avanlinden commented Feb 23, 2021

Annotations

  • WGS
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • snpArray (GWAS)
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • RNAseq
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • single cell RNAseq
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • proteomics (SRM)
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • TMT quantitation
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • confocal imaging
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • methylation
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • nanostring (miRNAcounts)
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • rnaArray
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary
  • ChIPseq
    • match metadata information
    • are complete - remove unnecessary annotations, if needed
    • follow data dictionary

@avanlinden
Collaborator

I've been checking through the updated biospecimen and assay metadata to make sure I have all the info I will need to update annotations. There are a few studies that look good and are ready to go, and a few that I have questions on. Questions and issues for each set of data are outlined here in this doc.

In the meantime I'll start annotating the RNAseq and scRNAseq files.

@avanlinden
Collaborator

@Aryllen There are ~190 specimens in the updated biospecimen metadata file that are missing individualIDs but do NOT have an exclude = true tag and are NOT pooled samples... I read through all your previous notes but couldn't find anything about this many missing individualIDs. Here's the breakdown by assay:
[Screenshot: breakdown by assay of specimens missing individualID]

Are these just missing? Do I need to try to find these individualIDs somewhere?

@avanlinden
Collaborator

Notes on bulk RNAseq annotations:

  • removed extraneous annotations (PMI, RIN, analysisType, isConsortiumAnalysis, isStranded, libraryPrepMethod, normalizationType)
  • added nucleicAcidSource = bulk cells for bulk brain and sorted cells for monocytes and microglia
  • removed "analysisType" annotation for fastq files
  • updated the grant key in the fileview schema to be a stringList column, so files generated under two grants have multivalue annotations

To do:

  • need to change diagnosis values of "other" to empty
  • had an issue reuploading annotations where isMultiSpecimen and isModelSystem changes didn't take; need to redo
  • about 1500 files do not have grant information; they are the bulk brain files from all three brain tissue types. Not all bulk brain files are missing grant -- about 1400 do have grant info, split evenly between the two single grants R01AG15819 and U01AG046152. I don't know how to find grant information for these -- ask Mette

@Aryllen
Collaborator Author

Aryllen commented Mar 5, 2021

@avanlinden, I think I mistyped a comment in my notes way up above there, which probably contributed to missing this. Sorry! I believe the solution is to check out this deprecated file. This is most likely a leading 0 problem on projid. The other deprecated covariate files are in that same area ([deprecated ROSMAP](https://www.synapse.org/#!Synapse:syn20682034)).

@avanlinden
Collaborator

@Aryllen Oh yep, those are them. Thank you! I will get them joined up just for completeness' sake and upload a new version.

@avanlinden
Collaborator

Bulk RNAseq annotations are as complete as I can get them:

  • changed diagnosis = "Other" to NA
  • fixed isMultiSpec and isModelSys values to be consistent
  • removed "analysisType = sequence alignment" for fastq files since they are resourceType = experimentalData
  • standardized capitalization of "Human"

Remaining issues:

  • could not find the two rnaSeq specimenIDs with missing individualIDs and no exclude criteria in any of the deprecated covariate files
  • can't determine what the grant number should be for ~1500 files missing grant annotation
  • bulk brain files from the "batch 1 contribution" were sequenced on a HiSeq according to the assay description, but the version is not specified (HiSeq2000 vs HiSeq2500), so I can't annotate these files with platform

Moving on to another assay.

@avanlinden
Collaborator

rnaArray annotations are as complete as currently possible:

  • the assay metadata for these files is sparse; just the specimenIDs, assay, and platform. However, the files have annotations including isStranded, runType, readLength, etc, which I cannot confirm either in the metadata files or unstructured metadata. Decided just to leave as is
  • no grant information on these file annotations
  • added isModelSystem = False, all others ok

@avanlinden
Collaborator

confocal imaging annotation updates are done:

  • added isModelSystem = False
  • removed nucleicAcidSource annotation; target of the assay is a transcription factor and does not involve nucleic acid

@avanlinden
Collaborator

scrnaSeq annotations are done, with one remaining question about diagnosis:

  • added isModelSystem = false
  • removed "Read" as an annotation
  • added missing sex information from clinical file
  • added platform, runType, and readLength annotations
  • changed diagnosis = Other to NA

remaining issues:

  • readLength is specified for the UMI + transcript, example "26/98". This is also in the assay file. I don't know what our protocol for this is (just the transcript? add them together?) but this is something that will fail when we are validating annotations with json schemas in the future because it's a character, not a number. I left as is for now
  • using the criteria here for diagnosis led to fairly extreme mismatches between "our" diagnosis and the metadata diagnosis/annotation diagnosis. Leaving for now, will confer with Mette. Probably a bigger problem with all ROSMAP files
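To see how many rows would trip a numeric readLength check, something like the following could be run against the assay metadata. This is a plain-Python stand-in, not the JSON schema validation itself; the function name and the specimenIDs are made up, and "101" is an arbitrary passing value.

```python
def non_numeric_read_lengths(rows):
    """Return (specimenID, readLength) pairs whose readLength is not a plain integer."""
    flagged = []
    for row in rows:
        text = str(row.get("readLength", "")).strip()
        # Missing values are flagged too, since they are not integers either.
        if not (text.isascii() and text.isdigit()):
            flagged.append((row["specimenID"], text))
    return flagged


rows = [
    {"specimenID": "A", "readLength": "101"},
    {"specimenID": "B", "readLength": "26/98"},
]

print(non_numeric_read_lengths(rows))  # [('B', '26/98')]
```

Whatever the protocol ends up being for the UMI + transcript case, the flagged list shows which rows it would need to be applied to.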

@avanlinden avanlinden added the curation issue related to curation or cleaning of AD portal data label Nov 10, 2021

2 participants