Pancreas list - dcp to tier 1 #1283

Open
arschat opened this issue Jul 30, 2024 · 6 comments
Labels
HCA, operations (This issue is an operational task)

Comments

arschat commented Jul 30, 2024

Lucia reached out to us to help with tier 1 of pancreas.

The pancreas team has had some troubles progressing their work in the last few months for a whole bunch of reasons. To move forward with the blessing of the bionetwork, they are manually collecting their tier 1 metadata (as much as possible) for publications.

Is there any way they could easily leverage the wrangling you have already done to get the partially filled metadata sheets to data contributors to fill in the gaps?

Using the DCP to Tier 1 notebook, we can generate tier 1 spreadsheets.

arschat added the operations label Jul 30, 2024

arschat commented Jul 30, 2024

We have successfully generated spreadsheets for 17 out of 20 projects in the pancreas list here using the notebooks mentioned.
However, a check revealed inconsistencies in the donor numbers (2 projects), and most of the projects used accessions as IDs. These would most definitely not match the contributors' Tier 1 IDs, so we asked the integration leads to handle any Tier 1 IDs that may already exist in the matrices; we could then map the IDs using the names.
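For anyone repeating the check, here is a minimal sketch of how accession-style IDs could be flagged automatically. The accession prefixes in the pattern are an assumption (illustrative, not exhaustive), and the helper names are hypothetical; only the field names `donor_id`, `sample_id` and `cs_id` come from this thread.

```python
import re

# Assumed, non-exhaustive prefixes of public-archive accessions
# (GEO series/samples, SRA/ENA runs and samples, BioSample, ArrayExpress).
ACCESSION_RE = re.compile(
    r"^(GSE|GSM|SRR|SRX|SRS|SRP|ERR|ERX|ERS|SAMN|SAMEA|E-MTAB-)\d+$"
)


def looks_like_accession(value):
    """Return True if an ID looks like an archive accession, not a contributor ID."""
    return bool(ACCESSION_RE.match(value.strip()))


def flag_accession_ids(rows, id_fields=("donor_id", "sample_id", "cs_id")):
    """For each ID field, collect the values that look like accessions.

    `rows` is an iterable of dicts, one per metadata row (hypothetical layout).
    """
    flagged = {field: set() for field in id_fields}
    for row in rows:
        for field in id_fields:
            value = row.get(field, "")
            if value and looks_like_accession(value):
                flagged[field].add(value)
    return flagged
```

Any field with a non-empty set in the result would need its IDs mapped by name instead.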

list of projects & inconsistencies

| project name | problem | percent of required Tier 1 |
| --- | --- | --- |
| CPmicroEnvironment | inconsistent donor list (12 in publ, 34 in dcp) | 76 |
| dcp5924379466506983 | inconsistent donor list (25 in dbgap, 39 in dcp) | 62 |
| EastAsianPancreaticIslets | accession for donor_id, sample_id, cs_id (GSE97655) | 69 |
| Faryabi-Human-10x3pv2 | accession for cs_id, ID in name | 72 |
| GarciaOcana-Human-10x3pv3 | accession for sample_id, cs_id | 69 |
| Healthy_and_type_2_diabetes_pancreas | | #N/A |
| HealthyAndDiabeticPancreas | accession for sample_id, cs_id (GSE101207), different donor_id than figure 2A | 66 |
| Herrera-Human-10x3pv3 | accession for donor_id, sample_id, cs_id (GSE150724) | 69 |
| Herrera-Human-10x3pv3 | accession for cs_id (GSM2194190) | 69 |
| HumanEndocrinePancreas | generic cs_id (GSE81608) | 72 |
| HumanIsletType2Diabetes | | 72 |
| HumanMousePancreas | prefix in donor_id, s. accession in cs_id, id in name | 62 |
| HumanT2DPancreas | accession for sample_id, cs_id (GSE198623) | 72 |
| Lickert-Human-10x3pv2 | accession for sample_id, cs_id (GSE85241), id in name | 69 |
| pancreasCelSeq2 | accession for sample_id, cs_id (GSE114297) | 72 |
| pancreasNormalIslets | accession for sample_id, cs_id (GSE183568), id in GEO name & CS name | 76 |
| scHumanPancreaticIslets | accession for cs_id (GSE81547), id in GEO name | 69 |
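The thread does not say how the "percent of required Tier 1" column was calculated. As a rough sketch, assuming it is the share of required fields that hold a non-empty value, a per-record version could look like this. The function name and the list of required fields passed in are hypothetical.

```python
def percent_required_filled(record, required_fields):
    """Rounded percentage of required fields that hold a non-empty value.

    `record` is a dict of field name -> value; `required_fields` is the
    caller-supplied list of required Tier 1 field names (an assumption here).
    """
    if not required_fields:
        raise ValueError("required_fields must not be empty")
    filled = 0
    for field in required_fields:
        value = record.get(field)
        # Missing keys, None and whitespace-only strings count as unfilled.
        if value is not None and str(value).strip() != "":
            filled += 1
    return round(100 * filled / len(required_fields))


# Usage with an illustrative subset of field names mentioned in this thread.
example_required = ["donor_id", "sample_id", "tissue_type", "manner_of_death"]
example_record = {
    "donor_id": "D1",
    "sample_id": "S1",
    "tissue_type": "tissue",
    "manner_of_death": "",
}
example_percent = percent_required_filled(example_record, example_required)
```

Here three of the four illustrative fields are filled, so `example_percent` is 75.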


arschat commented Aug 16, 2024

Waiting for Sara Jimènez and Shrey Parikh to reply with the concatenated datasets and the appropriate IDs.


arschat commented Aug 28, 2024

Sara and Shrey provided us with the contacting sheet they use.

| Project Name | Project UUID | Study | DOI | DCP Ingest |
| --- | --- | --- | --- | --- |
| CPmicroEnvironment | c5ca43aa-3b2b-4216-8eb3-f57adcbc99a1 | GSE165045 | 10.1136/gutjnl-2021-324546 | DCP Ingest |
| dcp5924379466506983 | bcdf233f-9246-4c0c-9843-0514120b7e3a | GSE155698 | 10.1038/s43018-020-00121-4 | DCP Ingest |
| EastAsianPancreaticIslets | e77fed30-959d-4fad-bc15-a0a5a85c21d2 | GSE97655 | 10.1038/s41598-017-05266-4 | DCP Ingest |
| Faryabi-Human-10x3pv2 | daef3fda-2620-45ae-a3f7-1613814a35bf | GSE148073 | 10.1038/s42255-022-00531-x | DCP Ingest |
| GarciaOcana-Human-10x3pv3 | 17cf943b-e247-454f-908b-da58665fcc56 | GSE217837 | 10.1186/s13073-023-01179-2 | DCP Ingest |
| Healthy_and_type_2_diabetes_pancreas | ae71be1d-ddd8-4feb-9bed-24c3ddb6e1ad | E-MTAB-5061 | 10.1016/j.cmet.2016.08.020 | DCP Ingest |
| HealthyAndDiabeticPancreas | 1c6a960d-52ac-44ea-b728-a59c7ab9dc8e | GSE101207 | 10.1016/j.celrep.2019.02.043 | DCP Ingest |
| Herrera-Human-10x3pv3 | 7a8d45f1-353b-4508-8e89-65a96785b167 | GSE150724 | 10.1038/s41467-022-29588-8 | DCP Ingest |
| HumanEndocrinePancreas | 99101928-d9b1-4aaf-b759-e97958ac7403 | GSE83139 | 10.2337/db16-0405 | DCP Ingest |
| HumanIsletType2Diabetes | 7adede6a-0ab7-45e6-9b67-ffe7466bec1f | GSE81608 | 10.1016/j.cmet.2016.08.018 | DCP Ingest |
| HumanMousePancreas | f86f1ab4-1fbb-4510-ae35-3ffd752d4dfc | GSE84133 | 10.1016/j.cels.2016.08.011 | DCP Ingest |
| HumanT2DPancreas | c6ad8f9b-d26a-4811-b2ba-93d487978446 | GSE86473 | 10.1101/gr.212720.116 | DCP Ingest |
| Lickert-Human-10x3pv2 | 28dd1438-8f40-40d0-8e53-ee3301b66218 | GSE198623 | 10.1016/j.molmet.2022.101595 | DCP Ingest |
| pancreasCelSeq2 | 894ae6ac-5b48-41a8-a72f-315a9b60a62e | GSE85241 | 10.1016/j.cels.2016.09.002 | DCP Ingest |
| pancreasNormalIslets | 27e2e0ae-5971-4927-aac1-19e81804097b | GSE114297 | 10.2337/db18-0365 | DCP Ingest |
| scHumanPancreaticIslets | daa371e8-1ec3-43ef-924f-896d901eab6f | GSE183568 | 10.1172/jci.insight.151621 | DCP Ingest |
| SPAN | 1e618693-aa16-4414-bb11-5c8baf9a5c4d | PANC-DB | 10.1038/s42255-023-00806-x | DCP Ingest |
| HPAPCorrPatil10x | b3938158-4e8d-4fdb-9e13-9e94270dde16 | Charite snRNA | 10.1053/j.gastro.2020.11.010 | DCP Ingest |
| Chen-Human-ATACseq | de0bb4b0-b691-46d7-83db-55670f0afc3a | GSE233476 | 10.21203/rs.3.rs-3343318/v1 | DCP Ingest |


arschat commented Sep 3, 2024

  • sample_source: since none of the datasets had the newly implemented transplant_organ field filled, I investigated the source of each dataset separately. All projects mention or imply that the pancreas came from a donor whose organ was procured for transplantation, or that a similar procedure was carried out.

    Specifically:
    • Prodo Laboratories specify that they...

    deliver human research pancreases from United Network for Organ Sharing (UNOS) from listed cadaver organ donors that are refused for primary human pancreas transplantation or isolated islet transplanted into listed diabetic recipients.

    • The Integrated Islet Distribution Program (IIDP) specifies ...

    The IIDP depends on the subcontracted human islet isolation centers to provide research investigators with human islets. It is the responsibility of the human islet centers to obtain research quality pancreata from the Organ Procurement Organizations (OPO). Some criteria are stricter than those used by transplant centers for organ transplant donors.

    • The Alberta Diabetes Institute IsletCore specifies in its record sheet that it follows a procedure that includes cold perfusion.
    • For project CPmicroEnvironment

    Pancreatic tissue was only used if the pancreas could not be used for clinical pancreas or islet transplantation


arschat commented Sep 3, 2024

Draft email.

Hi everyone,

After our discussion last week, I validated and mapped ~75% of the required Tier 1 fields and some of the available optional ones. The folder with the files is here: https://drive.google.com/drive/folders/1iTioXQelz3_GEyZ4UlcOkUYyKmHk7eq1 .

The file pancreas_dcp_tier1.xlsx contains all the Tier 1 metadata that we currently hold in DCP, mapped to the library level. The most useful information is in the dcp_tier1_metadata tab; the other tabs are there to validate our metadata against what you've already wrangled in "contacting_sheet".

There are a few comments I would like to make:

  1. I included some extra fields that are not part of Tier 1 in order to help validation and mapping of IDs. These fields have a ~ as prefix.
  2. For manner_of_death, we did not have any information on the Hardy scale; however, we usually have some information about the cause of death, which it might be feasible to translate to the Hardy scale (field ~manner_of_death_string).
  3. For sample_source, in most cases the organism was referred to as an organ donor/donor organism, or some kind of cold perfusion was mentioned. Therefore, I assumed that in all cases the donor was not a postmortem donor but an organ donor or surgical donor (biopsy or blood draw).
  4. For sample_collection_method, I added surgical resection for organ donation; however, I am not sure whether that would be consistent across bionetworks.
  5. tissue_type is tissue; however, it is standard procedure for almost all projects to isolate pancreatic islets and culture them for a few hours/days. Existing CxG datasets that follow this procedure have tissue in tissue_type, not cell culture or organoid, so I kept the same for all other datasets.
  6. For sequencing_platform, there is one dataset (PANC-DB) for which two sequencers are mentioned, but it is not clear which library was sequenced with which sequencer.
  7. library_preparation_batch and library_sequencing_run were not recorded at the time of dataset wrangling; however, accessions and any other recorded information are provided in case they are useful.
  8. is_primary_data is not information that we can record in our schema.
  9. For reference_genome, dataset GSE198623 is listed as GRCh38 in your contacting_sheet; however, GEO mentions that it is hg19/GRCh37.
  10. For disease_ontology_term_id, there is one dataset (PANC-DB) where the presence of Auto-Antibody (AAB) is mentioned in the disease field; however, as far as I understand, that presence does not constitute a disease or a MONDO ontology term. Therefore, we have listed these donors as normal/non-diseased.
  11. For self_reported_ethnicity_ontology_term_id, as far as I am aware, Tier 1 is going to be filled with "unknown" for privacy reasons. However, I've added the ethnicities that we currently hold in DCP here (recorded from public archives).
  12. There are 2 published datasets that are yet to be wrangled.

I also added mappings for the IDs you provided in the all_ids_pancreas_wo_index.csv file. Here are some comments about those:

  1. There are 5 datasets that have a single donor_id; however, we've wrangled multiple donors for them (GSE217837, GSE81608, GSE83139, GSE165045, GSE183568).
  2. There are 79 sample GSM* IDs that match a mouse sample. Would you like us to include mouse data as well? Some projects in the list have both human and mouse data.
  3. There are IDs from Escape, which is not yet published. We are not able to wrangle and provide any metadata for projects that are not published yet.

I hope this will help you with your integration steps. Let us know if you have more questions.

Best regards,
Arsenios
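The first ID comment in the draft above (datasets with a single donor_id on the integration side but several wrangled donors) can be checked with a small script. This is only a sketch under assumptions: the row layout and the `dataset` / `donor_id` keys are hypothetical, and the actual columns of all_ids_pancreas_wo_index.csv may differ.

```python
from collections import defaultdict


def donors_per_dataset(rows, dataset_key="dataset", donor_key="donor_id"):
    """Group the distinct donor IDs by dataset.

    `rows` is an iterable of dicts; the key names are assumptions.
    """
    donors = defaultdict(set)
    for row in rows:
        donors[row[dataset_key]].add(row[donor_key])
    return donors


def single_donor_mismatches(integration_rows, dcp_rows):
    """Datasets with exactly one donor_id in the integration IDs
    but more than one wrangled donor in the DCP metadata."""
    integration = donors_per_dataset(integration_rows)
    dcp = donors_per_dataset(dcp_rows)
    return sorted(
        dataset
        for dataset, ids in integration.items()
        if len(ids) == 1 and len(dcp.get(dataset, ())) > 1
    )
```

Datasets returned by `single_donor_mismatches` are the ones where the donor IDs need to be re-mapped before the two sources can be joined.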


arschat commented Sep 18, 2024

Also added the Charite dataset and asked a question about further wrangling:

Hi Sara, Daniel and Shrey,
I wanted to give you a small update about this. We added the Charite snRNA dataset to the pancreas_dcp_tier1.xlsx file.
There is another dataset, GSE233476, colored "peach" in contacting_sheet - contacting, for which we can't find single-cell data to download, only bulk RNA & ATAC-seq. We were wondering whether you have had the chance to get the single-cell data from the contributors, or whether you are still interested in the bulk part of the dataset as well.
In the all_ids_pancreas_wo_index file, we found some IDs from the nicheformer dataset. However, neither nicheformer nor the original publication of those samples (GSE156728) was in the contacting_sheet list. Would you like to include any of these in the publication list as well?
All pancreas datasets that are marked as green or peach in contacting_sheet have been wrangled, converted to Tier 1 format and appended to the pancreas_dcp_tier1.xlsx file, except for GSE233476 mentioned above and the unpublished ones, which we don't have access to.
Let us know if that's enough or if we can help with something else.
Best,
Arsenios

idazucchi added the HCA label Sep 20, 2024