AUDR Multiple updates for the Spatio-temporal immune zonation of the human kidney dataset in prod #81

Closed
25 of 27 tasks
lauraclarke opened this issue Jul 2, 2020 · 13 comments
Assignees
Labels
audr This dataset needs to be edited/updated dataset All dataset tickets should have this label, only one ticket per dataset DCP1.0 Label for datasets in DCP1.0 operations This issue is an operational task Release 14 for datasets targeted at DCP data release 14 task A wrangler task

Comments

@lauraclarke
Contributor

lauraclarke commented Jul 2, 2020

Dataset/group this task is for:

project full name: Spatio-temporal immune zonation of the human kidney
project short name: KidneySingleCellAtlas
project uuid: abe1a013-af7a-45ed-8c26-f3793c24a1f4
submission date: 2019-08-14T10:22:30.675Z 2019-10-03T12:29:04.626Z 2019-10-22T17:45:52.311Z 2019-09-25T16:40:15.246Z 2019-10-03T12:44:03.880Z
submission uuid: 702313be-fdde-42ea-89a5-bd1b01531736 9cfca427-6e22-447a-867e-4d81fdb7391c d5410c6e-612d-421a-a66f-2de5e04dd050 2afc1a93-f35d-4dec-95b7-7bd54b6da834 9e1d7bdc-e4a8-4dac-a131-6434aeb15bd0
update date: 2019-08-14T10:24:13.705Z 2019-10-03T12:30:26.645Z 2019-10-23T14:50:16.509Z 2019-10-03T09:44:08.217Z 2019-10-03T12:45:15.145Z
involved wranglers: Enrique Sapena Ventura
Analysis state: INCOMPLETE
Project state: INCOMPLETE

Current spreadsheet can be found at

PROJECTS-FINISHED/Benjamin Stewart - Adult fetal kidney

Original ticket in HCA repo https://github.com/HumanCellAtlas/hca-data-wrangling/issues/341

Wrangler responsible for this dataset/lab:

Enrique

Description of the task:

  • Update the old, less specific 10x v2 sequencing ontology (EFO:0009310) to the newer, more specific 10x 3'/5' v2 sequencing ontology (EFO:0009899/EFO:0009900). This currently depends on when the pipelines team changes their subscription queries: Update 10x subscription query HumanCellAtlas/secondary-analysis#800

  • Update file_format field from "fastq.gz" to "fastq". This is a file metadata update and is NOT a simple update.

This spreadsheet has a tab named Project - Publication instead of Project - Publications. That tab isn't parsed, so the publication is not displayed in the Browser.

  • Rename Project - Publication to Project - Publications in the spreadsheet
  • Change institute name of some collaborators' institutes and move to department
  • Change "European Bioinformatics Institute" to "EMBL-EBI"
  • Change 'Department of Medicine, University of Cambridge' to 'University of Cambridge' in the institution field, and shift the department name into department field

For donors F16, F17, F35, F38, F41, F45 fix the following fields:

  • donor_organism.development_stage.text:
    Fetus stage 1
    Fetus stage 2
    Fetus stage 1
    Fetus stage 3
    Fetus stage 4
    Fetus stage 3
  • donor_organism.development_stage.ontology:
    HsapDv:0000003
    HsapDv:0000003
    HsapDv:0000003
    HsapDv:0000003
    HsapDv:0000003
    HsapDv:0000003
  • donor_organism.development_stage.ontology_label:
    Carnegie stage 01
    Carnegie stage 01
    Carnegie stage 01
    Carnegie stage 01
    Carnegie stage 01
    Carnegie stage 01
  • donor_organism.gestational_age:
    8.14
    9.14
    7.85
    12
    16
    13.85

For donors F16, F17, F35, F38, F41, F45 delete values for the following field:

  • donor_organism.organism_age
    This field counts from birth so is irrelevant in the case of developmental samples
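The donor fixes above can be sketched as a small Python helper. This is only an illustration: it assumes the six values under each field are listed in the same order as the donors (F16, F17, F35, F38, F41, F45), and the row structure (a dict of column name to value per donor) and the function name `apply_donor_fixes` are hypothetical, not part of any ingest tooling.

```python
DONORS = ["F16", "F17", "F35", "F38", "F41", "F45"]

# Values copied from the ticket; ASSUMED to be listed in donor order.
FIXES = {
    "donor_organism.development_stage.text": [
        "Fetus stage 1", "Fetus stage 2", "Fetus stage 1",
        "Fetus stage 3", "Fetus stage 4", "Fetus stage 3",
    ],
    "donor_organism.development_stage.ontology": ["HsapDv:0000003"] * 6,
    "donor_organism.development_stage.ontology_label": ["Carnegie stage 01"] * 6,
    "donor_organism.gestational_age": ["8.14", "9.14", "7.85", "12", "16", "13.85"],
}

# This field counts from birth, so its value is blanked for fetal donors.
FIELD_TO_CLEAR = "donor_organism.organism_age"


def apply_donor_fixes(rows):
    """Return a copy of rows (donor id -> {column: value}) with the fixes applied.

    Donors not named in the ticket are copied through unchanged.
    """
    updated = {}
    for donor_id, row in rows.items():
        new_row = dict(row)
        if donor_id in DONORS:
            index = DONORS.index(donor_id)
            for field, values in FIXES.items():
                new_row[field] = values[index]
            new_row[FIELD_TO_CLEAR] = ""  # hypothetical convention: empty cell
        updated[donor_id] = new_row
    return updated
```

Keeping the values in one place makes it easy to check them against the ticket before any spreadsheet is touched.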

For specimens F16_1 and F17_1:

  • collection_protocol.protocol_core.protocol_id:
    Change ID to fetal_kidney_collection

For library preparation protocol:

  • library_preparation_protocol.strandedness:
    first

For files:

  • Add the content_description field and fill it in

  • review donor_organism.diseases.ontology_label and donor_organism.death.cause_of_death for consistency

  • change polyA RNA to polyA RNA extract

For supplementary files:

De-capitalise (to match the supplementary filenames currently in the DSS) the names of:

  • Mature_kidney_declined_transplant_collection.pdf
  • Mature_kidney_tumour_nephrectomy_collection.pdf
  • Fetal_kidney_collection.pdf
  • Mature_kidney_dissociation.pdf
  • Fetal_kidney_dissociation.pdf
  • Mature_kidney_enrichment_2.pdf
  • Fetal_flow_enrichment_1.pdf
  • Fetal_flow_enrichment_2.pdf

Acceptance criteria for the task:

  • all fields are fixed in the spreadsheet
  • an end-to-end review is done for the whole dataset (to make sure there are no further mistakes)
  • dataset is AUDRed in prod with simple updates
@lauraclarke lauraclarke added task A wrangler task operations This issue is an operational task labels Jul 2, 2020
@lauraclarke
Contributor Author

From @ESapenaVentura

specimen_from_organism.organ_parts.ontology:UBERON:0000056 [ureter] is not a descendant of
specimen_from_organism.organ.ontology:UBERON:0002113 [kidney]

Thanks for the hard work, jahilton. It seems that we need to discuss what is allowed and what is not. Ureter is clearly a part of the kidney, and the ontology relates it through contributes_to_morphology_of: renal pelvis, which in turn is a subclassOf kidney.

I don't know what kind of checks we should do here, but it is probably good to ask an expert on ontologies for help.
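To make the question concrete, here is a toy sketch of how the result changes depending on which relations the check is allowed to follow. The two edges are taken from the description in this comment, not verified against UBERON, and the graph structure, the `"renal pelvis"` node key and the function name `is_related` are hypothetical.

```python
from collections import deque

URETER = "UBERON:0000056"
KIDNEY = "UBERON:0002113"

# Toy edge list: term -> [(relation, target)]. NOT loaded from UBERON;
# it only encodes the two relations described in this comment.
EDGES = {
    URETER: [("contributes_to_morphology_of", "renal pelvis")],
    "renal pelvis": [("subclass_of", KIDNEY)],
}


def is_related(term, ancestor, allowed_relations):
    """Breadth-first search from term to ancestor over the allowed relations."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if current == ancestor:
            return True
        for relation, target in EDGES.get(current, []):
            if relation in allowed_relations and target not in seen:
                seen.add(target)
                queue.append(target)
    return False


# A strict "descendant" check follows subclass_of only, so ureter fails.
strict = is_related(URETER, KIDNEY, {"subclass_of"})
# Also following contributes_to_morphology_of lets ureter reach kidney.
relaxed = is_related(
    URETER, KIDNEY, {"subclass_of", "contributes_to_morphology_of"}
)
print(strict, relaxed)
```

Which set of relations should count is exactly the policy decision that needs an ontology expert.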

Also, while preparing some things, I discovered another AUDR:

Samples F16 and F17 are linked to the mature_tumour_nephrectomy_collection collection protocol when they should be linked to fetal_kidney_collection.
This is a simple update and shouldn't make it harder to AUDR this dataset

@lauraclarke
Contributor Author

From @rays22

d5410c6e-612d-421a-a66f-2de5e04dd050 has failed validation test: https://github.com/ebi-ait/ingest-graph-validator/tree/master/graph_test_set/protocol_document_has_supplementary_file.adoc
Test description:
If a protocol defines a supplementary file as the document describing it, there must exist a supplementary file that is named exactly the same.
If any filenames are returned by this test, that means those files are missing as they are specified in protocols and do not exist.
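The rule in the test description can be illustrated in plain Python. This is not the ingest-graph-validator implementation (that test runs against a graph database); the function name and the input shapes are assumptions made for the example.

```python
def missing_protocol_documents(protocol_documents, supplementary_file_names):
    """Return protocol document names with no identically named supplementary file.

    The comparison is exact and case-sensitive, as the test description requires.
    Protocols without a document (None or empty string) are skipped.
    """
    available = set(supplementary_file_names)
    missing = {
        document
        for document in protocol_documents
        if document and document not in available
    }
    return sorted(missing)


# A capitalised document name does not match a lowercase file name.
documents = ["Fetal_kidney_collection.pdf", "fetal_kidney_dissociation.pdf", None]
files = ["fetal_kidney_collection.pdf", "fetal_kidney_dissociation.pdf"]
print(missing_protocol_documents(documents, files))
```

Because the match is exact, a difference in capitalisation alone is enough for a file to be reported as missing.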

The other submission UUIDs failed to load into a graph database.

@lauraclarke lauraclarke added the audr This dataset needs to be edited/updated label Jul 2, 2020
@lauraclarke lauraclarke added dataset All dataset tickets should have this label, only one ticket per dataset and removed operations This issue is an operational task labels Nov 4, 2020
@clairerye
Contributor

I think this project should wait for bulk/spreadsheet updates to be possible

@ami-day ami-day added the internally blocked Issue blocked by something within DCP label Jun 10, 2021
@ofanobilbao ofanobilbao added DCP1.0 Label for datasets in DCP1.0 and removed internally blocked Issue blocked by something within DCP labels Sep 8, 2021
@idazucchi
Collaborator

idazucchi commented Feb 3, 2022

Done - Exported by Jacob as part of #334


Done - Was it exported?

I think that the donor updates were never exported because they were done in June 2020, 3 months after the export of the second submission

This spreadsheet has a tab named Project - Publication instead of Project - Publications. That tab isn't parsed, so the publication is not displayed in the Browser.

  • Rename Project - Publication to Project - Publications in the spreadsheet
  • Change institute name of some collaborators' institutes and move to department
  • Change "European Bioinformatics Institute" to "EMBL-EBI"
  • Change 'Department of Medicine, University of Cambridge' to 'University of Cambridge' in the institution field, and shift the department name into department field

For donors F16, F17, F35, F38, F41, F45 fix the following fields:

  • donor_organism.development_stage.text:
    Fetus stage 1
    Fetus stage 2
    Fetus stage 1
    Fetus stage 3
    Fetus stage 4
    Fetus stage 3
  • donor_organism.development_stage.ontology:
    HsapDv:0000003
    HsapDv:0000003
    HsapDv:0000003
    HsapDv:0000003
    HsapDv:0000003
    HsapDv:0000003
  • donor_organism.development_stage.ontology_label:
    Carnegie stage 01
    Carnegie stage 01
    Carnegie stage 01
    Carnegie stage 01
    Carnegie stage 01
    Carnegie stage 01
  • donor_organism.gestational_age:
    8.14
    9.14
    7.85
    12
    16
    13.85

For donors F16, F17, F35, F38, F41, F45 delete values for the following field:

  • donor_organism.organism_age
    This field counts from birth so is irrelevant in the case of developmental samples

  • review donor_organism.diseases.ontology_label and donor_organism.death.cause_of_death for consistency

For library preparation protocol:

  • library_preparation_protocol.strandedness:
    first
  • change polyA RNA to polyA RNA extract

To Do:

For specimens F16_1 and F17_1:

  • collection_protocol.protocol_core.protocol_id:
    Change ID to fetal_kidney_collection uuid: f44d6b65-0866-47d5-bfb2-8173b1b6a2b3
    F16 8ef48580-339e-4c23-b5bd-04c8692cab5f --> 17aea4fd-cd72-4c36-8ded-d686162804b4
    F17 9fde0deb-9630-4007-aea6-e7a7b1e68c2a --> 5fbd4bb4-e761-4345-99b8-bd5e141a8b94

For protocols:
De-capitalise (to match the protocol_core.document filenames currently in the submission) the names of:

  • Mature_kidney_declined_transplant_collection.pdf
  • Mature_kidney_tumour_nephrectomy_collection.pdf
  • Fetal_kidney_collection.pdf
  • Mature_kidney_dissociation.pdf
  • Fetal_kidney_dissociation.pdf
  • Mature_kidney_enrichment_2.pdf
  • Fetal_flow_enrichment_1.pdf
  • Fetal_flow_enrichment_2.pdf

For files:

  • Add the content_description field and fill it in
    sequence file : DNA sequence data:3494 DNA sequence
    supplementary file : Protocol data:2531 Protocol
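A sketch of how the two content descriptions could be attached per file type. It assumes content_description follows the same text / ontology / ontology_label pattern as the development_stage fields above and is stored as a list; the reading of each line as "text, ontology ID, label", the dictionary layout and the helper name are assumptions for illustration only.

```python
import copy

# Parsed from the two lines above as: text, ontology ID, ontology label.
CONTENT_DESCRIPTIONS = {
    "sequence_file": {
        "text": "DNA sequence",
        "ontology": "data:3494",
        "ontology_label": "DNA sequence",
    },
    "supplementary_file": {
        "text": "Protocol",
        "ontology": "data:2531",
        "ontology_label": "Protocol",
    },
}


def with_content_description(file_type, file_core):
    """Return a copy of file_core with content_description set for file_type.

    Raises KeyError for file types that have no description defined above.
    """
    updated = copy.deepcopy(file_core)
    updated["content_description"] = [dict(CONTENT_DESCRIPTIONS[file_type])]
    return updated
```

Returning a copy leaves the original metadata untouched, so the change can be reviewed before it is submitted as a bulk update.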

@Wkt8
Collaborator

Wkt8 commented Feb 8, 2022

@idazucchi what's the status on this?

@idazucchi
Collaborator

Almost done: I've done a bulk update to fix the file names in the protocols and the content description for all the files.
The metadata is validating

@jacobwindsor
Contributor

If this is still stuck @MightyAx, I would recommend looking into the Grafana logs to see what happened. Perhaps it is the same issue and the fix I made didn't fix it

@MightyAx MightyAx assigned MightyAx and unassigned MightyAx Feb 9, 2022
@MightyAx
Contributor

MightyAx commented Feb 9, 2022

The file was still stuck in metadata validating, but we aren't exporting files for DCP1 updates, only metadata, so this can be safely ignored.

Jacob set the metadata of the file to valid,
graph validation was successful
@idazucchi will export the project
@MightyAx to clean up the data files from the Terra bucket using this cleanup routine

@idazucchi
Collaborator

idazucchi commented Feb 9, 2022

The file 4834STDY7002875_S1_L001_R1_001.fastq.gz was stuck in metadata validation but there was no trace of the validation job.
The file has been forced to valid.

The project passed graph validation.
I exported the metadata, @MightyAx can you please run the script for the DCP1 projects?

To do:

  • fill export form

@MightyAx
Contributor

MightyAx commented Feb 10, 2022

The system is not seeing the export; the project was at status Metadata Valid.
I've triggered graph validation and it is now at status Graph Valid.
@idazucchi Can you click the submit button again (in metadata only mode)?

For clarity, we are definitely talking about this submission to the KidneyCellAtlas

@MightyAx MightyAx added operations This issue is an operational task Dev On Ops dev on ops tickets labels Feb 10, 2022
@MightyAx
Contributor

Ida exported successfully
I ran the cleanup with the following (abridged) results:

INFO:__main__:Working on submission d5410c6e-612d-421a-a66f-2de5e04dd050
INFO:__main__:Found project abe1a013-af7a-45ed-8c26-f3793c24a1f4
INFO:__main__:project abe1a013-af7a-45ed-8c26-f3793c24a1f4 is a DCP1 project
Would you like to continue with fixing the terra area for this submission? (Y/n)
[66/66 objects] 100% Done # /metadata/sequence_file
[10/10 objects] 100% Done # /metadata/supplementary_file
[76/76 objects] 100% Done # /descriptors
[22/22 objects] 100% Done # /links

@MightyAx MightyAx added operations This issue is an operational task and removed Dev On Ops dev on ops tickets operations This issue is an operational task labels Feb 10, 2022
@Wkt8
Collaborator

Wkt8 commented Feb 15, 2022

@idazucchi can this be moved to finished?

@idazucchi
Collaborator

This is a project in #334, so we will be doing the import request with the other datasets from that ticket

@amnonkhen amnonkhen added the Release 14 for datasets targeted at DCP data release 14 label Feb 18, 2022