Data Browser shows meaningless strings in Analysis Protocol facet drop-down #51

hannes-ucsc · 2021-11-05T20:44:26Z

Note the analysis_protocol_1 entry.

hannes-ucsc · 2021-11-05T20:45:56Z

DCP/2 analysis uses protocol_core.protocol_id to identify the workflow, as in optimus_v4.2.3. That's the only meaningful property they populate; the rest is boilerplate. Here's an example:

{
    "computational_method": "Optimus",
    "describedBy": "https://schema.humancellatlas.org/type/protocol/analysis/9.1.0/analysis_protocol",
    "protocol_core": {
        "protocol_id": "optimus_v4.2.3"
    },
    "provenance": {
        "document_id": "54e9804d-958d-584f-aa66-243bcedff6dd",
        "submission_date": "2021-04-17T08:07:00.000000Z",
        "update_date": "2021-04-17T08:07:00.000000Z"
    },
    "schema_type": "protocol",
    "type": {
        "text": "analysis_protocol"
    }
}

Azul indexes that property and the Data Browser displays it in various places. The use of an ID for a human-readable display is hacky, but it is what we agreed on during DCP/1.

When organically described CGMs came along, the wranglers used generic sequential IDs for the protocols that are attached to the appropriate process instance. The use of generic (meaningless) IDs has been a long-term practice for many wrangler-allocated IDs, e.g. biomaterial_1, cell_suspension_3. Nothing wrong with that either, except when the lab already allocated IDs, in which case the wranglers should use the lab-allocated IDs instead of minting their own.

It's just that these two practices started to collide when organic CGMs were introduced.

Here's an example of an organic CGM analysis protocol:

{
    "describedBy": "https://schema.humancellatlas.org/type/protocol/analysis/9.2.0/analysis_protocol",
    "schema_type": "protocol",
    "protocol_core": {
        "protocol_id": "analysis_protocol_1",
        "protocol_name": "Cellranger",
        "protocol_description": "The 10X Genomics Cell Ranger pipeline (version 5.0.1) was used to perform sample demultiplexing, alignment to the hg38 human reference genome (refdata-gex-GRCh38-2020-A, 10x Genomics), barcode/UMI processing, and gene counting for each cell."
    },
    "type": {
        "text": "data transformation",
        "ontology": "OBI:0200000",
        "ontology_label": "data transformation"
    },
    "computational_method": "Cellranger mkfastq",
    "matrix": {
        "data_normalization_methods": [
            "other"
        ],
        "derivation_process": [
            "alignment"
        ]
    },
    "provenance": {
        "document_id": "ea6ce706-6c92-4b80-8522-55b86b676083",
        "submission_date": "2021-08-11T22:51:23.658Z",
        "update_date": "2021-08-11T22:51:29.099Z",
        "schema_major_version": 9,
        "schema_minor_version": 2
    }
}

hannes-ucsc · 2021-11-05T21:01:09Z

The solution is to reconcile the differences. The protocol_id that's used for organic CGMs is too generic to have any utility because it doesn't allow metadata consumers to correlate analysis protocols between projects. If two projects use the same protocol, the corresponding protocol_id values should be identical. As things are right now, one could be analysis_protocol_1 and the other could be analysis_protocol_2, hiding the identity. Worse, if two different protocols were used (CellRanger and FooBar) both could easily use analysis_protocol_1, falsely indicating identity.
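To make the two failure modes concrete, here is a minimal sketch that groups protocols by protocol_id across projects. The project names and the flat record layout are hypothetical and exist only for illustration; FooBar is the placeholder tool from the paragraph above.

```python
from collections import defaultdict

# Hypothetical, simplified records: (project, protocol_id, tool actually used).
records = [
    ("project_a", "analysis_protocol_1", "CellRanger"),
    ("project_b", "analysis_protocol_2", "CellRanger"),
    ("project_c", "analysis_protocol_1", "FooBar"),
]


def tools_by_protocol_id(records):
    """Map each protocol_id to the set of tools that carry it."""
    grouped = defaultdict(set)
    for _project, protocol_id, tool in records:
        grouped[protocol_id].add(tool)
    return dict(grouped)


grouped = tools_by_protocol_id(records)

# Hidden identity: CellRanger appears under two different IDs.
# False identity: analysis_protocol_1 covers two different tools.
for protocol_id, tools in sorted(grouped.items()):
    print(protocol_id, sorted(tools))
```

A consumer that correlates protocols by protocol_id would split the two CellRanger projects apart and merge the CellRanger and FooBar projects together.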

Wranglers should use meaningful values for protocol_id and coordinate with each other to use those values consistently across projects.

DCP/2 analyses should populate protocol_name with a human-readable, unique name for the protocol.

TL;DR:

In the above organic CGM protocol example, protocol_id should be cellranger_5.0.1 and the protocol_name should be CellRanger 5.0.1. In the DCP/2 analysis protocol example, protocol_name should be Optimus 4.2.3. Azul and DB should switch to using protocol_name.
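One way to keep the two fields in sync is to derive both from the tool name and version. This is only a sketch: make_protocol_core is a hypothetical helper, not part of any wrangling tool, and the lowercase-plus-underscore rule is an assumption inferred from the cellranger_5.0.1 example. It does not reproduce the "v" in the existing optimus_v4.2.3 ID.

```python
def make_protocol_core(tool_name: str, version: str) -> dict:
    """Build matching protocol_id / protocol_name values (hypothetical helper).

    Assumed convention: the ID is the lowercased tool name with spaces
    replaced by underscores, joined to the version with an underscore;
    the name is the tool name and version separated by a space.
    """
    slug = tool_name.lower().replace(" ", "_")
    return {
        "protocol_id": f"{slug}_{version}",
        "protocol_name": f"{tool_name} {version}",
    }


core = make_protocol_core("CellRanger", "5.0.1")
print(core)
```

Because the ID is a pure function of tool and version, two projects that used the same tool release end up with the same protocol_id without any further coordination.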

hannes-ucsc · 2021-11-23T16:50:52Z

@nikellepetrillo
@ami-day

nikellepetrillo · 2021-11-29T18:51:26Z

@hannes-ucsc We are a little worried about the space in the protocol name "Optimus 4.2.3". Do you foresee any issues with that? If the space works for you, that is fine for us. I just wanted to call attention to it.

hannes-ucsc · 2021-11-29T19:02:42Z

We will treat protocol_core.protocol_name as the user-friendly, human-readable form of protocol_core.protocol_id, so a space should be used there to separate words. I do not foresee any issues with that.
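As a rough illustration of that relationship, a display layer could pick its label as follows. This is not Azul's or the Data Browser's actual code; facet_label is a hypothetical function, and falling back to protocol_id when protocol_name is absent is an assumption made here to cover documents that have not been updated yet.

```python
def facet_label(analysis_protocol: dict) -> str:
    """Return the label to show for an analysis protocol (hypothetical).

    Prefers the human-readable protocol_name; falls back to protocol_id
    (an assumption) when the name is missing or empty.
    """
    core = analysis_protocol.get("protocol_core", {})
    return core.get("protocol_name") or core["protocol_id"]


old_style = {"protocol_core": {"protocol_id": "optimus_v4.2.3"}}
new_style = {
    "protocol_core": {
        "protocol_id": "optimus_v4.2.3",
        "protocol_name": "Optimus 4.2.3",
    }
}

print(facet_label(old_style))
print(facet_label(new_style))
```

The space in "Optimus 4.2.3" is just part of a string value here, so it needs no special handling.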
