Data Browser shows meaningless strings in Analysis Protocol facet drop-down #51

Open
hannes-ucsc opened this issue Nov 5, 2021 · 5 comments

hannes-ucsc commented Nov 5, 2021

image

Note the analysis_protocol_1 entry.

hannes-ucsc commented Nov 5, 2021

DCP/2 analysis uses protocol_core.protocol_id to identify the workflow, as in optimus_v4.2.3. That's the only meaningful property they populate; the rest is boilerplate. Here's an example:

{
    "computational_method": "Optimus",
    "describedBy": "https://schema.humancellatlas.org/type/protocol/analysis/9.1.0/analysis_protocol",
    "protocol_core": {
        "protocol_id": "optimus_v4.2.3"
    },
    "provenance": {
        "document_id": "54e9804d-958d-584f-aa66-243bcedff6dd",
        "submission_date": "2021-04-17T08:07:00.000000Z",
        "update_date": "2021-04-17T08:07:00.000000Z"
    },
    "schema_type": "protocol",
    "type": {
        "text": "analysis_protocol"
    }
}

Azul indexes that property and the Data Browser displays it in various places. Using an ID for human-readable display is hacky, but it is what we agreed on during DCP/1.

When organically described CGMs came along, the wranglers used generic sequential IDs for the protocols that are attached to the approximate process instance. The use of generic (meaningless) IDs has been a long-term practice for many wrangler-allocated IDs, e.g. biomaterial_1, cell_suspension_3. Nothing wrong with that either, except when the lab has already allocated IDs, in which case the wranglers should use the lab-allocated IDs instead of minting their own.

It's just that these two practices started to collide when organic CGMs were introduced.

Here's an example of an organic CGM analysis protocol:

{
    "describedBy": "https://schema.humancellatlas.org/type/protocol/analysis/9.2.0/analysis_protocol",
    "schema_type": "protocol",
    "protocol_core": {
        "protocol_id": "analysis_protocol_1",
        "protocol_name": "Cellranger",
        "protocol_description": "The 10X Genomics Cell Ranger pipeline (version 5.0.1) was used to perform sample demultiplexing, alignment to the hg38 human reference genome (refdata-gex-GRCh38-2020-A, 10x Genomics), barcode/UMI processing, and gene counting for each cell."
    },
    "type": {
        "text": "data transformation",
        "ontology": "OBI:0200000",
        "ontology_label": "data transformation"
    },
    "computational_method": "Cellranger mkfastq",
    "matrix": {
        "data_normalization_methods": [
            "other"
        ],
        "derivation_process": [
            "alignment"
        ]
    },
    "provenance": {
        "document_id": "ea6ce706-6c92-4b80-8522-55b86b676083",
        "submission_date": "2021-08-11T22:51:23.658Z",
        "update_date": "2021-08-11T22:51:29.099Z",
        "schema_major_version": 9,
        "schema_minor_version": 2
    }
}

@hannes-ucsc

The solution is to reconcile the differences. The protocol_id that's used for organic CGMs is too generic to have any utility because it doesn't allow metadata consumers to correlate analysis protocols between projects. If two projects use the same protocol, the corresponding protocol_id values should be identical. As things are right now, one could be analysis_protocol_1 and the other could be analysis_protocol_2, hiding the identity. Worse, if two different protocols were used (CellRanger and FooBar) both could easily use analysis_protocol_1, falsely indicating identity.
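To make the two failure modes concrete, here is a minimal Python sketch that scans protocol_core objects from several projects and reports both of them. The function name find_protocol_id_conflicts and the sample data are hypothetical, and the sketch assumes protocol_name is populated and reliable enough to tell protocols apart, which is not guaranteed for existing metadata.

```python
from collections import defaultdict


def find_protocol_id_conflicts(cores):
    """Report protocol_id values that hide or falsely indicate identity.

    `cores` is an iterable of protocol_core dicts. Entries without a
    protocol_name are skipped because this sketch has nothing to
    compare them against.
    """
    names_by_id = defaultdict(set)
    ids_by_name = defaultdict(set)
    for core in cores:
        name = core.get("protocol_name")
        if name is None:
            continue
        names_by_id[core["protocol_id"]].add(name)
        ids_by_name[name].add(core["protocol_id"])
    # Same ID used for different protocols: falsely indicates identity.
    false_identity = {
        i: sorted(names) for i, names in names_by_id.items() if len(names) > 1
    }
    # Same protocol under different IDs: hides identity.
    hidden_identity = {
        n: sorted(ids) for n, ids in ids_by_name.items() if len(ids) > 1
    }
    return false_identity, hidden_identity


# Hypothetical protocol_core objects taken from three projects.
cores = [
    {"protocol_id": "analysis_protocol_1", "protocol_name": "Cellranger"},
    {"protocol_id": "analysis_protocol_2", "protocol_name": "Cellranger"},
    {"protocol_id": "analysis_protocol_1", "protocol_name": "FooBar"},
]
false_identity, hidden_identity = find_protocol_id_conflicts(cores)
print(false_identity)
print(hidden_identity)
```

With consistent, meaningful protocol_id values, both dictionaries would come back empty.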

Wranglers should use meaningful values for protocol_id and coordinate with each other to use those values consistently across projects.

DCP/2 analyses should populate protocol_name with a human-readable, unique name for the protocol.

TL;DR:

In the above organic CGM protocol example, protocol_id should be cellranger_5.0.1 and the protocol_name should be CellRanger 5.0.1. In the DCP/2 analysis protocol example, protocol_name should be Optimus 4.2.3. Azul and DB should switch to using protocol_name.
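As a rough illustration of what switching to protocol_name could look like on the consumer side, here is a small Python sketch. The function protocol_label is hypothetical and is not Azul's actual code; falling back to protocol_id when protocol_name is absent is an assumption made here to cover documents that have not been updated yet.

```python
def protocol_label(protocol):
    """Return the string to display for an analysis protocol document.

    Prefers protocol_core.protocol_name and falls back to
    protocol_core.protocol_id (the fallback is an assumption of this sketch).
    """
    core = protocol["protocol_core"]
    return core.get("protocol_name") or core["protocol_id"]


# The DCP/2 example as it is today: only protocol_id is populated.
dcp2_today = {"protocol_core": {"protocol_id": "optimus_v4.2.3"}}

# The same document with protocol_name populated as proposed.
dcp2_proposed = {
    "protocol_core": {
        "protocol_id": "optimus_v4.2.3",
        "protocol_name": "Optimus 4.2.3",
    }
}

print(protocol_label(dcp2_today))
print(protocol_label(dcp2_proposed))
```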

@hannes-ucsc

@nikellepetrillo
@ami-day

@nikellepetrillo

@hannes-ucsc We are a little worried about the space in the protocol name "Optimus 4.2.3". Do you foresee any issues with that? If the space works for you, that is fine for us. I just wanted to call attention to it.

@hannes-ucsc

We will treat protocol_core.protocol_name as the user-friendly, human-readable form of protocol_core.protocol_id, so a space should be used there to separate words. I do not foresee any issues with that.
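One way to keep the two fields in step is to derive both from the same tool name and version. The Python sketch below does that; the helper protocol_identifiers is hypothetical, and the lowercase-with-underscores rule for protocol_id is an assumption based on the cellranger_5.0.1 example. It does not reproduce the "v" prefix in the existing optimus_v4.2.3 ID.

```python
def protocol_identifiers(tool, version):
    """Derive (protocol_id, protocol_name) from a tool name and version.

    protocol_id: lowercase, words joined by underscores (assumed convention).
    protocol_name: human-readable, words separated by spaces.
    """
    protocol_id = f"{tool.lower().replace(' ', '_')}_{version}"
    protocol_name = f"{tool} {version}"
    return protocol_id, protocol_name


print(protocol_identifiers("CellRanger", "5.0.1"))
```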
