Metadata contains incorrect values for library_preparation_protocol.library_construction_method #13

Open
NoopDog opened this issue Apr 14, 2021 · 11 comments
Assignees
Labels
bug [type] A defect preventing use of the system as specified epic [type] Issue consists of multiple smaller issues spike:3 [process] Spike estimate of three points

Comments

@NoopDog

NoopDog commented Apr 14, 2021

The Library Construction Approach facet in the data browser has several terms that are either incorrectly labeled or insufficiently specific, cluttering up the list.

For example, for the 10x family of library construction approaches, the browser lists:

  • 10X 3' v1 sequencing
  • 10x 3' v2
  • 10X 3' v2 sequencing
  • 10x 3' v3 sequencing
  • 10X 3' v3 sequencing
  • 10X 5' v2 sequencing
  • 10X Ig enrichment
  • 10X TCR enrichment
  • 10X v2 sequencing
  • 10x v3 sequencing

See the 20210401_dcp4-Library-Preparation-Protocols Spreadsheet for a report with the full list of library preparation protocol documents used in the metadata.

The above list contains several classes of errors that should be fixed and may require changes to validation or ingest/wrangling SOP to prevent them from happening again.

Note that in addition to TDR snapshots and Azul indexes, the incorrect ontology terms are also likely in the DCP Generated Matrices' embedded metadata. We will need to validate the DCP generated matrices, and if necessary, come up with an efficient approach for updating the metadata.

Expected Outcome

Using the correct and most specific ontology terms available, we should be able to trim the above list to:

  • 10X 3' v1 sequencing
  • 10X 3' v2 sequencing
  • 10x 3' v3 sequencing
  • 10X 5' v2 sequencing
  • 10X Ig enrichment
  • 10X TCR enrichment

Note that since 10X Ig enrichment and 10X TCR enrichment are subclasses of 10X 5' v2 sequencing, we may be able to eliminate 10X 5' v2 sequencing as well.

Background

library_preparation_protocol.library_construction_method is defined to have a graph restriction: Subclasses of OBI:0000711 from obo:efo.

See EBI OLS EFO / OBI_0000711 for the ontology terms we use to define this field.
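
As a rough illustration of what the graph restriction means, the check below walks "is a" edges from a term up to OBI:0000711. The `PARENTS` map and the placement of the two EFO terms in it are assumptions made for this sketch; a real check would load the hierarchy from the ontology, where a term can also have more than one parent.

```python
from typing import Dict, Optional

# Hypothetical child -> parent ("is a") edges, hard-coded for illustration.
# A real check would load these from the ontology itself.
PARENTS: Dict[str, Optional[str]] = {
    "OBI:0000711": None,           # root named in the schema restriction
    "EFO:0009310": "OBI:0000711",  # assumed placement, illustration only
    "EFO:0009899": "EFO:0009310",  # assumed placement, illustration only
}


def satisfies_restriction(term: str, root: str = "OBI:0000711") -> bool:
    """Return True if `term` is a strict subclass of `root`.

    Assumes a single parent per term; unknown terms are rejected.
    """
    seen = set()
    current = PARENTS.get(term)
    while current is not None and current not in seen:
        if current == root:
            return True
        seen.add(current)
        current = PARENTS.get(current)
    return False
```

Whether the root term itself should count as satisfying the restriction is not stated in this issue; the sketch rejects it.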

The value of the library_preparation_protocol.library_construction_method is a library_construction_ontology entity which defines the following fields

| Field | Description |
| --- | --- |
| library_construction_ontology.ontology | An ontology term identifier in the form prefix:accession. For example, "EFO:0009310" or "EFO:0008931". |
| library_construction_ontology.ontology_label | (string) The preferred label for the ontology term referred to in the ontology field. This may differ from the user-supplied value in the text field. For example, "10X v2 sequencing" or "Smart-seq". |
| library_construction_ontology.text | (string) The name of a library construction approach being used. For example, "10X v2 sequencing" or "Smart-seq2". |

When Azul indexes this field, it uses ontology_label if present and text if not. If neither is present, it falls back to ontology (the term reference).
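
The fallback order described here can be sketched as follows. This is not Azul's code; the function name `facet_value` is hypothetical, and treating an empty string the same as a missing value is an assumption.

```python
from typing import Mapping, Optional


def facet_value(method: Mapping[str, Optional[str]]) -> Optional[str]:
    """Pick the value to index for a library_construction_method entity.

    Order of preference: ontology_label, then text, then ontology.
    """
    for key in ("ontology_label", "text", "ontology"):
        value = method.get(key)
        if value:  # assumption: None and "" both count as absent
            return value
    return None
```

One consequence of this order is that a mislabeled ontology_label always wins over a correct ontology ID, which is how the duplicate facet terms listed at the top of this issue end up in the browser.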

Error Types

Looking at the spreadsheet above, it appears there are several classes of problems to be addressed:

| Type | Description | Example |
| --- | --- | --- |
| 1 | Incorrect ontology_label | e.g. using DroNc-Seq instead of DroNc-seq, 10x 3'v2 instead of 10X 3' v2 sequencing |
| 2 | Using an ontology identifier when a more specific term is available | e.g. using 10X v2 sequencing (EFO:0009310) instead of a more specific term that specifies the end_bias, such as (EFO_0009899) |
| 3 | Mismatch of ontology_label and ontology_term | e.g. label is 10X 3' v2 sequencing and text is 10X 5' v2 sequencing (Row 66) |

We may also have internal consistency errors that show up with further validation, for example, where the end_bias does not match the ontology term.
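
A minimal sketch of how rows of the spreadsheet might be sorted into these classes. All reference data here is hypothetical: the label assumed for EFO:0009899, the set of "too general" terms, and the expected end_bias value would need to come from the ontology and the schema. The type 3 check only flags label/text pairs that differ beyond letter case as candidates for review, since the text field is allowed to differ from the label.

```python
from typing import Dict, List, Optional, Set

# Hypothetical reference data, hard-coded for illustration only.
CANONICAL_LABELS: Dict[str, str] = {
    "EFO:0009310": "10X v2 sequencing",
    "EFO:0009899": "10X 3' v2 sequencing",  # assumed label for this ID
}
TOO_GENERAL: Set[str] = {"EFO:0009310"}  # terms with a more specific child
EXPECTED_END_BIAS: Dict[str, str] = {"EFO:0009899": "3 prime tag"}  # assumed


def classify(record: Dict[str, Optional[str]]) -> List[str]:
    """Return the problem classes that apply to one protocol record."""
    problems: List[str] = []
    term = record.get("ontology")
    label = record.get("ontology_label")
    text = record.get("text")

    # Type 1: the label is not the preferred label for the term ID.
    canonical = CANONICAL_LABELS.get(term) if term else None
    if canonical is not None and label is not None and label != canonical:
        problems.append("type 1")

    # Type 2: a more specific term is available.
    if term in TOO_GENERAL:
        problems.append("type 2")

    # Type 3: label and text disagree beyond letter case (review candidate).
    if label and text and label.casefold() != text.casefold():
        problems.append("type 3")

    # Internal consistency: end_bias contradicts the term.
    expected = EXPECTED_END_BIAS.get(term) if term else None
    if expected and record.get("end_bias") not in (None, expected):
        problems.append("end_bias")

    return problems
```

A record can fall into several classes at once, so the function returns a list rather than a single type.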

Possible Discussion Points

  1. What is the best way to find, report, track, and fix these kinds of errors and create a work queue for resolving them?

  2. Where might we add validation to prevent incorrect ontology terms and labels?

  3. What validations are required, and how might they be specified and implemented? For example:

    1. How can we specify when non-leaf nodes should be disallowed as ontology terms? For example, how could we specify that 10x 5’ v2 sequencing is allowed, but 10x v2 sequencing is not?
    2. What is the purpose of the text field when the ontology label is provided? Should we be concerned when there is an apparent mismatch between the ontology label and the text?
  4. Should we more aggressively use hcao to add terms where they are missing in the core ontologies? For example, to prevent "nulls" in the ontology and ontology text fields.

  5. Can/should we fix the incorrect metadata that has made it into DCP generated matrices?
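
For discussion point 3.1, one possible way to specify the rule is "leaf terms are allowed, plus an explicit allow-list of non-leaf terms". The sketch below is only one option; the `CHILDREN` map is hard-coded from the relationships mentioned in this issue, labels are used as keys for readability where a real rule would use term IDs, and `ALLOWED_NON_LEAF` is a hypothetical configuration value.

```python
from typing import Dict, FrozenSet, List

# Hypothetical parent -> children edges, taken from the examples in this issue.
CHILDREN: Dict[str, List[str]] = {
    "10x v2 sequencing": ["10x 3' v2 sequencing", "10x 5' v2 sequencing"],
    "10x 5' v2 sequencing": ["10x Ig enrichment", "10x TCR enrichment"],
}

# Hypothetical allow-list: non-leaf terms that may still be used directly.
ALLOWED_NON_LEAF: FrozenSet[str] = frozenset({"10x 5' v2 sequencing"})


def is_allowed(term: str) -> bool:
    """Allow leaf terms and explicitly allow-listed non-leaf terms.

    Does not check that `term` exists in the ontology at all; that would
    be a separate validation.
    """
    is_leaf = not CHILDREN.get(term)
    return is_leaf or term in ALLOWED_NON_LEAF
```

The allow-list has to be maintained by hand, so it trades automation for the ability to keep terms such as 10x 5' v2 sequencing usable.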

Notes

The query for the above spreadsheet is listed below. The query could be modified to look for similar errors in other ontologized fields.

SELECT
  protocol_project.project_id,
  library_preparation_protocol_id,
  json_extract_scalar(content,
    "$.library_construction_method.ontology") AS ontology_id,
  json_extract_scalar(content,
    "$.library_construction_method.ontology_label") AS ontology_label,
  json_extract_scalar(content,
    "$.library_construction_method.text") AS text,
  json_extract_scalar(content,
    "$.end_bias") AS end_bias
FROM
  `broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2___20210401_dcp4.library_preparation_protocol` AS library_preparation_protocol
FULL JOIN (
  SELECT
    DISTINCT *
  FROM (
    SELECT
      project_id,
      JSON_EXTRACT_SCALAR(protocol,
        "$.protocol_type") AS protocol_type,
      JSON_EXTRACT_SCALAR(protocol,
        "$.protocol_id") AS protocol_id,
    FROM
      `broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2___20210401_dcp4.links`
    LEFT JOIN
      UNNEST(JSON_EXTRACT_ARRAY(content,
          "$.links")) AS process
    LEFT JOIN
      UNNEST(JSON_EXTRACT_ARRAY(process,
          "$.protocols")) AS protocol ) AS protocol_project
  WHERE
    protocol_type = "library_preparation_protocol") AS protocol_project
ON
  protocol_project.protocol_id = library_preparation_protocol.library_preparation_protocol_id
ORDER BY
  ontology_id,
  library_preparation_protocol_id
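
To reuse the idea of the query above for other ontologized fields, the report part could be generated from a template, roughly as below. This is a simplified sketch that omits the join against the links table. It assumes that other tables follow the same `<table>_id` column naming and that other ontologized fields carry the same ontology, ontology_label and text keys; the function name and its parameters are hypothetical.

```python
import re

_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_DATASET_RE = re.compile(r"^[A-Za-z0-9_.\-]+$")


def ontology_report_query(dataset: str, table: str, field: str) -> str:
    """Build a BigQuery query listing the ontology values of one field."""
    # Reject anything that is not a plain identifier, since the names are
    # interpolated directly into the SQL text.
    for name in (table, field):
        if not _NAME_RE.match(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    if not _DATASET_RE.match(dataset):
        raise ValueError(f"unexpected dataset name: {dataset!r}")
    return f"""
SELECT
  {table}_id,
  json_extract_scalar(content, "$.{field}.ontology") AS ontology_id,
  json_extract_scalar(content, "$.{field}.ontology_label") AS ontology_label,
  json_extract_scalar(content, "$.{field}.text") AS text
FROM
  `{dataset}.{table}`
ORDER BY
  ontology_id,
  {table}_id
""".strip()
```

For example, `ontology_report_query("my-project.my_dataset", "specimen_from_organism", "organ")` would produce a report for a hypothetical organ field; the dataset, table and field names in that call are placeholders.
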
theathorn added the labels orange ([process] Done by the Azul team) and bug ([type] A defect preventing use of the system as specified) on Apr 14, 2021
theathorn added the label spike:3 ([process] Spike estimate of three points) on Apr 14, 2021
@mshadbolt

Just wanted to note that 10X 5' v2 sequencing is specifically gene expression, whereas the sub-terms are enrichments of that kind of library, so we can't remove that term.

@hannes-ucsc
Collaborator

hannes-ucsc commented Apr 21, 2021

I thought the sub-term relation expressed an "is a" relation. If that's the case, we would need a more specific sub term.

If we want to represent an apple, and the fruit term only has pear and orange sub-terms, then we shouldn't use fruit but instead add apple as a sub-term of fruit and use that.

@NoopDog
Author

NoopDog commented Apr 22, 2021

@theathorn @hannes-ucsc @mshadbolt I added possible discussion points above for when we gather to discuss.

@theathorn

Slack thread.

theathorn changed the title from "Metadatata contains incorrect values for library_preparation_protocol.library_construction_method" to "Metadata contains incorrect values for library_preparation_protocol.library_construction_method" on May 3, 2021
@ESapenaVentura
Contributor

ESapenaVentura commented May 5, 2021

> I thought the sub-term relation expressed an "is a" relation. If that's the case, we would need a more specific sub term.
>
> If we want to represent an apple, and the fruit term only has pear and orange sub-terms, then we shouldn't use fruit but instead add apple as a sub-term of fruit and use that.

10x 5' (v2 and v3) represents gene expression from the 5' end of the whole set of transcripts of a cell. 5' Ig and TCR enrichment take a part of that whole set of transcripts and enrich for the sequences that are translated into T cell receptors and immunoglobulins; these transcripts have sequences that can be identified and can therefore be enriched and separated.

Following the analogy, it would be more of a situation of representing an apple and its seeds. The seeds are a part of the apple, but you still need the apple term to represent the whole.

Happy to discuss if we need to change the way we label it to make it clearer, but I don't think the current way is incorrect.

@hannes-ucsc
Collaborator

I see, the old "is a" vs "part of" conflation. How do I as a consumer of the ontology distinguish between term relationships that represent inheritance (is a) vs ones that express an aggregation (part of)?

@kbergin
Collaborator

kbergin commented May 6, 2021

Just a brief comment: in the narrowed-down list under "Expected Outcome" in your main post, these two are the same except for capitalization, so they can be condensed.

10x 3' v3 sequencing
10X 3' v3 sequencing

@NoopDog
Author

NoopDog commented May 7, 2021

Notes from our May 6 call

We agreed that:

For problem 1 above - incorrect ontology label

  1. Wranglers will update the metadata in an upcoming release.
  2. Azul team will investigate resolving ontology labels from the ontology term ID during/prior to indexing.

For problem 2 above - non-leaf ontology terms used

  1. Wranglers will update the non-leaf terms to the leaf term in an upcoming release.

For missing ontologies

  1. Wranglers will backfill any null library_construction_ontology.ontology where an appropriate ontology term now exists.

To track the wrangler work

  1. We will create tickets in the HCA EBI Wrangler Central GitHub repository.

To add traceability to these DCP-wide activities

  1. Epics will be created in the DCP2 (this) repo and we will link the epics to the Wrangler Central issues.

TODO

  1. We need to determine if and how we will update the DCP-generated matrices affected by the metadata updates.

@NoopDog
Author

NoopDog commented May 25, 2021

Notes from May 25 Call

  • Azul to look up ontology terms at indexing time from ontology IDs. See "Use ontology id to lookup label" (DataBiosphere/azul#3076).
  • Azul to return ontology IDs along with term names in Azul responses so the data browser can display the term on hover and link out to the term definition. See "Add ontology id to entities response" (DataBiosphere/azul#3078).
  • TODO Confirm that library construction approaches are mutually exclusive (that is, only one would appear where the term is used) and that the relationship between child terms and parent terms is "is a".
  • TODO Close tickets in Wrangler Central related to incorrect ontology labels, as these will be fixed in the browser by "Use ontology id to lookup label" (DataBiosphere/azul#3076).
  • TODO Schedule the remaining Wrangler Central tickets (using an ontology where none existed, using the leaf term) for a data release.
  • TODO Determine if EFO/OBI supports versioning, or what the recommended practice is for stating that we are using the ontology as it existed on a given date.
  • Convert this issue to a ZenHub epic and move the blocking tickets to sub-tickets.
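
The first two items could look roughly like the sketch below: the label is resolved from the ontology ID through a pluggable lookup, and the ID is returned alongside the label. This is not the implementation tracked in the Azul tickets; `make_resolver`, the `lookup` callable, and the fallback to the supplied label when the lookup finds nothing are all assumptions.

```python
from typing import Callable, Dict, Optional


def make_resolver(lookup: Callable[[str], Optional[str]]):
    """Return a function that resolves labels from ontology IDs.

    `lookup` maps a term ID to its preferred label, or None if unknown.
    Results are cached so each ID is looked up at most once.
    """
    cache: Dict[str, Optional[str]] = {}

    def resolve(method: dict) -> dict:
        term_id = method.get("ontology")
        label = None
        if term_id:
            if term_id not in cache:
                cache[term_id] = lookup(term_id)
            label = cache[term_id]
        if label is None:
            # Assumed fallback: keep whatever the metadata supplied.
            label = method.get("ontology_label") or method.get("text")
        return {"ontology": term_id, "label": label}

    return resolve
```

With a lookup backed by a pinned copy of the ontology, this would also make explicit which ontology version the labels come from, which relates to the versioning TODO above.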

@hannes-ucsc
Collaborator

Epics shouldn't be blocked by the individual tickets, otherwise the filtering by Epic doesn't work. Those tickets should be part of the epic. I'll fix this.

@hannes-ucsc
Collaborator

There are no orange tickets in this epic. Taking off the orange label.

Many of the issues in this epic were closed in favor of a programmatic solution (epic #3079).

hannes-ucsc removed the label orange ([process] Done by the Azul team) on Apr 20, 2022