Move metadata CV terms to JSON file for validation and automatic docs generation #67

Merged
merged 2 commits into from
Sep 8, 2023

Conversation

RalfG
Collaborator

@RalfG RalfG commented May 26, 2023

This PR moves the metadata terms from the Google Doc to a JSON document according to the validator schema for metadata. This JSON file can be used for validation and for automatic documentation generation for mzSpecLib metadata CV terms.

A current version of the auto-generated documentation can be viewed here: https://github.com/HUPO-PSI/mzSpecLib/blob/891fbdb4463ea023e1b51830ddded87509e33f23/docs/metadata-rules.md

I first moved all data from the Google Doc to an Excel sheet for easy parsing. The original Excel sheet and the script for parsing it to JSON are still in a tmp directory. They can be removed once this PR is ready to be merged.

The rules schema has been updated to include definition and units fields.

To-dos and open questions:

  • How to merge this (single) JSON file with the validation rules written by @mobiusklein?
  • Update the new JSON file to include combinatorial logic for validation
  • Add higher-level descriptions of the different levels to the markdown documentation (making the generated document more human-friendly and accessible).
  • Do we keep the term definition in the JSON file or do we only fetch this information when parsing the JSON to markdown documentation? -> Fetch information from CV while generating documentation.

Contributor

@edeutsch edeutsch left a comment

This all looks good to me, many thanks!
I'm uncertain what to do about the open issues. I think we should discuss in more detail at the next call. Fine to merge and then continue effort from there, or hold it open, whichever.

@mobiusklein
Collaborator

I'm not sure we need to explicitly encode the value type constraints as validator rules. The validator already has access to the CV and can look up the value types there, without needing to regenerate the rule table should the CV update. However, the CV does not say where a term is expected in the data model, which this does do.

What we do want to express as validator rules are things like:

  1. Using the non-preferred term for expressing something (e.g. "selected ion m/z" instead of "experimentally determined precursor monoisotopic m/z")
  2. When a term is in XOR relationship (you can specify "proforma peptidoform ion notation" or "proforma peptidoform sequence", but not both)
  3. When a term has extra logic associated with it like "proforma peptidoform ion notation" providing sequence, charge state, and adduct formula(e)
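A minimal sketch of how the first two kinds of rule could be checked, using the term names from the examples above. The table names `NON_PREFERRED` and `XOR_GROUPS` and the function `check_terms` are hypothetical and are not taken from the existing validator. The third kind (terms with extra logic) is not covered here.

```python
# Hypothetical table: non-preferred term name -> preferred replacement.
NON_PREFERRED = {
    "selected ion m/z": "experimentally determined precursor monoisotopic m/z",
}

# Hypothetical XOR groups: at most one term from each group may be used.
XOR_GROUPS = [
    {"proforma peptidoform ion notation", "proforma peptidoform sequence"},
]


def check_terms(term_names):
    """Return a list of human-readable problems for one set of term names."""
    present = set(term_names)
    problems = []

    # Rule type 1: flag non-preferred terms and suggest the preferred one.
    for name in sorted(present):
        if name in NON_PREFERRED:
            problems.append(
                f"'{name}' is non-preferred; use '{NON_PREFERRED[name]}'"
            )

    # Rule type 2: flag mutually exclusive terms that appear together.
    for group in XOR_GROUPS:
        found = sorted(present & group)
        if len(found) > 1:
            problems.append(
                "mutually exclusive terms used together: " + ", ".join(found)
            )

    return problems
```

Note that this only rejects using both terms of a group at once. It does not require that one of them is present, which would be a separate requirement-level decision.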

There's a lot to review so I'll go over this again.

Something we probably should do is add some guidelines for the user about the four different ways we can express the "target mass" of a spectrum and how we've fragmented the representation between Spectrum and Analyte. I'll re-post these as separate issues:

  • Spectrum owns experimental precursor monoisotopic m/z, but charge state was emphatically moved to the Analyte on the last call. Spectrum can also hold charge state (but only when it lacks an interpretation?) and/or possible charge state.
  • Analyte owns theoretical mass, adduct ion mass (plus adduct ion formula), theoretical monoisotopic m/z, and theoretical average m/z, as well as charge state. In theory a whole protein might have ambiguous charge states even when assigned fully so possible charge state could map here too.
  • We can express that a spectrum is chimeric, but not the m/z and charge (z) values of the non-selected ions?

@edeutsch
Contributor

Discussed briefly on June 23. Need more input from @RalfG

@RalfG
Collaborator Author

RalfG commented Jul 10, 2023

Hi @mobiusklein,

I updated the code and data so that term definitions and types are parsed from the CV when generating the documentation, so they are no longer part of the rules JSON. This PR therefore no longer changes anything in the validation rules schema and no longer implies type constraints for validation.
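As a rough illustration of that split (not the actual script in this PR), the rules JSON could carry only where a term belongs, while the name, type, and definition are looked up at documentation time. The accessions, the `CV` dictionary, the JSON field names, and `render_markdown` below are all hypothetical placeholders.

```python
import json

# Hypothetical rules JSON: only the term accession and the level it belongs to.
RULES_JSON = """
[
  {"accession": "EX:0000001", "level": "Spectrum"},
  {"accession": "EX:0000002", "level": "Analyte"}
]
"""

# Hypothetical stand-in for a parsed CV, keyed by accession.
CV = {
    "EX:0000001": {
        "name": "example spectrum term",
        "value_type": "float",
        "definition": "Placeholder definition.",
    },
    "EX:0000002": {
        "name": "example analyte term",
        "value_type": "string",
        "definition": "Another placeholder.",
    },
}


def render_markdown(rules_json, cv):
    """Build a markdown table, pulling name, type, and definition from the CV."""
    rows = [
        "| Level | Accession | Name | Type | Definition |",
        "|---|---|---|---|---|",
    ]
    for rule in json.loads(rules_json):
        term = cv[rule["accession"]]  # looked up at render time, not stored in the rules
        rows.append(
            f"| {rule['level']} | {rule['accession']} | {term['name']} "
            f"| {term['value_type']} | {term['definition']} |"
        )
    return "\n".join(rows)


print(render_markdown(RULES_JSON, CV))
```

With this arrangement, a CV update changes the generated documentation without touching the rules file.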

To continue from this:
I realize that with this PR, the rules JSON files would serve a dual purpose:

  • Defining rules for validation
  • Defining 'recommendations' for metadata term usage throughout the various levels in an mzSpecLib file.

In this initial effort, I simply converted the Google Doc to a single JSON file, separate from the rules that you had already written. It is just one large list of terms without any additional validation logic (as you described above).

Next up, we need to find a way to merge these. I think this would require us to:

  1. Define the levels of requirement we want to use: gold/silver/bronze as defined here before, or must/should as defined in the Google Doc; and decide whether we want to use categories (general/peptide/consensus...)
  2. Assign a level (and potentially category) to each metadata item
  3. Group metadata items by XOR relationships where needed and add the additional logic where required.

What do you think?

Once (1) is done, I can start with (2) and (3). For (3) I will most likely require your input as well.
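To make steps (1) to (3) more concrete, here is one possible shape for the merged rules, assuming the must/should option. Everything in it is a hypothetical sketch: the class names, the `evaluate` function, and the requirement levels and categories assigned to the example terms are illustrative choices, not decisions made in this thread.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Requirement(Enum):
    """Step 1: requirement levels (the must/should option is assumed here)."""
    MUST = "must"
    SHOULD = "should"


@dataclass
class MetadataItem:
    """Step 2: a single term with an assigned level and category."""
    term: str
    requirement: Requirement
    category: str = "general"


@dataclass
class XorGroup:
    """Step 3: terms of which exactly one is expected."""
    terms: list[str]
    requirement: Requirement
    category: str = "general"


# Illustrative assignments only.
RULES = [
    MetadataItem("charge state", Requirement.SHOULD, category="peptide"),
    XorGroup(
        ["proforma peptidoform ion notation", "proforma peptidoform sequence"],
        Requirement.MUST,
        category="peptide",
    ),
]


def evaluate(entries, present_terms):
    """Return one message per entry that is not satisfied by the present terms."""
    present = set(present_terms)
    messages = []
    for entry in entries:
        level = entry.requirement.value
        if isinstance(entry, XorGroup):
            count = len(present & set(entry.terms))
            if count > 1:
                messages.append(f"{level}: use only one of {entry.terms}")
            elif count == 0:
                messages.append(f"{level}: provide one of {entry.terms}")
        elif entry.term not in present:
            messages.append(f"{level}: missing '{entry.term}'")
    return messages
```

Keeping the level and category on each entry lets the same list drive both validation messages and the grouping of terms in the generated documentation.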

@RalfG RalfG marked this pull request as ready for review September 8, 2023 15:26
@edeutsch edeutsch merged commit 7534e68 into master Sep 8, 2023
10 checks passed
@edeutsch edeutsch deleted the metadata-docs branch September 8, 2023 15:26