Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add HPO<->UMLS and HPO<->MeSH mapping to MedGen release #18

Merged
merged 2 commits into from
Apr 8, 2024
Merged

Conversation

joeflack4
Copy link
Contributor

@joeflack4 joeflack4 commented Mar 22, 2024

addresses #16

Changes

SSSOM outputs
- Add: Code for 2 new outputs: hpo-umls.sssom.tsv and hpo-mesh.sssom.tsv

General
- Add: Python requirement: PyYaml
- Update: ORCID in SSSOM config
- Bug fix: Edge case for missing intermediate file in pipeline
- Remove: Intentionally duplicative, no longer needed 'MEDGENCUI' xrefs
- Add: Refactor to load mapping set using function that handles lots of repetitive stuff: get_mapping_set()

@joeflack4 joeflack4 changed the title SSSOM SSSOM outputs: HPO & MeSH Mar 23, 2024
@joeflack4 joeflack4 requested review from matentzn and twhetzel March 23, 2024 20:03
@joeflack4
Copy link
Contributor Author

CC @souzadevinicius Just adding some SSSOM outputs for the MedGen ingest.

@joeflack4 joeflack4 self-assigned this Mar 23, 2024
@joeflack4 joeflack4 added the enhancement New feature or request label Mar 23, 2024
@joeflack4 joeflack4 linked an issue Mar 23, 2024 that may be closed by this pull request
@joeflack4 joeflack4 changed the title SSSOM outputs: HPO & MeSH Add HPO<->UMLS and HPO<->MeSH mapping to MedGen release #16 Mar 23, 2024
@joeflack4 joeflack4 changed the title Add HPO<->UMLS and HPO<->MeSH mapping to MedGen release #16 Add HPO<->UMLS and HPO<->MeSH mapping to MedGen release Mar 23, 2024
src/utils.py Show resolved Hide resolved
@joeflack4 joeflack4 changed the title Add HPO<->UMLS and HPO<->MeSH mapping to MedGen release SSSOM outputs && robot template tweaks Mar 23, 2024
@joeflack4 joeflack4 force-pushed the sssom branch 3 times, most recently from fe3fecb to 4ac6825 Compare March 24, 2024 04:00
makefile Show resolved Hide resolved
@joeflack4 joeflack4 changed the title SSSOM outputs && robot template tweaks SSSOM outputs Mar 24, 2024
src/utils.py Outdated
os.remove(temp_filtered_config_path)


def write_tsv_with_comments(df: pd.DataFrame, comments_file: Union[Path, str], outpath: Union[Path, str]):
Copy link
Contributor Author

@joeflack4 joeflack4 Mar 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General: write_tsv_with_comments()

@matentzn Let me know what the existing best practice is supposed to be for this.
I have this function (and several others it's used with) in the repos for GARD, MedGen, ICD11, a TIMS repo, and possibly elsewhere. Maybe there's a function in sssom-py that does this like I want? I have a dataframe and a path to a sssom-metadata.yml, and I want to write that to a file and include the comment at the top.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep,

  1. instantiate a MappingSetDataFrame() object with the dataframe
  2. use write_table() method to frite the result

See example: https://github.com/mapping-commons/sssom-py/blob/be8a77b5a2ef42cd2b92b74d892d7626dec27471/src/sssom/cli.py#L280

Copy link
Contributor Author

@joeflack4 joeflack4 Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall this a bit now, thanks. Will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done~

Now my write_sssom() function has been rewritten to utilize sssom-py:

def write_sssom(df: pd.DataFrame, config_path: Union[Path, str], outpath: Union[Path, str]):
    """Writes a SSSOM file"""
    with open(config_path, 'r') as yaml_file:
        metadata: Dict = yaml.load(yaml_file, Loader=yaml.FullLoader)
    converter = curies.Converter.from_prefix_map(metadata['curie_map'])
    msdf: MappingSetDataFrame = MappingSetDataFrame(converter=converter, df=df, metadata=metadata)
    with open(outpath, 'w') as f:
        write_table(msdf, f)

As with my icd11 PR which had the same refactor, while working on this I also encountered some possible bugs and UX issues, and made:

src/utils.py Outdated Show resolved Hide resolved
@@ -39,3 +39,5 @@ jobs:
output/release/medgen.obo
output/release/medgen-disease-extract.obo
output/release/medgen-xrefs.robot.template.tsv
output/release/hpo-umls.sssom.tsv
output/release/hpo-mesh.sssom.tsv
Copy link
Contributor Author

@joeflack4 joeflack4 Mar 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SSSOM Outputs

I ran a new build and created a release which includes these. Here are some sample rows.

Questions

  1. Double checking that you 100% want only UMLS mappings
    Should MedGen-derived prefixes just be "UMLS", or one of ("UMLS", "MEDGENCUI", or "MEDGEN") if/where applicable?
    UMLS: Compose 100% of the subjects in hpo-umls.sssom.tsv currently.
    MEDGENCUI: Note that there were no MEDGENCUI<->HPO mappings in this set in MedGenIDMappings.txt. Also note that I did see your instruction in Finalise MedGen xref table #15: Remove all rows with MEDGENCUI in it (we only need xrefs to MEDGEN:123 and UMLS:123) (please see & respond to my related question in that issue). However, applied our previous logic discussed in "Refactoring > Prefix assignment", and verified afterwards that there was no MEDGENCUI.
    MEDGEN: These are IDs that have neither CN or C at the beginning; they are UIDs. I did not check to see if we could actually get HPO<->MEDGEN mappings. Do you want them?
  2. Subject/object order & filename
    I think I renamed hpo-umls.sssom.tsv incorrectly. Just realizing this now. I have the subjects as UMLS. But HPO comes first in the filename; so perhaps I should make it so that the position is consistent? Does it matter to you which is the subject and which is the object?
  3. Are my columns good?
    I could include mapping_justification, but I don't know their process, and I don't know if we can say if there is or isn't variability in how the do their mappings.

hpo-umls.sssom.tsv

subject_id subject_label predicate_id object_id
UMLS:C0000727 Acute abdomen skos:exactMatch HP:0033400
UMLS:C0000729 Abdominal cramps skos:exactMatch HP:0032155
UMLS:C0000731 Abdominal distention skos:exactMatch HP:0003270

hpo-mesh.sssom.tsv

subject_id predicate_id object_id
HP:0000003 skos:exactMatch MESH:D021782
HP:0000005 skos:exactMatch MESH:D040582
HP:0000011 skos:exactMatch MESH:D001750

This one doesn't have subject_label or object_label because I couldn't guarantee that the label in MedGenIDMappings.txt was an accurate reflection of either source in 100% of cases. I imagine these labels are coming either from UMLS or from MedGen.

hpo-mesh-no-matches.sssom.tsv

hpo-mesh-no-matches.sssom.tsv.zip

subject_id predicate_id object_id umls_id umls_label
skos:exactMatch MESH:C000591739 UMLS:C1970109 Aromatase excess syndrome
skos:exactMatch MESH:C000596385 UMLS:C3495589 Jalili syndrome
skos:exactMatch MESH:C000597569 UMLS:C4042185 Teratoid Rhabdoid Tumor

Not included in the release. This is for analysis. I just wanted you to see the cases where there were no matches. There were only about 2,300 HPO<->MeSH mappings that could be derived out of the ~16,000 MeSH terms that have UMLS mappings. I included umls_id and umls_label in this set for review. It also includes all of the rows for all of the matches; not just the non-matches. I sorted the non-matches to the top.

Copy link
Member

@matentzn matentzn Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double checking that you 100% want only UMLS mappings

For now yes. Maybe in the future (but not now) we will look for medgen mappings as well, but for now, we only want HPO-UMLS mappings.

  • @joeflack4 Ensure this is done (probably already so)

I think I renamed hpo-umls.sssom.tsv incorrectly. Just realizing this now. I have the subjects as UMLS. But HPO comes first in the filename; so perhaps I should make it so that the position is consistent? Does it matter to you which is the subject and which is the object?

Its cosmetic, ask your own inner sense of style. I like the order, but wont insist on it if it takes a lot of time to fix

I could include mapping_justification, but I don't know their process, and I don't know if we can say if there is or isn't variability in how the do their mappings.

Please use sssom toolkit as you want to release a valid sssom file. mapping_justification is mandatory. Since its MedGen I would tend to use semapv:ManualMappingCuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I edited your comment w/ checkboxes. Will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All done!

@joeflack4 joeflack4 changed the title SSSOM outputs Add HPO<->UMLS and HPO<->MeSH mapping to MedGen release Mar 24, 2024
@joeflack4 joeflack4 force-pushed the sssom branch 6 times, most recently from 91e9e37 to bff4784 Compare March 30, 2024 00:09
- Add: Code for 2 new outputs: hpo-umls.sssom.tsv and hpo-mesh.sssom.tsv

General
- Add: Python requirement: PyYaml
- Update: ORCID in SSSOM config
- Bug fix: Edge case for missing intermediate file in pipeline
- Remove: Intentionally duplicative, no longer needed 'MEDGENCUI' xrefs
- Add: Refactor to load mapping set using function that handles lots of repetitive stuff: get_mapping_set()
- Update: Utilizing standard sssom-py functionality
- Add: sssom validate
- Add: mapping_justification
- Update: Subject/object order & filename
- Add: Mandatory filtering of UMLS matches only in HPO-UMLS SSSOM.
#medgen.sssom.tsv: medgen.obographs.json
# sssom parse medgen.obographs.json -I obographs-json -m config/medgen.sssom-metadata.yml -o $@
sssom: umls-hpo.sssom.tsv
sssom validate umls-hpo.sssom.tsv
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SSSOM validation

As with the ICD11 PR, I also added SSSOM validation here.

@joeflack4
Copy link
Contributor Author

@matentzn @twhetzel All's good with this one now as well, I believe.

Copy link
Member

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not do a detailed code review, but what I saw looks good! Thanks!

@joeflack4
Copy link
Contributor Author

@matentzn Roger.
@twhetzel This has been amply reviewed; merging now.

@joeflack4 joeflack4 merged commit 0b8e063 into main Apr 8, 2024
@joeflack4 joeflack4 deleted the sssom branch April 8, 2024 22:27
This was referenced Apr 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add HPO<->UMLS and HPO<->MeSH mapping to MedGen release
2 participants