Add HPO<->UMLS and HPO<->MeSH mapping to MedGen release #18

Merged
merged 2 commits into from
Apr 8, 2024
2 changes: 2 additions & 0 deletions .github/workflows/buid_and_release.yml
@@ -39,3 +39,5 @@ jobs:
 output/release/medgen.obo
 output/release/medgen-disease-extract.obo
 output/release/medgen-xrefs.robot.template.tsv
+output/release/umls-hpo.sssom.tsv
+output/release/hpo-mesh.sssom.tsv
Contributor Author

@joeflack4 joeflack4 Mar 24, 2024

SSSOM Outputs

I ran a new build and created a release which includes these. Here are some sample rows.

Questions

  1. Double checking that you 100% want only UMLS mappings
    Should MedGen-derived prefixes just be "UMLS", or one of ("UMLS", "MEDGENCUI", or "MEDGEN") if/where applicable?
    UMLS: These currently make up 100% of the subjects in hpo-umls.sssom.tsv.
    MEDGENCUI: Note that there were no MEDGENCUI<->HPO mappings in this set in MedGenIDMappings.txt. I did see your instruction in "Finalise MedGen xref table #15": "Remove all rows with MEDGENCUI in it (we only need xrefs to MEDGEN:123 and UMLS:123)" (please see & respond to my related question in that issue). However, I applied our previous logic discussed in "Refactoring > Prefix assignment", and verified afterwards that there was no MEDGENCUI.
    MEDGEN: These are IDs that have neither CN nor C at the beginning; they are UIDs. I did not check whether we could actually get HPO<->MEDGEN mappings. Do you want them?
  2. Subject/object order & filename
    I think I renamed hpo-umls.sssom.tsv incorrectly. Just realizing this now. I have the subjects as UMLS. But HPO comes first in the filename; so perhaps I should make it so that the position is consistent? Does it matter to you which is the subject and which is the object?
  3. Are my columns good?
    I could include mapping_justification, but I don't know their process, and I don't know if we can say if there is or isn't variability in how they do their mappings.
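
On question 2, making the subject/object order match the filename is mechanical, because skos:exactMatch is a symmetric predicate. A minimal sketch, assuming rows are held as plain dicts keyed by SSSOM column names; `invert_mappings` is a hypothetical helper and is not part of this PR or of the sssom toolkit:

```python
from typing import Dict, List

# Predicates that read the same in both directions. Anything else would need
# an explicit inverse predicate rather than a plain column swap.
SYMMETRIC_PREDICATES = {'skos:exactMatch'}


def invert_mappings(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Swap the subject_* and object_* columns of each mapping row (hypothetical helper)."""
    inverted = []
    for row in rows:
        if row['predicate_id'] not in SYMMETRIC_PREDICATES:
            raise ValueError(f"Cannot blindly invert predicate: {row['predicate_id']}")
        new_row = {}
        for key, value in row.items():
            if key.startswith('subject_'):
                new_row['object_' + key[len('subject_'):]] = value
            elif key.startswith('object_'):
                new_row['subject_' + key[len('object_'):]] = value
            else:
                new_row[key] = value
        inverted.append(new_row)
    return inverted


# Sample row taken from the hpo-umls.sssom.tsv excerpt in this comment.
rows = [{
    'subject_id': 'UMLS:C0000727',
    'subject_label': 'Acute abdomen',
    'predicate_id': 'skos:exactMatch',
    'object_id': 'HP:0033400',
}]
inverted_rows = invert_mappings(rows)
print(inverted_rows)
```

Note that the label travels with its identifier, so a UMLS-side `subject_label` becomes an `object_label` after the swap.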

hpo-umls.sssom.tsv

| subject_id | subject_label | predicate_id | object_id |
|---|---|---|---|
| UMLS:C0000727 | Acute abdomen | skos:exactMatch | HP:0033400 |
| UMLS:C0000729 | Abdominal cramps | skos:exactMatch | HP:0032155 |
| UMLS:C0000731 | Abdominal distention | skos:exactMatch | HP:0003270 |

hpo-mesh.sssom.tsv

| subject_id | predicate_id | object_id |
|---|---|---|
| HP:0000003 | skos:exactMatch | MESH:D021782 |
| HP:0000005 | skos:exactMatch | MESH:D040582 |
| HP:0000011 | skos:exactMatch | MESH:D001750 |

This one doesn't have subject_label or object_label because I couldn't guarantee that the label in MedGenIDMappings.txt was an accurate reflection of either source in 100% of cases. I imagine these labels are coming either from UMLS or from MedGen.

hpo-mesh-no-matches.sssom.tsv

hpo-mesh-no-matches.sssom.tsv.zip

| subject_id | predicate_id | object_id | umls_id | umls_label |
|---|---|---|---|---|
| | skos:exactMatch | MESH:C000591739 | UMLS:C1970109 | Aromatase excess syndrome |
| | skos:exactMatch | MESH:C000596385 | UMLS:C3495589 | Jalili syndrome |
| | skos:exactMatch | MESH:C000597569 | UMLS:C4042185 | Teratoid Rhabdoid Tumor |

Not included in the release. This is for analysis. I just wanted you to see the cases where there were no matches. There were only about 2,300 HPO<->MeSH mappings that could be derived out of the ~16,000 MeSH terms that have UMLS mappings. I included umls_id and umls_label in this set for review. It also includes all of the rows for all of the matches; not just the non-matches. I sorted the non-matches to the top.
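
The derivation described here is a two-hop join: each MeSH term maps to a UMLS CUI, and only some of those CUIs also map to an HPO term. A rough sketch with plain dicts rather than the pandas merge used in the PR. The CUI `UMLS:C_PLACEHOLDER` is a made-up identifier (the samples do not show which CUI links a given HP/MeSH pair), and `bridge` is a hypothetical helper:

```python
from typing import Dict, List, Optional, Tuple

# UMLS -> HPO lookup, as in hpo-umls.sssom.tsv.
# 'UMLS:C_PLACEHOLDER' is a made-up CUI used only for illustration.
umls_to_hpo: Dict[str, str] = {'UMLS:C_PLACEHOLDER': 'HP:0000003'}

# (UMLS, MeSH) pairs, as they would come from the MeSH rows of the mapping file.
umls_mesh_pairs: List[Tuple[str, str]] = [
    ('UMLS:C_PLACEHOLDER', 'MESH:D021782'),  # CUI assumed for illustration
    ('UMLS:C1970109', 'MESH:C000591739'),    # from the non-match sample: no HPO counterpart
]


def bridge(
    pairs: List[Tuple[str, str]], hpo_lookup: Dict[str, str]
) -> List[Dict[str, Optional[str]]]:
    """Left-join MeSH rows to HPO through the shared UMLS CUI (hypothetical helper)."""
    rows: List[Dict[str, Optional[str]]] = []
    for umls_id, mesh_id in pairs:
        rows.append({
            'subject_id': hpo_lookup.get(umls_id),  # None when there is no HPO match
            'predicate_id': 'skos:exactMatch',
            'object_id': mesh_id,
            'umls_id': umls_id,
        })
    # Sort non-matches to the top, then by subject and object.
    rows.sort(key=lambda r: (r['subject_id'] is not None, r['subject_id'] or '', r['object_id']))
    return rows


all_rows = bridge(umls_mesh_pairs, umls_to_hpo)
matches = [r for r in all_rows if r['subject_id'] is not None]
print(len(all_rows), len(matches))
```

Keeping `umls_id` on every row is what makes the review file useful: the non-matches can be traced back to the CUI that failed to resolve.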

Member

@matentzn matentzn Apr 4, 2024

> Double checking that you 100% want only UMLS mappings

For now, yes. Maybe in the future (but not now) we will look for MedGen mappings as well, but for now, we only want HPO-UMLS mappings.

  • @joeflack4 Ensure this is done (probably already so)

> I think I renamed hpo-umls.sssom.tsv incorrectly. Just realizing this now. I have the subjects as UMLS. But HPO comes first in the filename; so perhaps I should make it so that the position is consistent? Does it matter to you which is the subject and which is the object?

It's cosmetic; ask your own inner sense of style. I like the order, but won't insist on it if it takes a lot of time to fix.

> I could include mapping_justification, but I don't know their process, and I don't know if we can say if there is or isn't variability in how they do their mappings.

Please use the SSSOM toolkit, as you want to release a valid SSSOM file. mapping_justification is mandatory. Since it's MedGen, I would tend to use semapv:ManualMappingCuration.
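
Since mapping_justification is required, one option is to stamp a constant value on every row and check the required columns before handing the table to the SSSOM toolkit. A minimal sketch in plain Python. `REQUIRED_COLUMNS` reflects my understanding of the mandatory SSSOM mapping slots, and both helpers are hypothetical, so `sssom validate` remains the authority on validity:

```python
from typing import Dict, List

# Assumed set of mandatory per-mapping columns; verify against the SSSOM spec.
REQUIRED_COLUMNS = ('subject_id', 'predicate_id', 'object_id', 'mapping_justification')


def add_justification(
    rows: List[Dict[str, str]], justification: str = 'semapv:ManualMappingCuration'
) -> List[Dict[str, str]]:
    """Return copies of the rows with a constant mapping_justification (hypothetical helper)."""
    return [{**row, 'mapping_justification': justification} for row in rows]


def missing_columns(rows: List[Dict[str, str]]) -> List[str]:
    """List required columns that are absent or empty in at least one row."""
    return [col for col in REQUIRED_COLUMNS if any(not row.get(col) for row in rows)]


# Sample row taken from the hpo-mesh.sssom.tsv excerpt earlier in this thread.
rows = [{'subject_id': 'HP:0000003', 'predicate_id': 'skos:exactMatch', 'object_id': 'MESH:D021782'}]
before = missing_columns(rows)
after = missing_columns(add_justification(rows))
print(before, after)
```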

Contributor Author

I edited your comment w/ checkboxes. Will do.

Contributor Author

All done!

joeflack4 marked this conversation as resolved.
2 changes: 1 addition & 1 deletion README.md
@@ -12,7 +12,7 @@ local development situations / debugging.
 - a. `docker pull obolibrary/odkfull:latest`
 - b. `docker pull obolibrary/odkfull:dev`

-## Setup
+## Local development setup
 1. Give permission to run Perl: `chmod +x ./bin/*.pl`
 2. Install Python dependencies: `pip install -r requirements.txt`

3 changes: 2 additions & 1 deletion config/medgen.sssom-metadata.yml
@@ -1,4 +1,4 @@
-creator_id: 0000-0002-2906-7319
+creator_id: orcid:0000-0002-2906-7319
 curie_map:
   GTR: http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/GTR/
   HP: http://purl.obolibrary.org/obo/HP_
@@ -9,6 +9,7 @@ curie_map:
   NCIT: http://purl.obolibrary.org/obo/NCIT_
   OMIM: https://omim.org/entry/
   Orphanet: http://www.orpha.net/ORDO/Orphanet_
+  orcid: https://orcid.org/
   SCTID: http://identifiers.org/snomedct/
   UMLS: http://purl.obolibrary.org/obo/UMLS_
   oboInOwl: http://www.geneontology.org/formats/oboInOwl#
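
The creator_id change matters because a bare ORCID is not a CURIE; once the orcid prefix is registered in curie_map, the value can be expanded like any other identifier. A simplified sketch of that expansion, using a plain dict in place of the curies converter that the pipeline uses; `expand` is a hypothetical helper and the map below is only a subset of the config:

```python
from typing import Dict

# Subset of the curie_map from config/medgen.sssom-metadata.yml.
CURIE_MAP: Dict[str, str] = {
    'HP': 'http://purl.obolibrary.org/obo/HP_',
    'UMLS': 'http://purl.obolibrary.org/obo/UMLS_',
    'orcid': 'https://orcid.org/',
}


def expand(curie: str, prefix_map: Dict[str, str]) -> str:
    """Expand a prefix:local_id CURIE to a full URI (hypothetical helper)."""
    prefix, sep, local_id = curie.partition(':')
    if not sep or prefix not in prefix_map:
        raise ValueError(f'Not a CURIE with a known prefix: {curie}')
    return prefix_map[prefix] + local_id


creator_uri = expand('orcid:0000-0002-2906-7319', CURIE_MAP)
term_uri = expand('HP:0000003', CURIE_MAP)
print(creator_uri)
print(term_uri)

# The old, unprefixed creator_id cannot be expanded at all.
try:
    expand('0000-0002-2906-7319', CURIE_MAP)
    bare_id_rejected = False
except ValueError:
    bare_id_rejected = True
```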
36 changes: 23 additions & 13 deletions makefile
@@ -1,8 +1,9 @@
 # MedGen ingest
 # Running `make all` will run the full pipeline. Note that if the FTP files have already been downloaded, it'll skip
 # that part. In order to force re-download, run `make all -B`.
+# todo: remove parts of old make/perl pipeline no longer used
 .DEFAULT_GOAL := all
-.PHONY: all build stage stage-% analyze clean deploy-release build-lite minimal
+.PHONY: all build stage stage-% analyze clean deploy-release build-lite minimal sssom

 OBO=http://purl.obolibrary.org/obo
 PRODUCTS=medgen-disease-extract.obo medgen-disease-extract.owl
@@ -14,10 +15,10 @@ minimal: build-lite stage-lite clean
 stage-lite: | output/release/
 # mv medgen-disease-extract.owl output/release/
 # mv medgen.sssom.tsv output/release/
-	mv medgen.obo output/release/
-	mv medgen-disease-extract.obo output/release/
-	mv medgen-xrefs.robot.template.tsv output/release/
-build-lite: medgen-disease-extract.obo medgen-xrefs.robot.template.tsv
+	mv *.obo output/release/
+	mv *.robot.template.tsv output/release/
+	mv *.sssom.tsv output/release/
+build-lite: medgen-disease-extract.obo medgen-xrefs.robot.template.tsv sssom

 all: build stage clean analyze
 # analyze: runs more than just this file; that goal creates multiple files
@@ -50,6 +51,11 @@ ftp.ncbi.nlm.nih.gov/:
 uid2cui.tsv: ftp.ncbi.nlm.nih.gov/
 	./src/make_uid2cui.pl > $@

+ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt: ftp.ncbi.nlm.nih.gov/
+	if [ -f "ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt.gz" ]; then \
+		gzip -dk ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt.gz; \
+	fi
+
 # ----------------------------------------
 # Main artefacts
 # ----------------------------------------
@@ -73,11 +79,18 @@ medgen-disease-extract.owl: medgen-disease-extract.obo
 	owltools $< -o $@

 # SSSOM ----------------------------------
-medgen.obographs.json:
-	robot convert -i medgen-disease-extract.owl -o $@
-
-medgen.sssom.tsv: medgen.obographs.json
-	sssom parse medgen.obographs.json -I obographs-json -m config/medgen.sssom-metadata.yml -o $@
+# todo: commented out old pipeline: remove
+#medgen.obographs.json:
+# robot convert -i medgen-disease-extract.owl -o $@
+#
+#medgen.sssom.tsv: medgen.obographs.json
+# sssom parse medgen.obographs.json -I obographs-json -m config/medgen.sssom-metadata.yml -o $@
+sssom: umls-hpo.sssom.tsv
+	sssom validate umls-hpo.sssom.tsv
Contributor Author

SSSOM validation

As with the ICD11 PR, I also added SSSOM validation here.

+	sssom validate hpo-mesh.sssom.tsv
+
+umls-hpo.sssom.tsv hpo-mesh.sssom.tsv: ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt
+	python src/create_sssom.py --input-mappings $< --input-sssom-config config/medgen.sssom-metadata.yml

 # ----------------------------------------
 # Cycles
@@ -106,9 +119,6 @@ output/medgen_terms_mapping_status.tsv output/obsoleted_medgen_terms_in_mondo.tx
 # ----------------------------------------
 # Robot templates
 # ----------------------------------------
-ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt: ftp.ncbi.nlm.nih.gov/
-	gzip -d ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt.gz
-
 # todo: Ideally I wanted this done at the end of the ingest, permuting from medgen.sssom.tsv, but there were some
 # problems with that file. Eventually changing to that feels like it makes more sense. Will have already been
 # pre-curated by disease. And some of the logic in this Python script is duplicative.
2 changes: 2 additions & 0 deletions requirements-unlocked.txt
@@ -1 +1,3 @@
 pandas
+pyyaml
+sssom
47 changes: 47 additions & 0 deletions requirements.txt
@@ -1,14 +1,61 @@
annotated-types==0.6.0
attrs==23.2.0
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
curies==0.7.9
Deprecated==1.2.14
deprecation==2.1.0
distlib==0.3.6
exceptiongroup==1.2.0
filelock==3.9.0
hbreader==0.9.1
idna==3.6
importlib_resources==6.4.0
iniconfig==2.0.0
isodate==0.6.1
json-flattener==0.1.9
jsonasobj2==1.0.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
linkml-runtime==1.7.5
networkx==3.3
numpy==1.25.1
packaging==24.0
pandas==2.0.3
pansql==0.0.1
pbr==5.11.1
platformdirs==3.1.0
pluggy==1.4.0
prefixcommons==0.1.12
prefixmaps==0.2.3
pydantic==2.6.4
pydantic_core==2.16.3
pyparsing==3.1.2
pytest==8.1.1
pytest-logging==2015.11.4
python-dateutil==2.8.2
PyTrie==0.4.0
pytz==2023.3
PyYAML==6.0.1
rdflib==7.0.0
referencing==0.34.0
requests==2.31.0
rpds-py==0.18.0
scipy==1.13.0
six==1.16.0
sortedcontainers==2.4.0
SPARQLWrapper==2.0.0
SQLAlchemy==2.0.29
sssom==0.4.6
sssom-schema==0.15.2
stevedore==5.0.0
tomli==2.0.1
typing_extensions==4.11.0
tzdata==2023.3
urllib3==2.2.1
validators==0.28.0
virtualenv==20.20.0
virtualenv-clone==0.5.7
virtualenvwrapper==4.8.4
wrapt==1.16.0
1 change: 1 addition & 0 deletions src/__init__.py
@@ -0,0 +1 @@
"""MedGen"""
78 changes: 78 additions & 0 deletions src/create_sssom.py
@@ -0,0 +1,78 @@
"""Create SSSOM outputs"""
from argparse import ArgumentParser
from pathlib import Path

import pandas as pd

from utils import get_mapping_set, write_sssom

SRC_DIR = Path(__file__).parent
PROJECT_DIR = SRC_DIR.parent
FTP_DIR = PROJECT_DIR / "ftp.ncbi.nlm.nih.gov" / "pub" / "medgen"
CONFIG_DIR = PROJECT_DIR / "config"
INPUT_MAPPINGS = str(FTP_DIR / "MedGenIDMappings.txt")
INPUT_CONFIG = str(CONFIG_DIR / "medgen.sssom-metadata.yml")
OUTPUT_FILE_HPO_UMLS = str(PROJECT_DIR / "umls-hpo.sssom.tsv")
OUTPUT_FILE_HPO_MESH = str(PROJECT_DIR / "hpo-mesh.sssom.tsv")


def _filter_and_format_cols(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Filter dataframe by source and format columns."""
    return df[df['source'] == source][['subject_id', 'subject_label', 'predicate_id', 'object_id']]


def run(input_mappings: str = INPUT_MAPPINGS, input_sssom_config: str = INPUT_CONFIG, hpo_match_only_with_umls=True):
    """Create SSSOM outputs

    :param hpo_match_only_with_umls: If True, only create SSSOM outputs for HPO mappings that have UMLS mappings, and
     will filter out other matches. This is purely edge case handling. As of 2024/04/06, 100% of the mappings were UMLS
     anyway."""
    # SSSOM 1: HPO<->UMLS
    df_hpo_umls = get_mapping_set(input_mappings, ['HPO'], add_prefixes=True)
    if hpo_match_only_with_umls:
        df_hpo_umls = df_hpo_umls[df_hpo_umls['subject_id'].str.startswith('UMLS:')]
    df_hpo_umls['mapping_justification'] = 'semapv:ManualMappingCuration'
    write_sssom(df_hpo_umls, input_sssom_config, OUTPUT_FILE_HPO_UMLS)

    # SSSOM 2: HPO<->MeSH
    # - filter
    df_hpo_mesh = get_mapping_set(input_mappings, ['MeSH'], add_prefixes=True)
    # - JOIN data: some cols temporary for temporary report for non-matches
    df_hpo_mesh = pd.merge(df_hpo_mesh, df_hpo_umls, on='subject_id', how='left').rename(columns={
        'subject_id': 'umls_id',
        'subject_label_x': 'umls_label',
        'predicate_id_x': 'predicate_id',
        'object_id_x': 'object_id',
        'object_id_y': 'subject_id',
    })
    # -- sort cols & sort rows & drop unneeded cols (subject_label_y, predicate_id_y)
    df_hpo_mesh = df_hpo_mesh[['subject_id', 'predicate_id', 'object_id', 'umls_id', 'umls_label']].sort_values(
        ['subject_id', 'object_id'], na_position='first')
    # -- add missing prefixes
    df_hpo_mesh['object_id'] = df_hpo_mesh['object_id'].apply(lambda x: 'MESH:' + x)
    # todo: temp; (1) remove later: saving dataset with no matches, for review (2) after remove, will need to
    #  move the col removals below (umls) to above
    # - add mapping_justification
    df_hpo_mesh['mapping_justification'] = 'semapv:ManualMappingCuration'
    write_sssom(df_hpo_mesh, input_sssom_config,
                OUTPUT_FILE_HPO_MESH.replace('.sssom.tsv', '-non-matches-included.sssom.tsv'))
    # -- filter non-matches & drop unneeded cols
    df_hpo_mesh = df_hpo_mesh[df_hpo_mesh['subject_id'].notna()][[
        x for x in df_hpo_mesh.columns if not x.startswith('umls')]]
    write_sssom(df_hpo_mesh, input_sssom_config, OUTPUT_FILE_HPO_MESH)


def cli():
    """Command line interface."""
    parser = ArgumentParser(
        prog='Create SSSOM outputs',
        description='Create SSSOM outputs from MedGen source')
    parser.add_argument(
        '-m', '--input-mappings', default=INPUT_MAPPINGS, help='Path to mapping file sourced from MedGen.')
    parser.add_argument(
        '-c', '--input-sssom-config', default=INPUT_CONFIG, help='Path to SSSOM config yml.')
    run(**vars(parser.parse_args()))


if __name__ == '__main__':
    cli()
36 changes: 10 additions & 26 deletions src/mondo_robot_template.py
@@ -7,12 +7,12 @@
 - Used here: https://github.com/monarch-initiative/mondo/pull/6560
 """
 from argparse import ArgumentParser
-from copy import copy
 from pathlib import Path
-from typing import Dict, List

 import pandas as pd

+from utils import get_mapping_set, add_prefixes_to_plain_id
+
 SRC_DIR = Path(__file__).parent
 PROJECT_DIR = SRC_DIR.parent
 FTP_DIR = PROJECT_DIR / "ftp.ncbi.nlm.nih.gov" / "pub" / "medgen"
@@ -25,37 +25,20 @@
 }


-def _prefixed_id_rows_from_common_df(source_df: pd.DataFrame, mondo_col='mondo_id', xref_col='xref_id') -> List[Dict]:
-    """From worksheets having same common format, get prefixed xrefs for the namespaces we're looking to cover
-
-    Note: This same exact function is used in:
-    - mondo repo: medgen_conflicts_add_xrefs.py
-    - medgen repo: mondo_robot_template.py"""
-    df = copy(source_df)
-    df[xref_col] = df[xref_col].apply(
-        lambda x: f'MEDGENCUI:{x}' if x.startswith('CN')  # "CUI Novel"
-        else f'UMLS:{x}' if x.startswith('C')  # CUI 1 of 2: UMLS
-        else f'MEDGEN:{x}')  # UID
-    rows = df.to_dict('records')
-    # CUI 2 of 2: MEDGENCUI:
-    rows2 = [{mondo_col: x[mondo_col], xref_col: x[xref_col].replace('UMLS', 'MEDGENCUI')} for x in rows if
-             x[xref_col].startswith('UMLS')]
-    return rows + rows2
-
-
 def run(input_file: str = INPUT_FILE, output_file: str = OUTPUT_FILE):
     """Create robot template"""
     # Read input
-    df = pd.read_csv(input_file, sep='|').rename(columns={'#CUI': 'xref_id'})
-
+    df: pd.DataFrame = get_mapping_set(input_file)
     # Get explicit Medgen (CUI, CN) -> Mondo mappings
     df_medgen_mondo = df[df['source'] == 'MONDO'][['source_id', 'xref_id']].rename(columns={'source_id': 'mondo_id'})
-    out_df_cui_cn = pd.DataFrame(_prefixed_id_rows_from_common_df(df_medgen_mondo))
+    out_df_cui_cn = df_medgen_mondo.copy()
+    out_df_cui_cn['xref_id'] = out_df_cui_cn['xref_id'].apply(add_prefixes_to_plain_id)

     # Get Medgen (UID) -> Mondo mappings
     # - Done by proxy: UID <-> CUI <-> MONDO
     df_medgen_medgenuid = df[df['source'] == 'MedGen'][['source_id', 'xref_id']].rename(
         columns={'source_id': 'medgen_uid'})
+    # todo: should some of these steps be in _reformat_mapping_set()? to be utilized by SSSOM files?
     out_df_uid = pd.merge(df_medgen_mondo, df_medgen_medgenuid, on='xref_id').rename(
         columns={'xref_id': 'source_id', 'medgen_uid': 'xref_id'})[['mondo_id', 'xref_id', 'source_id']]
     out_df_uid['xref_id'] = out_df_uid['xref_id'].apply(lambda x: f'MEDGEN:{x}')
@@ -66,15 +49,16 @@ def run(input_file: str = INPUT_FILE, output_file: str = OUTPUT_FILE):
     out_df = pd.concat([pd.DataFrame([ROBOT_ROW_MAP]), out_df])
     out_df.to_csv(output_file, index=False, sep='\t')

+
 def cli():
     """Command line interface."""
     parser = ArgumentParser(
-        prog='"Medgen->Mondo robot template',
+        prog='Medgen->Mondo robot template',
         description='Create a robot template to be used by Mondo to add MedGen xrefs curated by MedGen.')
     parser.add_argument(
-        '-i', '--input-file', default=INPUT_FILE, help='Mapping file sourced from MedGen')
+        '-i', '--input-file', default=INPUT_FILE, help='Path to mapping file sourced from MedGen')
     parser.add_argument(
-        '-o', '--output-file', default=OUTPUT_FILE, help='ROBOT template to be used to add xrefs')
+        '-o', '--output-file', default=OUTPUT_FILE, help='Path to ROBOT template to be used to add xrefs')
     run(**vars(parser.parse_args()))


65 changes: 65 additions & 0 deletions src/utils.py
@@ -0,0 +1,65 @@
"""Utils"""
from pathlib import Path
from typing import Dict, List, Union

import curies
import pandas as pd
import yaml
from sssom import MappingSetDataFrame
from sssom.writers import write_table


def add_prefixes_to_plain_id(x: str) -> str:
    """From plain IDs from original source, add prefixes.

    Terms:
    CN: stands for "CUI Novel". These are created for any MedGen records without UMLS CUI.
    C: stands for "CUI". These are sourced from UMLS.
    CUI: stands for "Concept Unique Identifier"
    UID (Unique IDentifier): These are cases where the id is all digits; does not start with a leading alpha char.
    """
    return f'MEDGENCUI:{x}' if x.startswith('CN') \
        else f'UMLS:{x}' if x.startswith('C') \
        else f'MEDGEN:{x}'


def write_sssom(df: pd.DataFrame, config_path: Union[Path, str], outpath: Union[Path, str]):
    """Writes a SSSOM file"""
    with open(config_path, 'r') as yaml_file:
        metadata: Dict = yaml.load(yaml_file, Loader=yaml.FullLoader)
    converter = curies.Converter.from_prefix_map(metadata['curie_map'])
    msdf: MappingSetDataFrame = MappingSetDataFrame(converter=converter, df=df, metadata=metadata)
    with open(outpath, 'w') as f:
        write_table(msdf, f)


# todo: for the SSSOM use case, it is weird to rename #CUI as xref_id. so maybe _get_mapping_set() should either not
#  be common code for this and the robot template, or add a param to not rename that col
def get_mapping_set(
    inpath: Union[str, Path], filter_sources: List[str] = None, add_prefixes=False, sssomify=True,
) -> pd.DataFrame:
    """Load up MedGen mapping set (MedGenIDMappings.txt), with some modifications."""
    # Read
    df = pd.read_csv(inpath, sep='|').rename(columns={'#CUI': 'xref_id'})
    # Remove empty columns
    empty_cols = [col for col in df.columns if df[col].isnull().all()]  # caused by trailing | at end of each row
    if empty_cols:
        df = df.drop(columns=empty_cols)
    # Add prefixes
    if add_prefixes:
        df['xref_id'] = df['xref_id'].apply(add_prefixes_to_plain_id)
    # Sort
    df = df.sort_values(['xref_id', 'source_id'])
    if filter_sources:
        df = df[df['source'].isin(filter_sources)]
        del df['source']
    # Standardize to SSSOM
    if sssomify:
        df = df.rename(columns={
            'xref_id': 'subject_id',
            'pref_name': 'subject_label',
            'source_id': 'object_id',
        })
        df['predicate_id'] = 'skos:exactMatch'
        df = df[['subject_id', 'subject_label', 'predicate_id', 'object_id']]
    return df
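
To see why get_mapping_set drops an all-empty column: each row of MedGenIDMappings.txt ends with a trailing `|`, which a delimiter-based parser reads as one extra, empty field. A stdlib-only sketch of the same read-then-prefix flow. The column names follow the code above and the first data row is based on the sample in this PR; the `CN000001` row is invented for illustration, and `read_rows` and `add_prefix` are hypothetical stand-ins for the pandas-based helpers:

```python
import csv
import io
from typing import Dict, List

# Two-row stand-in for MedGenIDMappings.txt. The second data row is invented.
SAMPLE = (
    '#CUI|pref_name|source_id|source|\n'
    'C0000727|Acute abdomen|HP:0033400|HPO|\n'
    'CN000001|Invented concept|HP:0000000|HPO|\n'
)


def add_prefix(x: str) -> str:
    """Same branching as add_prefixes_to_plain_id: CN before C, otherwise a UID."""
    if x.startswith('CN'):
        return f'MEDGENCUI:{x}'
    if x.startswith('C'):
        return f'UMLS:{x}'
    return f'MEDGEN:{x}'


def read_rows(text: str) -> List[Dict[str, str]]:
    """Parse pipe-delimited text, discarding the unnamed column left by the trailing '|'."""
    reader = csv.reader(io.StringIO(text), delimiter='|')
    header = next(reader)
    return [
        {name: value for name, value in zip(header, values) if name}
        for values in reader
    ]


parsed = read_rows(SAMPLE)
prefixed = [add_prefix(row['#CUI']) for row in parsed]
print(parsed[0])
print(prefixed, add_prefix('12345'))
```

The order of the checks matters: testing for `C` before `CN` would label every "CUI Novel" identifier as UMLS.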