Add HPO<->UMLS and HPO<->MeSH mapping to MedGen release #18
Merged
2 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,8 +1,9 @@ | ||
# MedGen ingest | ||
# Running `make all` will run the full pipeline. Note that if the FTP files have already been downloaded, it'll skip | ||
# that part. In order to force re-download, run `make all -B`. | ||
# todo: remove parts of old make/perl pipeline no longer used | ||
.DEFAULT_GOAL := all | ||
.PHONY: all build stage stage-% analyze clean deploy-release build-lite minimal | ||
.PHONY: all build stage stage-% analyze clean deploy-release build-lite minimal sssom | ||
|
||
OBO=http://purl.obolibrary.org/obo | ||
PRODUCTS=medgen-disease-extract.obo medgen-disease-extract.owl | ||
|
@@ -14,10 +15,10 @@ minimal: build-lite stage-lite clean | |
stage-lite: | output/release/ | ||
# mv medgen-disease-extract.owl output/release/ | ||
# mv medgen.sssom.tsv output/release/ | ||
mv medgen.obo output/release/ | ||
mv medgen-disease-extract.obo output/release/ | ||
mv medgen-xrefs.robot.template.tsv output/release/ | ||
build-lite: medgen-disease-extract.obo medgen-xrefs.robot.template.tsv | ||
mv *.obo output/release/ | ||
mv *.robot.template.tsv output/release/ | ||
mv *.sssom.tsv output/release/ | ||
build-lite: medgen-disease-extract.obo medgen-xrefs.robot.template.tsv sssom | ||
|
||
all: build stage clean analyze | ||
# analyze: runs more than just this file; that goal creates multiple files | ||
|
@@ -50,6 +51,11 @@ ftp.ncbi.nlm.nih.gov/: | |
uid2cui.tsv: ftp.ncbi.nlm.nih.gov/ | ||
./src/make_uid2cui.pl > $@ | ||
|
||
ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt: ftp.ncbi.nlm.nih.gov/ | ||
if [ -f "ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt.gz" ]; then \ | ||
gzip -dk ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt.gz; \ | ||
fi | ||
|
||
# ---------------------------------------- | ||
# Main artefacts | ||
# ---------------------------------------- | ||
|
@@ -73,11 +79,18 @@ medgen-disease-extract.owl: medgen-disease-extract.obo | |
owltools $< -o $@ | ||
|
||
# SSSOM ---------------------------------- | ||
medgen.obographs.json: | ||
robot convert -i medgen-disease-extract.owl -o $@ | ||
|
||
medgen.sssom.tsv: medgen.obographs.json | ||
sssom parse medgen.obographs.json -I obographs-json -m config/medgen.sssom-metadata.yml -o $@ | ||
# todo: commented out old pipeline: remove | ||
#medgen.obographs.json: | ||
# robot convert -i medgen-disease-extract.owl -o $@ | ||
# | ||
#medgen.sssom.tsv: medgen.obographs.json | ||
# sssom parse medgen.obographs.json -I obographs-json -m config/medgen.sssom-metadata.yml -o $@ | ||
sssom: umls-hpo.sssom.tsv | ||
sssom validate umls-hpo.sssom.tsv | ||
Review comment (SSSOM validation): As with the ICD11 PR, I also added SSSOM validation here. |
||
sssom validate hpo-mesh.sssom.tsv | ||
|
||
umls-hpo.sssom.tsv hpo-mesh.sssom.tsv: ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt | ||
python src/create_sssom.py --input-mappings $< --input-sssom-config config/medgen.sssom-metadata.yml | ||
|
||
|
||
# ---------------------------------------- | ||
# Cycles | ||
|
@@ -106,9 +119,6 @@ output/medgen_terms_mapping_status.tsv output/obsoleted_medgen_terms_in_mondo.tx | |
# ---------------------------------------- | ||
# Robot templates | ||
# ---------------------------------------- | ||
ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt: ftp.ncbi.nlm.nih.gov/ | ||
gzip -d ftp.ncbi.nlm.nih.gov/pub/medgen/MedGenIDMappings.txt.gz | ||
|
||
# todo: Ideally I wanted this done at the end of the ingest, permuting from medgen.sssom.tsv, but there were some | ||
# problems with that file. Eventually changing to that feels like it makes more sense. Will have already been | ||
# pre-curated by disease. And some of the logic in this Python script is duplicative. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,3 @@ | ||
pandas | ||
pyyaml | ||
sssom |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,14 +1,61 @@ | ||
annotated-types==0.6.0 | ||
attrs==23.2.0 | ||
certifi==2024.2.2 | ||
charset-normalizer==3.3.2 | ||
click==8.1.7 | ||
curies==0.7.9 | ||
Deprecated==1.2.14 | ||
deprecation==2.1.0 | ||
distlib==0.3.6 | ||
exceptiongroup==1.2.0 | ||
filelock==3.9.0 | ||
hbreader==0.9.1 | ||
idna==3.6 | ||
importlib_resources==6.4.0 | ||
iniconfig==2.0.0 | ||
isodate==0.6.1 | ||
json-flattener==0.1.9 | ||
jsonasobj2==1.0.4 | ||
jsonschema==4.21.1 | ||
jsonschema-specifications==2023.12.1 | ||
linkml-runtime==1.7.5 | ||
networkx==3.3 | ||
numpy==1.25.1 | ||
packaging==24.0 | ||
pandas==2.0.3 | ||
pansql==0.0.1 | ||
pbr==5.11.1 | ||
platformdirs==3.1.0 | ||
pluggy==1.4.0 | ||
prefixcommons==0.1.12 | ||
prefixmaps==0.2.3 | ||
pydantic==2.6.4 | ||
pydantic_core==2.16.3 | ||
pyparsing==3.1.2 | ||
pytest==8.1.1 | ||
pytest-logging==2015.11.4 | ||
python-dateutil==2.8.2 | ||
PyTrie==0.4.0 | ||
pytz==2023.3 | ||
PyYAML==6.0.1 | ||
rdflib==7.0.0 | ||
referencing==0.34.0 | ||
requests==2.31.0 | ||
rpds-py==0.18.0 | ||
scipy==1.13.0 | ||
six==1.16.0 | ||
sortedcontainers==2.4.0 | ||
SPARQLWrapper==2.0.0 | ||
SQLAlchemy==2.0.29 | ||
sssom==0.4.6 | ||
sssom-schema==0.15.2 | ||
stevedore==5.0.0 | ||
tomli==2.0.1 | ||
typing_extensions==4.11.0 | ||
tzdata==2023.3 | ||
urllib3==2.2.1 | ||
validators==0.28.0 | ||
virtualenv==20.20.0 | ||
virtualenv-clone==0.5.7 | ||
virtualenvwrapper==4.8.4 | ||
wrapt==1.16.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
"""MedGen""" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,78 @@ | ||
"""Create SSSOM outputs"""
from argparse import ArgumentParser
from pathlib import Path

import pandas as pd

from utils import get_mapping_set, write_sssom

SRC_DIR = Path(__file__).parent
PROJECT_DIR = SRC_DIR.parent
FTP_DIR = PROJECT_DIR / "ftp.ncbi.nlm.nih.gov" / "pub" / "medgen"
CONFIG_DIR = PROJECT_DIR / "config"
INPUT_MAPPINGS = str(FTP_DIR / "MedGenIDMappings.txt")
INPUT_CONFIG = str(CONFIG_DIR / "medgen.sssom-metadata.yml")
OUTPUT_FILE_HPO_UMLS = str(PROJECT_DIR / "umls-hpo.sssom.tsv")
OUTPUT_FILE_HPO_MESH = str(PROJECT_DIR / "hpo-mesh.sssom.tsv")


def _filter_and_format_cols(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """Filter dataframe by source and format columns."""
    return df[df['source'] == source][['subject_id', 'subject_label', 'predicate_id', 'object_id']]


def run(input_mappings: str = INPUT_MAPPINGS, input_sssom_config: str = INPUT_CONFIG, hpo_match_only_with_umls=True):
    """Create SSSOM outputs

    :param hpo_match_only_with_umls: If True, only create SSSOM outputs for HPO mappings that have UMLS mappings, and
    will filter out other matches. This is purely edge case handling. As of 2024/04/06, 100% of the mappings were UMLS
    anyway."""
    # SSSOM 1: HPO<->UMLS
    df_hpo_umls = get_mapping_set(input_mappings, ['HPO'], add_prefixes=True)
    if hpo_match_only_with_umls:
        df_hpo_umls = df_hpo_umls[df_hpo_umls['subject_id'].str.startswith('UMLS:')]
    df_hpo_umls['mapping_justification'] = 'semapv:ManualMappingCuration'
    write_sssom(df_hpo_umls, input_sssom_config, OUTPUT_FILE_HPO_UMLS)

    # SSSOM 2: HPO<->MeSH
    # - filter
    df_hpo_mesh = get_mapping_set(input_mappings, ['MeSH'], add_prefixes=True)
    # - JOIN data: some cols temporary for temporary report for non-matches
    df_hpo_mesh = pd.merge(df_hpo_mesh, df_hpo_umls, on='subject_id', how='left').rename(columns={
        'subject_id': 'umls_id',
        'subject_label_x': 'umls_label',
        'predicate_id_x': 'predicate_id',
        'object_id_x': 'object_id',
        'object_id_y': 'subject_id',
    })
    # -- sort cols & sort rows & drop unneeded cols (subject_label_y, predicate_id_y)
    df_hpo_mesh = df_hpo_mesh[['subject_id', 'predicate_id', 'object_id', 'umls_id', 'umls_label']].sort_values(
        ['subject_id', 'object_id'], na_position='first')
    # -- add missing prefixes
    df_hpo_mesh['object_id'] = df_hpo_mesh['object_id'].apply(lambda x: 'MESH:' + x)
    # todo: temp; (1) remove later: saving dataset with no matches, for review (2) after remove, will need to
    #  move the col removals below (umls) to above
    # - add mapping_justification
    df_hpo_mesh['mapping_justification'] = 'semapv:ManualMappingCuration'
    write_sssom(df_hpo_mesh, input_sssom_config,
                OUTPUT_FILE_HPO_MESH.replace('.sssom.tsv', '-non-matches-included.sssom.tsv'))
    # -- filter non-matches & drop unneeded cols
    df_hpo_mesh = df_hpo_mesh[df_hpo_mesh['subject_id'].notna()][[
        x for x in df_hpo_mesh.columns if not x.startswith('umls')]]
    write_sssom(df_hpo_mesh, input_sssom_config, OUTPUT_FILE_HPO_MESH)


def cli():
    """Command line interface."""
    parser = ArgumentParser(
        prog='Create SSSOM outputs',
        description='Create SSSOM outputs from MedGen source')
    parser.add_argument(
        '-m', '--input-mappings', default=INPUT_MAPPINGS, help='Path to mapping file sourced from MedGen.')
    parser.add_argument(
        '-c', '--input-sssom-config', default=INPUT_CONFIG, help='Path to SSSOM config yml.')
    run(**vars(parser.parse_args()))


if __name__ == '__main__':
    cli()
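The HPO<->MeSH step in the script above derives its mappings by joining two UMLS-keyed tables on the shared UMLS CUI. The following is a minimal sketch of that idea on toy data, not the script itself: the IDs are made up, and `join_hpo_mesh` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

# Toy UMLS->HPO and UMLS->MeSH rows; all IDs below are made up for illustration.
umls_hpo = pd.DataFrame({
    "subject_id": ["UMLS:C0000001", "UMLS:C0000002"],
    "object_id": ["HP:0000001", "HP:0000002"],
})
umls_mesh = pd.DataFrame({
    "subject_id": ["UMLS:C0000001", "UMLS:C0000003"],
    "object_id": ["D000001", "D000003"],
})


def join_hpo_mesh(mesh: pd.DataFrame, hpo: pd.DataFrame) -> pd.DataFrame:
    """Left-join MeSH rows to HPO rows on the shared UMLS CUI (hypothetical helper)."""
    merged = pd.merge(mesh, hpo, on="subject_id", how="left", suffixes=("_mesh", "_hpo"))
    merged = merged.rename(columns={
        "subject_id": "umls_id",        # the join key becomes a review-only column
        "object_id_hpo": "subject_id",  # HPO term becomes the SSSOM subject
        "object_id_mesh": "object_id",  # MeSH term becomes the SSSOM object
    })
    merged["object_id"] = "MESH:" + merged["object_id"]
    merged["predicate_id"] = "skos:exactMatch"
    # MeSH rows without an HPO partner have NaN in subject_id; sort them to the top for review.
    ordered = merged[["subject_id", "predicate_id", "object_id", "umls_id"]].sort_values(
        ["subject_id", "object_id"], na_position="first")
    return ordered.reset_index(drop=True)


with_non_matches = join_hpo_mesh(umls_mesh, umls_hpo)
matches_only = with_non_matches[with_non_matches["subject_id"].notna()].drop(columns=["umls_id"])
```

Keeping `umls_id` until the final filter is what makes the non-match report possible; the released file only needs the rows where the join found an HPO term.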
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
"""Utils"""
from pathlib import Path
from typing import Dict, List, Union

import curies
import pandas as pd
import yaml
from sssom import MappingSetDataFrame
from sssom.writers import write_table


def add_prefixes_to_plain_id(x: str) -> str:
    """From plain IDs from original source, add prefixes.

    Terms:
    CN: stands for "CUI Novel". These are created for any MedGen records without UMLS CUI.
    C: stands for "CUI". These are sourced from UMLS.
    CUI: stands for "Concept Unique Identifier"
    UID (Unique IDentifier): These are cases where the id is all digits; does not start with a leading alpha char.
    """
    return f'MEDGENCUI:{x}' if x.startswith('CN') \
        else f'UMLS:{x}' if x.startswith('C') \
        else f'MEDGEN:{x}'


def write_sssom(df: pd.DataFrame, config_path: Union[Path, str], outpath: Union[Path, str]):
    """Writes a SSSOM file"""
    with open(config_path, 'r') as yaml_file:
        metadata: Dict = yaml.load(yaml_file, Loader=yaml.FullLoader)
    converter = curies.Converter.from_prefix_map(metadata['curie_map'])
    msdf: MappingSetDataFrame = MappingSetDataFrame(converter=converter, df=df, metadata=metadata)
    with open(outpath, 'w') as f:
        write_table(msdf, f)


# todo: for the SSSOM use case, it is weird to rename #CUI as xref_id. so maybe _get_mapping_set() should either not
#  share common code for this and robot template, or add a param to not rename that col
def get_mapping_set(
    inpath: Union[str, Path], filter_sources: List[str] = None, add_prefixes=False, sssomify=True,
) -> pd.DataFrame:
    """Load up MedGen mapping set (MedGenIDMappings.txt), with some modifications."""
    # Read
    df = pd.read_csv(inpath, sep='|').rename(columns={'#CUI': 'xref_id'})
    # Remove empty columns
    empty_cols = [col for col in df.columns if df[col].isnull().all()]  # caused by trailing | at end of each row
    if empty_cols:
        df = df.drop(columns=empty_cols)
    # Add prefixes
    if add_prefixes:
        df['xref_id'] = df['xref_id'].apply(add_prefixes_to_plain_id)
    # Sort
    df = df.sort_values(['xref_id', 'source_id'])
    if filter_sources:
        df = df[df['source'].isin(filter_sources)]
        del df['source']
    # Standardize to SSSOM
    if sssomify:
        df = df.rename(columns={
            'xref_id': 'subject_id',
            'pref_name': 'subject_label',
            'source_id': 'object_id',
        })
        df['predicate_id'] = 'skos:exactMatch'
        df = df[['subject_id', 'subject_label', 'predicate_id', 'object_id']]
    return df
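To make the loading and prefixing steps in `get_mapping_set()` concrete, here is a small self-contained sketch. The sample rows are made up, `add_prefix` is a hypothetical stand-in that mirrors the branching of `add_prefixes_to_plain_id()`, and passing `dtype=str` is an assumption added here so that all-digit UIDs stay strings; the PR code does not pass it.

```python
import io

import pandas as pd

# Made-up rows in the layout the loader expects: '|'-separated, with a trailing '|' on every row.
SAMPLE = (
    "#CUI|pref_name|source_id|source|\n"
    "C0000001|Example term A|HP:0000001|HPO|\n"
    "CN000002|Example term B|D000002|MeSH|\n"
    "123456|Example term C|HP:0000003|HPO|\n"
)


def add_prefix(x: str) -> str:
    """Mirror of the prefix branching: 'CN' must be tested before 'C'."""
    if x.startswith("CN"):
        return f"MEDGENCUI:{x}"
    if x.startswith("C"):
        return f"UMLS:{x}"
    return f"MEDGEN:{x}"


# dtype=str is an assumption of this sketch (keeps numeric-looking UIDs as text).
df = pd.read_csv(io.StringIO(SAMPLE), sep="|", dtype=str).rename(columns={"#CUI": "xref_id"})
# The trailing '|' produces one extra, entirely empty column; drop it.
df = df.drop(columns=[c for c in df.columns if df[c].isnull().all()])
df["xref_id"] = df["xref_id"].apply(add_prefix)
hpo_only = df[df["source"].isin(["HPO"])]
```

The order of the two `startswith` checks matters: every `CN...` ID also starts with `C`, so testing `C` first would label novel MedGen CUIs as UMLS.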
SSSOM Outputs
I ran a new build and created a release which includes these. Here are some sample rows.

Questions
1. Should MedGen-derived prefixes just be "UMLS", or one of ("UMLS", "MEDGENCUI", or "MEDGEN") if/where applicable?
   - UMLS: Composes 100% of the subjects in `hpo-umls.sssom.tsv` currently.
   - MEDGENCUI: Note that there were no MEDGENCUI<->HPO mappings in this set in `MedGenIDMappings.txt`. Also note that I did see your instruction in "Finalise MedGen xref table #15": "Remove all rows with MEDGENCUI in it (we only need xrefs to MEDGEN:123 and UMLS:123)" (please see & respond to my related question in that issue). However, I applied our previous logic discussed in "Refactoring > Prefix assignment", and verified afterwards that there was no MEDGENCUI.
   - MEDGEN: These are IDs that have neither CN nor C at the beginning; they are UIDs. I did not check to see if we could actually get HPO<->MEDGEN mappings. Do you want them?
2. I think I renamed `hpo-umls.sssom.tsv` incorrectly. Just realizing this now. I have the subjects as UMLS, but HPO comes first in the filename, so perhaps I should make the position consistent? Does it matter to you which is the subject and which is the object?
3. I could include `mapping_justification`, but I don't know their process, and I don't know if we can say whether or not there is variability in how they do their mappings.

hpo-umls.sssom.tsv

hpo-mesh.sssom.tsv
This one doesn't have `subject_label` or `object_label` because I couldn't guarantee that the label in `MedGenIDMappings.txt` was an accurate reflection of either source in 100% of cases. I imagine these labels are coming either from UMLS or from MedGen.

hpo-mesh-no-matches.sssom.tsv
hpo-mesh-no-matches.sssom.tsv.zip
Not included in the release. This is for analysis. I just wanted you to see the cases where there were no matches. There were only about 2,300 HPO<->MeSH mappings that could be derived out of the ~16,000 MeSH terms that have UMLS mappings. I included `umls_id` and `umls_label` in this set for review. It also includes the rows for all of the matches, not just the non-matches. I sorted the non-matches to the top.
For now yes. Maybe in the future (but not now) we will look for MedGen mappings as well, but for now, we only want HPO-UMLS mappings.
It's cosmetic; ask your own inner sense of style. I like the order, but won't insist on it if it takes a lot of time to fix.
Please use the sssom toolkit, as you want to release a valid SSSOM file. `mapping_justification` is mandatory. Since it's MedGen, I would tend to use `semapv:ManualMappingCuration`.
mapping_justification
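As a rough illustration of why `mapping_justification` had to be added before release, the sketch below checks a DataFrame for the columns that, to my understanding of the SSSOM specification, every mapping row must carry. `missing_required` is a hypothetical helper and is not a substitute for running `sssom validate`.

```python
import pandas as pd

# Assumption: these four slots are the ones SSSOM requires on every mapping row.
REQUIRED_COLUMNS = ("subject_id", "predicate_id", "object_id", "mapping_justification")


def missing_required(df: pd.DataFrame) -> list:
    """Return required columns that are absent or contain empty values (hypothetical helper)."""
    return [c for c in REQUIRED_COLUMNS if c not in df.columns or df[c].isna().any()]


# Made-up single mapping row without a justification.
df = pd.DataFrame({
    "subject_id": ["UMLS:C0000001"],
    "predicate_id": ["skos:exactMatch"],
    "object_id": ["HP:0000001"],
})
before = missing_required(df)
df["mapping_justification"] = "semapv:ManualMappingCuration"
after = missing_required(df)
```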
I edited your comment w/ checkboxes. Will do.
All done!