HGNC robot template #113

Merged: 3 commits, Jun 26, 2024
2 changes: 1 addition & 1 deletion .github/workflows/buid_and_release.yml
@@ -40,4 +40,4 @@ jobs:
files: |
omim.owl
omim.sssom.tsv
-mondo_genes.csv
+mondo_genes.robot.tsv
2 changes: 1 addition & 1 deletion .gitignore
@@ -34,4 +34,4 @@ omim.json
mondo_exactmatch_omim.sssom.tsv
mondo_exactmatch_omimps.sssom.tsv
omim.owl
-mondo_genes.csv
+mondo_genes.robot.tsv
@joeflack4 (Contributor, Author), Jun 6, 2024:

Rename: `mondo_genes.csv` --> `mondo-omim-genes.robot.tsv`, throughout the code base.

17 changes: 12 additions & 5 deletions README.md
@@ -5,7 +5,7 @@ OMIM stands for "Online Mendelian Inheritance in Man", and is an online
catalog of human genes and genetic disorders. The official site is: https://omim.org/

The purpose of this repository is data transformation for ingest into Mondo. Mainly,
-it is for generating an `omim.ttl` file.
+it is for generating an `omim.ttl` and other release artefacts.

Disclaimer: This repository and its created data artefacts are unofficial. For
official, up-to-date OMIM data, please visit [omim.org](https://omim.org).
@@ -31,10 +31,10 @@ you get an error related to this when installing, ignore it, as it does not
seem to be needed to run any of the tools. If however you do get a `psutil` error
when running anything, please let us know by [creating an issue](https://github.com/monarch-initiative/omim/issues/new).

-## Running & creating `omim.ttl`
-Run: `make all`
+## Running & creating release
+Run: `sh run.sh make all`

-Running this will create a new `omim.ttl` file in the root directory.
+Running this will create new release artefacts in the root directory.

You can also run `make build` or `python -m omim2obo`. These are all the same
command. This will download files from omim.org and run the build.
@@ -44,8 +44,11 @@ If there's an issue downloading the files, or you are offline, or you just want
to use the cache anyway, you can pass the `--use-cache` flag.

## Additional tools
<details><summary>Details</summary>
<p>

### Get PMIDs used for OMIM codes from `omim.ttl`
-Command: `make get-pmids`
+Command: `sh run.sh make get-pmids`

### OMIM Code Web Scraper
Currently, the only feature is `get_codes_by_yyyy_mm`, which returns a list of
@@ -86,3 +89,7 @@ from omim2obo.omim_code_scraper import get_codes_by_yyyy_mm

code_tuples = get_codes_by_yyyy_mm('2021/05')
```


</p>
</details>
14 changes: 12 additions & 2 deletions makefile
@@ -2,7 +2,7 @@


# MAIN COMMANDS / GOALS ------------------------------------------------------------------------------------------------
-all: omim.ttl omim.sssom.tsv omim.owl mondo_genes.csv
+all: omim.ttl omim.sssom.tsv omim.owl mondo_genes.robot.tsv

# build: Create new omim.ttl
omim.ttl:
@@ -35,8 +35,18 @@ omim.owl: omim.ttl mondo_exactmatch_omim.sssom.owl mondo_exactmatch_omimps.sssom
query --update sparql/hgnc_links.ru \
convert -f ofn -o $@

-mondo_genes.csv: omim.owl
+mondo_genes.robot.tsv: omim.owl
Comment (Contributor, Author):

Update: Output in TSV now instead of CSV

  • ROBOT automatically does this based on the file extension
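
To illustrate the extension-based behavior described in this comment, here is a minimal Python sketch that picks a field delimiter from the output filename. It is not ROBOT's implementation; the function name `delimiter_for` and the `DELIMITERS` table are hypothetical and exist only for this example.

```python
from pathlib import Path

# Hypothetical mapping, for illustration only; ROBOT's own logic may differ.
DELIMITERS = {".tsv": "\t", ".csv": ","}


def delimiter_for(filename: str) -> str:
    """Return the field delimiter implied by the filename's last extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in DELIMITERS:
        raise ValueError(f"Unsupported extension: {suffix!r}")
    return DELIMITERS[suffix]


if __name__ == "__main__":
    for name in ("mondo_genes.csv", "mondo_genes.robot.tsv"):
        print(name, repr(delimiter_for(name)))
```

Only the last extension is inspected, so a name such as `mondo_genes.robot.tsv` is treated as tab-separated.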

Comment (Contributor, Author):

Now renamed to mondo-omim-genes.robot.tsv

@joeflack4 (Contributor, Author), Jun 11, 2024:

Rewrite implementation: awk --> pandas

Nico:

I am wary of the extreme use of awk, but as long as it is dockerized... I would advise caution on this and focus on building mondolib

Trish:

I am also wary of the extreme use of awk and would prefer to find another option.

Haha, this is funny, because I feel the same way. For some reason I thought you would prefer a shell script solution to pandas, but that was when I thought I only needed to do 2 manipulations; it turned out to be 4.

After I wrote that, I sent it to a friend who uses awk and sed heavily and whom I've been trying to get to use pandas. Not sure if you're familiar with this meme, lol:

[Image: meme]

It should be an easy rewrite into pandas, so I'll do that.

Comment (Contributor):

Thanks, I appreciate your efforts here, but I also think something more readable, and more easily portable to a common solution in mondolib eventually, will be helpful in the longer term :)

@joeflack4 (Contributor, Author), Jun 11, 2024:

Done! Please take a look at the new Python file and refactored make goal.

I also added column sorting. Forgot to do that before, and it's not entirely unimportant.

I re-ran the goal and the output is the same as what I've attached to the release, the only difference being the sorting. I'll update that file shortly.

RE: mondolib refactor: I'm sure there's some kind of ROBOT-template-fu that we could move over there, but I'm not sure yet what that would be. I write a lot of code that looks similar, but the ROBOT templates and the modifications I do to create them vary quite a bit.
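
As a rough picture of what moving these manipulations from awk to Python can look like, the sketch below inserts a constant column as the second-to-last column and then sorts the data rows. It is not the Python file added in this PR, and it uses plain lists instead of pandas so that it runs without dependencies. The function names and the sample header names are hypothetical, and reading "column sorting" as "sort the rows by their column values" is an assumption.

```python
from typing import List

Row = List[str]


def insert_second_to_last(rows: List[Row], header: str, value: str) -> List[Row]:
    """Insert a constant column so that it becomes the second-to-last column.

    The first row is treated as the header and receives `header`;
    every other row receives `value`.
    """
    out = []
    for i, row in enumerate(rows):
        cell = header if i == 0 else value
        out.append(row[:-1] + [cell] + row[-1:])
    return out


def sort_data_rows(rows: List[Row]) -> List[Row]:
    """Keep the header first and sort the remaining rows lexicographically."""
    return rows[:1] + sorted(rows[1:])


if __name__ == "__main__":
    # Made-up header names and values, for illustration only.
    table = [
        ["?mondo_id", "?hgnc_id", "?source"],
        ["MONDO:2", "HGNC:20", "OMIM:200"],
        ["MONDO:1", "HGNC:10", "OMIM:100"],
    ]
    table = sort_data_rows(insert_second_to_last(table, "?source_code", "MONDO:OMIM"))
    for row in table:
        print("\t".join(row))
```

The `"?source_code"` header and `"MONDO:OMIM"` value are the ones used by the awk step in the make goal; everything else in the sample table is invented.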

# Create a TSV of relational information for gene and disease classes
robot query -i omim.owl --query sparql/mondo_genes.sparql $@
# Insert the source_code column as the second to last column
awk 'BEGIN {FS=OFS="\t"} {if (NR==1) {$$(NF+1)=$$(NF); $$(NF-1)="?source_code";} else {$$(NF+1)=$$(NF); $$(NF-1)="MONDO:OMIM";}} 1' $@ > temp_file && mv temp_file $@
# Remove the first character, a question mark (?), from each field in the header. This is an artefact of the SPARQL query.
awk 'BEGIN {FS=OFS="\t"} NR==1 {for (i=1; i<=NF; i++) $$i=substr($$i, 2)} {print}' $@ > temp_file && mv temp_file $@
Comment (Contributor, Author):

Update: Remove the first character, a question mark (?), from each field in the header. This is an artefact of the SPARQL query.
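
The header cleanup described here, and the angle-bracket removal that follows it in the make goal, can be sketched in Python as below. This is an illustrative stand-in rather than the code in this PR; `clean_header` and `strip_angle_brackets` are hypothetical names, and the sample values are invented.

```python
from typing import Iterable, List

Row = List[str]


def clean_header(header: Row) -> Row:
    """Drop one leading '?' from each header field, if present."""
    return [name[1:] if name.startswith("?") else name for name in header]


def strip_angle_brackets(row: Row, columns: Iterable[int]) -> Row:
    """Remove a leading '<' and a trailing '>' from the given 0-based columns."""
    out = list(row)
    for i in columns:
        cell = out[i]
        if cell.startswith("<"):
            cell = cell[1:]
        if cell.endswith(">"):
            cell = cell[:-1]
        out[i] = cell
    return out


if __name__ == "__main__":
    print(clean_header(["?mondo_id", "?hgnc_id"]))
    print(strip_angle_brackets(["<http://example.org/A>", "x"], columns=[0]))
```

Unlike the awk `substr` call, which drops the first character of every header field unconditionally, this version only removes it when it is actually a question mark.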

# Remove < and > characters from specified columns
awk 'BEGIN {FS=OFS="\t"} NR>1 {gsub(/^<|>$$/, "", $$1); gsub(/^<|>$$/, "", $$2); gsub(/^<|>$$/, "", $$5)} {print}' $@ > temp_file && mv temp_file $@
# Insert ROBOT subheader
robot_subheader="ID\tSC 'has material basis in germline mutation in' some %\t>A oboInOwl:source\t>A oboInOwl:source\t" && \
sed 1a"$$robot_subheader" $@ > temp_file && mv temp_file $@
Comment (Contributor, Author):

Add: ROBOT subheader

@joeflack4 (Contributor, Author), Jun 6, 2024:

`hgnc_id`: `SC 'has material basis in germline mutation in' some %`

What Nico wrote in the issue was a placeholder for the actual expression. I looked through some examples we had of the pattern `SC '<PROPERTY>' some %` and found the correct string representation of the property, 'has material basis in germline mutation in'. I'm basing it on several different locations in Mondo where I saw this: '%s and ''has material basis in germline mutation in'' some %s'
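
For readers unfamiliar with ROBOT templates, the subheader is a second row that sits directly under the column names and tells ROBOT how to interpret each column. The sketch below inserts such a row in Python, mirroring the `sed 1a` step in the make goal. The template strings are copied from that goal; the function name `add_subheader` and the sample header are hypothetical.

```python
from typing import List

Row = List[str]

# Template strings copied from the make goal in this PR; the trailing empty
# string mirrors the trailing tab in the original subheader.
ROBOT_SUBHEADER: Row = [
    "ID",
    "SC 'has material basis in germline mutation in' some %",
    ">A oboInOwl:source",
    ">A oboInOwl:source",
    "",
]


def add_subheader(rows: List[Row], subheader: Row) -> List[Row]:
    """Return a copy of `rows` with `subheader` inserted right after the header."""
    if not rows:
        raise ValueError("Expected at least a header row")
    if len(subheader) != len(rows[0]):
        raise ValueError("Subheader and header must have the same number of columns")
    return [rows[0], subheader] + rows[1:]


if __name__ == "__main__":
    # Made-up column names, for illustration only.
    table = [
        ["col_a", "col_b", "col_c", "col_d", "col_e"],
        ["1", "2", "3", "4", "5"],
    ]
    for row in add_subheader(table, ROBOT_SUBHEADER):
        print("\t".join(row))
```

The length check is an addition of this sketch; the `sed` version inserts the line regardless of how many columns the file has.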


cleanup:
@rm -f omim.json
85 changes: 85 additions & 0 deletions run.sh
@@ -0,0 +1,85 @@
#!/bin/sh
# Wrapper script for docker.
#
# This is used primarily for wrapping the GNU Make workflow.
# Instead of typing "make TARGET", type "./run.sh make TARGET".
# This will run the make workflow within a docker container.
#
# The assumption is that you are working in the src/ontology folder;
# we therefore map the whole repo (../..) to a docker volume.
#
# To use singularity instead of docker, please issue
# export USE_SINGULARITY=<any-value>
# before running this script.
#
# See README-editors.md for more details.

if [ -f run.sh.conf ]; then
. ./run.sh.conf
fi

# Look for a GitHub token
if [ -n "$GH_TOKEN" ]; then
:
elif [ -f ../../.github/token.txt ]; then
GH_TOKEN=$(cat ../../.github/token.txt)
elif [ -f $XDG_CONFIG_HOME/ontology-development-kit/github/token ]; then
GH_TOKEN=$(cat $XDG_CONFIG_HOME/ontology-development-kit/github/token)
elif [ -f "$HOME/Library/Application Support/ontology-development-kit/github/token" ]; then
GH_TOKEN=$(cat "$HOME/Library/Application Support/ontology-development-kit/github/token")
fi

ODK_IMAGE=${ODK_IMAGE:-odkfull}
TAG_IN_IMAGE=$(echo $ODK_IMAGE | awk -F':' '{ print $2 }')
if [ -n "$TAG_IN_IMAGE" ]; then
# Override ODK_TAG env var if IMAGE already includes a tag
ODK_TAG=$TAG_IN_IMAGE
ODK_IMAGE=$(echo $ODK_IMAGE | awk -F':' '{ print $1 }')
fi
ODK_TAG=${ODK_TAG:-v1.4.3}
ODK_JAVA_OPTS=${ODK_JAVA_OPTS:--Xmx20G}
ODK_DEBUG=${ODK_DEBUG:-no}

# Convert OWLAPI_* environment variables to the OWLAPI as Java options
# See http://owlcs.github.io/owlapi/apidocs_4/org/semanticweb/owlapi/model/parameters/ConfigurationOptions.html
# for a list of allowed options
OWLAPI_OPTIONS_NAMESPACE=org.semanticweb.owlapi.model.parameters.ConfigurationOptions
for owlapi_var in $(env | sed -n s/^OWLAPI_//p) ; do
ODK_JAVA_OPTS="$ODK_JAVA_OPTS -D$OWLAPI_OPTIONS_NAMESPACE.${owlapi_var%=*}=${owlapi_var#*=}"
done

TIMECMD=
if [ x$ODK_DEBUG = xyes ]; then
# If you wish to change the format string, take care of using
# non-breaking spaces (U+00A0) instead of normal spaces, to
# prevent the shell from tokenizing the format string.
echo "Running ${ODK_IMAGE} with ${ODK_JAVA_OPTS} of memory for ROBOT and Java-based pipeline steps."
TIMECMD="/usr/bin/time -f ### DEBUG STATS ###\nElapsed time: %E\nPeak memory: %M kb"
fi

VOLUME_BIND=$PWD:/work
WORK_DIR=/work

if [ -n "$ODK_BINDS" ]; then
VOLUME_BIND="$VOLUME_BIND,$ODK_BINDS"
fi

if [ -n "$USE_SINGULARITY" ]; then

singularity exec --cleanenv $ODK_SINGULARITY_OPTIONS \
--env "ROBOT_JAVA_ARGS=$ODK_JAVA_OPTS,JAVA_OPTS=$ODK_JAVA_OPTS" \
--bind $VOLUME_BIND \
-W $WORK_DIR \
docker://obolibrary/$ODK_IMAGE:$ODK_TAG $TIMECMD "$@"
else
BIND_OPTIONS="-v $(echo $VOLUME_BIND | sed 's/,/ -v /')"
docker run $ODK_DOCKER_OPTIONS $BIND_OPTIONS -w $WORK_DIR \
-e ROBOT_JAVA_ARGS="$ODK_JAVA_OPTS" -e JAVA_OPTS="$ODK_JAVA_OPTS" \
--rm -ti obolibrary/$ODK_IMAGE:$ODK_TAG $TIMECMD "$@"
fi

case "$@" in
*update_repo*|*release*)
echo "Please remember to update your ODK image from time to time: https://oboacademy.github.io/obook/howto/odk-update/."
;;
esac
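
Two pieces of the `run.sh` logic above are easier to follow when written out as plain functions: resolving the image name and tag, and turning `OWLAPI_*` environment variables into `-D` Java options. The Python sketch below mirrors both under simplifying assumptions (it takes the environment as a dictionary and does not reproduce the shell's word splitting). The function names and the `EXAMPLE_OPTION` variable are hypothetical; the defaults `odkfull`, `v1.4.3`, and `-Xmx20G` are the ones in the script.

```python
from typing import Dict, Tuple

NAMESPACE = "org.semanticweb.owlapi.model.parameters.ConfigurationOptions"


def resolve_image(image: str = "odkfull", env_tag: str = "",
                  default_tag: str = "v1.4.3") -> Tuple[str, str]:
    """Return (name, tag); a tag embedded in `image` wins over `env_tag`."""
    parts = image.split(":")
    name = parts[0]
    tag_in_image = parts[1] if len(parts) > 1 else ""
    return name, tag_in_image or env_tag or default_tag


def owlapi_java_opts(env: Dict[str, str], base_opts: str = "-Xmx20G") -> str:
    """Append one -D option per OWLAPI_* variable found in `env`."""
    opts = base_opts
    for key, value in env.items():
        if key.startswith("OWLAPI_"):
            option = key[len("OWLAPI_"):]
            opts += f" -D{NAMESPACE}.{option}={value}"
    return opts


if __name__ == "__main__":
    print(resolve_image("odkfull:v1.5"))
    # EXAMPLE_OPTION is a made-up name, not a real OWLAPI setting.
    print(owlapi_java_opts({"OWLAPI_EXAMPLE_OPTION": "1", "HOME": "/tmp"}))
```

As in the script, a tag written into the image name overrides a separately supplied tag, and variables without the `OWLAPI_` prefix are ignored.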