HGNC robot template #113

Merged
merged 3 commits into from
Jun 26, 2024

Contributor

@joeflack4 joeflack4 commented Jun 6, 2024

Addresses sub-tasks in:

Related:

Overview

Update mondo_genes.csv to be a proper ROBOT template: mondo-omim-genes.robot.tsv

Changes

HGNC ROBOT template

  • Rename: mondo_genes.csv --> mondo-omim-genes.robot.tsv
  • Update: Change from CSV to TSV
  • Update: Set a ROBOT sub-header
  • Update: remove < > around URIs
  • Update: remove ?'s at start of col names
  • Update: insert source_code col, w/ values: MONDO:OMIM

General:

  • Add: run.sh: For running ODK. And updated README.md w/ docs about that.
  • Update: README.md: Put some less important stuff in <details>

CC: @souzadevinicius Thought this would be a good one for you to review

@joeflack4 joeflack4 marked this pull request as draft June 6, 2024 20:47
@joeflack4 joeflack4 self-assigned this Jun 6, 2024
@joeflack4 joeflack4 added the enhancement and omim labels Jun 6, 2024
@joeflack4 joeflack4 marked this pull request as ready for review June 6, 2024 22:22
.gitignore Outdated
@@ -34,4 +34,4 @@ omim.json
mondo_exactmatch_omim.sssom.tsv
mondo_exactmatch_omimps.sssom.tsv
omim.owl
-mondo_genes.csv
+mondo_genes.robot.tsv
Contributor Author

@joeflack4 joeflack4 Jun 6, 2024

Rename: mondo_genes.csv --> mondo-omim-genes.robot.tsv

Throughout the code base.

makefile Outdated
@@ -35,8 +35,18 @@ omim.owl: omim.ttl mondo_exactmatch_omim.sssom.owl mondo_exactmatch_omimps.sssom
query --update sparql/hgnc_links.ru \
convert -f ofn -o $@

-mondo_genes.csv: omim.owl
+mondo_genes.robot.tsv: omim.owl
Contributor Author

Update: Output in TSV now instead of CSV

  • ROBOT automatically does this based on the file extension

Contributor Author

Now renamed to mondo-omim-genes.robot.tsv

makefile Outdated
# Insert the source_code column as the second to last column
awk 'BEGIN {FS=OFS="\t"} {if (NR==1) {$$(NF+1)=$$(NF); $$(NF-1)="?source_code";} else {$$(NF+1)=$$(NF); $$(NF-1)="MONDO:OMIM";}} 1' $@ > temp_file && mv temp_file $@
# Remove the first character of each field in the header
awk 'BEGIN {FS=OFS="\t"} NR==1 {for (i=1; i<=NF; i++) $$i=substr($$i, 2)} {print}' $@ > temp_file && mv temp_file $@
Contributor Author

Update: Remove the first character, a question mark (?), from each field in the header. This is an artefact of the SPARQL query.
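
A minimal sketch of the same header cleanup in Python, using only the standard library. The function name and the sample data are hypothetical; only `hgnc_id` is a column name that appears in this PR. Unlike the awk one-liner, which drops the first character of every header field unconditionally, this version only removes a leading `?`.

```python
import csv
import io


def strip_sparql_var_prefix(header):
    """Drop one leading '?' from each column name; other names pass through."""
    return [name[1:] if name.startswith("?") else name for name in header]


# Placeholder query output: SPARQL SELECT variables come back as '?name'.
raw_tsv = "?mondo_id\t?hgnc_id\nMONDO:0000001\tHGNC:1\n"
rows = list(csv.reader(io.StringIO(raw_tsv), delimiter="\t"))
rows[0] = strip_sparql_var_prefix(rows[0])
print(rows[0])
```

Checking for the `?` explicitly keeps the step safe to re-run on a file that has already been cleaned.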

makefile Outdated
awk 'BEGIN {FS=OFS="\t"} NR>1 {gsub(/^<|>$$/, "", $$1); gsub(/^<|>$$/, "", $$2); gsub(/^<|>$$/, "", $$5)} {print}' $@ > temp_file && mv temp_file $@
# Insert ROBOT subheader
robot_subheader="ID\tSC 'has material basis in germline mutation in' some %\t>A oboInOwl:source\t>A oboInOwl:source\t" && \
sed 1a"$$robot_subheader" $@ > temp_file && mv temp_file $@
Contributor Author

Add: ROBOT subheader
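
For illustration, the subheader step can be sketched in Python as inserting one extra row directly under the header. The helper name, the column names other than `hgnc_id`, and the data row are assumptions; the four template strings are the non-empty cells from the diff above.

```python
def insert_robot_subheader(rows, subheader):
    """Return a new row list with the ROBOT template row right after the header."""
    if not rows:
        raise ValueError("expected at least a header row")
    if len(subheader) != len(rows[0]):
        raise ValueError("subheader must have one cell per column")
    return [rows[0], subheader, *rows[1:]]


# Placeholder table: column names (except hgnc_id) and values are made up.
rows = [
    ["mondo_id", "hgnc_id", "source_a", "source_b"],
    ["MONDO:0000001", "HGNC:1", "OMIM:100000", "OMIM:100001"],
]
subheader = [
    "ID",
    "SC 'has material basis in germline mutation in' some %",
    ">A oboInOwl:source",
    ">A oboInOwl:source",
]
template = insert_robot_subheader(rows, subheader)
```

The length check is a design choice of this sketch: a subheader with the wrong number of cells would silently shift template strings onto the wrong columns.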

@joeflack4 joeflack4 requested review from matentzn and twhetzel and removed request for matentzn June 6, 2024 23:08
makefile Outdated
awk 'BEGIN {FS=OFS="\t"} NR>1 {gsub(/^<|>$$/, "", $$1); gsub(/^<|>$$/, "", $$2); gsub(/^<|>$$/, "", $$5)} {print}' $@ > temp_file && mv temp_file $@
# Insert ROBOT subheader
robot_subheader="ID\tSC 'has material basis in germline mutation in' some %\t>A oboInOwl:source\t>A oboInOwl:source\t" && \
sed 1a"$$robot_subheader" $@ > temp_file && mv temp_file $@
Contributor Author

@joeflack4 joeflack4 Jun 6, 2024

hgnc_id : SC 'has material basis in germline mutation in' some %

What Nico wrote in the issue was a placeholder for the actual thing. I went and looked through some examples we had of this pattern SC '<PROPERTY>' some %, and also found the correct string representation 'has material basis in germline mutation in'. I'm basing it off of several different locations in mondo where I saw this: '%s and ''has material basis in germline mutation in'' some %s'
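
To make the `%` placeholder concrete, here is a simplified sketch of how a template string is filled in for one row. It only mimics the substitution; ROBOT itself also parses the resulting expression and resolves labels and CURIEs. The function name and the gene ID are hypothetical.

```python
def expand_template_cell(template, cell_value):
    """Substitute a cell value for the '%' placeholder in a template string."""
    return template.replace("%", cell_value)


TEMPLATE = "SC 'has material basis in germline mutation in' some %"

# 'SC' marks a subclass axiom; the rest is the class expression for this row.
kind, _, expression = expand_template_cell(TEMPLATE, "HGNC:1").partition(" ")
print(kind)
print(expression)
```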

- Rename: mondo_genes.csv --> mondo_genes.robot.tsv
- Update: Change from CSV to TSV
- Update: Set a ROBOT sub-header
- Update: remove < > around URIs
- Update: remove ?'s at start of col names
- Update: insert source_code col, w/ values: MONDO:OMIM

General:
- Add: run.sh: For running ODK. And updated README.md w/ docs about that.
- Update: README.md: Put some less important stuff in <details>

Member

@matentzn matentzn left a comment

I am wary of the extreme use of awk, but as long as it is dockerized.. I would advise caution on this and focus on building mondolib

@twhetzel
Contributor

I am also wary of the extreme use of awk and would prefer to find another option.

makefile Outdated
@@ -35,8 +35,18 @@ omim.owl: omim.ttl mondo_exactmatch_omim.sssom.owl mondo_exactmatch_omimps.sssom
query --update sparql/hgnc_links.ru \
convert -f ofn -o $@

-mondo_genes.csv: omim.owl
+mondo_genes.robot.tsv: omim.owl
Contributor Author

@joeflack4 joeflack4 Jun 11, 2024

Rewrite implementation: awk --> pandas

Nico:

I am wary of the extreme use of awk, but as long as it is dockerized.. I would advise caution on this and focus on building mondolib

Trish:

I am also wary of the extreme use of awk and would prefer to find another option.

Haha, this is funny, because I feel the same way. I thought for some reason you guys would probably prefer a ShellScript solution to pandas, but that was also when I thought I only needed to do 2 manipulations, but it turned out to be 4.

After I wrote that, I sent this to my friend who heavily uses awk and sed, who I've been trying to get to use pandas. Not sure if you guys are familiar with this meme, lol:

[Image: meme (8t69kz)]

It should be an easy rewrite into pandas, so I'll do that.

Contributor

Thanks, I appreciate your efforts here, but also think something more readable and more easily portable to a common solution in mondolib eventually will be helpful longer term :)

Contributor Author

@joeflack4 joeflack4 Jun 11, 2024

Done! Please take a look at the new Python file and refactored make goal.

I also added column sorting. Forgot to do that before, and it's not entirely unimportant.

I re-ran the goal and the output is the same as what I've attached to the release, the only difference being the sorting. I'll update that file shortly.

RE: mondolib refactor: I'm sure there's some kind of ROBOT-template-fu that we could move over there, but I'm not sure yet what that would be. I write a lot of code that looks similar, but the ROBOT templates and the modifications I do to create them vary quite a bit.
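
One detail worth illustrating is sorting a ROBOT template without displacing its two header rows. The thread does not say whether "column sorting" means reordering columns or ordering rows by their column values; the sketch below assumes the latter, uses plain Python lists rather than pandas, and all names and values in it are hypothetical.

```python
def sort_template_rows(rows, n_header_rows=2):
    """Sort data rows by their values, left to right, keeping header rows on top."""
    head, body = rows[:n_header_rows], rows[n_header_rows:]
    return head + sorted(body)


# Placeholder template: header, ROBOT subheader, then unsorted data rows.
table = [
    ["mondo_id", "hgnc_id"],
    ["ID", "SC 'has material basis in germline mutation in' some %"],
    ["MONDO:0000002", "HGNC:2"],
    ["MONDO:0000001", "HGNC:1"],
]
sorted_table = sort_template_rows(table)
```

A stable row order keeps diffs between successive releases of the file small, which fits the comment above that the regenerated output differed only in sorting.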

@joeflack4 joeflack4 force-pushed the hgnc-template branch 2 times, most recently from 5c18c12 to 264181a Compare June 11, 2024 22:47
@joeflack4 joeflack4 added the hgnc label Jun 13, 2024
- Update: Refactor method to do this from ShellScript / awk to Python / pandas.
- Update: Now sorts columns

General
- Update: .gitignore: Simplified ignores for files at root.
- Add: Utility function to handle < > around URIs
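
A utility of the kind the last item describes might look roughly like this. It is a sketch, not the function added in this PR: the name and the example URI are assumptions. It removes the brackets only when both are present, whereas the earlier awk version removed a leading `<` and a trailing `>` independently.

```python
def strip_angle_brackets(value):
    """Remove one enclosing '<...>' pair from a URI string, if present."""
    if len(value) >= 2 and value.startswith("<") and value.endswith(">"):
        return value[1:-1]
    return value


# Placeholder URI in the bracketed form that SPARQL result files often use.
print(strip_angle_brackets("<http://example.org/gene/1>"))
```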

Contributor

@twhetzel twhetzel left a comment

Fine other than the open question about the source column, which will be sorted Monday.

- Delete: source_code column (w/ values: MONDO:OMIM)
- Bug fix: No longer adding exact match gene annotations if >1 gene associated with MIM.
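
The bug fix above can be pictured as a filter over (MIM, gene) pairs that keeps a pair only when its MIM is linked to exactly one distinct gene. This is a sketch under that reading of the commit message; the function name, the data shape, and the IDs are hypothetical and not taken from the PR's code.

```python
from collections import Counter


def drop_multi_gene_mims(pairs):
    """Keep (mim, gene) pairs only for MIMs associated with exactly one gene."""
    # Count distinct genes per MIM, so repeated identical pairs are not penalized.
    genes_per_mim = Counter(mim for mim, _ in set(pairs))
    return [(mim, gene) for mim, gene in pairs if genes_per_mim[mim] == 1]


# Placeholder associations: the second MIM has two genes and is dropped entirely.
pairs = [
    ("OMIM:100000", "HGNC:1"),
    ("OMIM:100001", "HGNC:2"),
    ("OMIM:100001", "HGNC:3"),
]
kept = drop_multi_gene_mims(pairs)
```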
@joeflack4 joeflack4 merged commit 89d2517 into main Jun 26, 2024
@joeflack4 joeflack4 deleted the hgnc-template branch June 26, 2024 03:34
Labels: enhancement, hgnc, omim
3 participants