Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix: Build error due to breaking data model change #27

Merged
merged 1 commit into from
May 29, 2024

Conversation

joeflack4
Copy link
Contributor

@joeflack4 joeflack4 commented May 19, 2024

Changes

Bug fix: Field rename
- MedGen source made breaking data model change; a field was renamed. Updated code to match.

General
- Bug fixes: for 'make all', which we're not currently using. But the goal still exists and, if run, was bugging out because the dependency pipeline was broken. Certain necessary goals were commented out, and some references were incorrect.
- Update: Some files in the release/ dir were not actually release files, so they were moved to output/. Code in various files updated for this.
- Update: Added and removed some comments and todos.
- Update: Some goal prereqs to make it dependencies clearer
- Rename: goal 'sssom' to 'sssom-validate'
- Update: goal 'sssom' is now a different goal and there for convenience

@joeflack4 joeflack4 marked this pull request as draft May 19, 2024 19:54
@joeflack4 joeflack4 self-assigned this May 19, 2024
@joeflack4 joeflack4 added bug Something isn't working medgen labels May 19, 2024
@joeflack4 joeflack4 changed the title Bug fix: Build error(s) Bug fix: Build error due to breaking data model change May 19, 2024
@joeflack4 joeflack4 marked this pull request as ready for review May 19, 2024 20:39
@joeflack4 joeflack4 requested a review from twhetzel May 19, 2024 20:39
src/utils.py Outdated
df = pd.read_csv(inpath, sep='|').rename(columns={'#CUI': 'xref_id'})
df = pd.read_csv(inpath, sep='|').rename(columns={
'#CUI_or_CN_id': 'xref_id',
'#CUI': 'xref_id' # 2024/05/19: MedGen Renamed to "#CUI_or_CN_id". Leaving this in case they change back.
Copy link
Contributor Author

@joeflack4 joeflack4 May 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug fix

MedGen renamed a field and the last build failed, so I needed to update the code to match.

@twhetzel After fixing, I ran a new release off this branch and it passed, so we should be good to merge this if you approve.

Copy link

@twhetzel twhetzel May 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the change on Line 46 as the MedGen format or data model as you mention has changed.

nit: I would not have left Line 47 in and would prefer to review the new MedGen data format/model if it changes again in the future. However, you do have the reason for this nicely commented and so far the remaining code would be ok (albeit possibly redundant) if a CUI column were added back and only included CUIs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, you bring up a good point. What if they do add a CUI column and it's not the same as it was before; like it only includes CUIs and not CNs, as it was doing before? Well, then me leaving this line in here would create problems. So I just removed it now!

would prefer to review the new MedGen data format/model if it changes again in the future.

If you would like I could upload a copy of the mapping file for you to take a look at, but I didn't keep the old version from before the column name changed.

src/utils.py Outdated Show resolved Hide resolved
output/medgen.obographs.json: output/medgen-disease-extract.owl | output/
robot convert -i $< -o $@

output/medgen.sssom.tsv: output/medgen.obographs.json | output/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All other, superfluous changes

@twhetzel The essential change in this PR is basically a one-liner. But I wanna explain why there's a lot in the diff.

As I was working on the bug fix I was getting other errors, so I started to fix those as well. But then I realized that these errors are actually non-issues because they were happening when I was running make all locally. However, for this repo, the build runs off of make minimal, which has no errors now that I've fixed that bug. So, in any case, these fixes and optimizations I've made will be good if we ever go back to having a use for make all. I suppose I should have split those into a separate PR, but I'm just realizing that now, sorry. If you'd like I can still do that.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all the make goals still work as expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! As mentioned in my meeting, make minimal, which is for the build, works. The latest release that's currently up was created by running make minimal from this branch.

As I was explaining in the meeting, I also fixed parts of make all before remembering that I didn't need to do that. make all has been broken for a while and will continue to be broken until it ever might be needed. Though I think I really ought to do #28 (quick change) to make this clearer / not cause undue worry.

@joeflack4 joeflack4 force-pushed the build-errors branch 2 times, most recently from a508bc9 to a5fcbdb Compare May 19, 2024 21:19
Copy link

@twhetzel twhetzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look ok. Please do review the comments.

FYI - I am using nit as a small issue noticed in the code review.

.github/workflows/buid_and_release.yml Show resolved Hide resolved
@@ -54,8 +56,7 @@ def run(input_mappings: str = INPUT_MAPPINGS, input_sssom_config: str = INPUT_CO
# move the col removals below (umls) to above
# - add mapping_justification
df_hpo_mesh['mapping_justification'] = 'semapv:ManualMappingCuration'
write_sssom(df_hpo_mesh, input_sssom_config,
OUTPUT_FILE_HPO_MESH.replace('.sssom.tsv', '-non-matches-included.sssom.tsv'))
write_sssom(df_hpo_mesh, input_sssom_config, OUTPUT_FILE_HPO_MESH_WITH_NON_MATCHES)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the file hpo-mesh_non-matches-included.sssom.tsv still saved to the same location so that any downstream processes that need this file can still access it?

And was the change in the filename from ...mesh-non... (with a hyphen) to ...mesh_non... (with an underscore) intentional? It looks like this is accounted for in all places I can see, but want to confirm.

nit: There are a few levels of directories to visually navigate here with OUTPUT_FILE_HPO_MESH_WITH_NON_MATCHES so the readability is a bit difficult. Since write_sssom expects outpath: Union[Path, str] you may be able to to use Path in something like full_path = Path(base_dir) / sub_dir1 / sub_dir2 / filename. However, as long as the file is still saved in the same place or any downstream processes know the new location, it is ok to leave this as-is.

Copy link
Contributor Author

@joeflack4 joeflack4 May 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the file hpo-mesh_non-matches-included.sssom.tsv still saved to the same location so that any downstream processes that need this file can still access it?
It is saved at the same location, yes! I just decided to make a variable representing its path, as opposed to the hacky way I was creating the filename/path beforehand.

FYI though, it is not actually a release artefact, and it is also not used anywhere right now. If I remember correctly, Nico asked me to filter those non-matches out of the original release artefact hpo-mesh.sssom.tsv, and I think he gave me the option to save them if I needed. IIRC, I think I wanted to upload them to a google sheet as part of some evaluation process, I don't really remember. So if we want, we probably don't even need to save them, but it's not doing any harm either. The file isn't released and it isn't committed either.

And was the change in the filename from ...mesh-non... (with a hyphen) to ...mesh_non... (with an underscore) intentional? It looks like this is accounted for in all places I can see, but want to confirm.
Yeah, it was intentional! I figured using a _ would help separate out the two parts of this filename, the hpo-mesh, which is the primary identifier for the matching set, and non-matches-included, which is more of a qualifier.

Also I checked and I did indeed rename the path consistently across the codebase (only 2 locations).

There are a few levels of directories to visually navigate here with OUTPUT_FILE_HPO_MESH_WITH_NON_MATCHES so the readability is a bit difficult.

Maybe you mean how I have this dir output/ and I also have output/release/. Yeah, this is a new pattern I'm trying. I also have in some other repo I think a setup like io/input/ and io/output/.... I've been wanting to explore these minor tweaks and see which pattern I end up liking better.

something like full_path = Path(base_dir...

Yeah, I have also been inconsistent in how I'm doing these paths in various repos / places. I am trying to always do the type declaration Union[str, Path] though, I think that's good. But I should probably use Path instead of str whenever possible, too.

output/medgen.obographs.json: output/medgen-disease-extract.owl | output/
robot convert -i $< -o $@

output/medgen.sssom.tsv: output/medgen.obographs.json | output/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all the make goals still work as expected?

- MedGen source made breaking data model change; a field was renamed. Updated code to match.

General
- Bug fixes: for 'make all', which we're not currently using. But the goal still exists and, if run, was bugging out because the dependency pipeline was broken. Certain necessary goals were commented out, and some references were incorrect.
- Update: Some files in the release/ dir were not actually release files, so they were moved to output/. Code in various files updated for this.
- Update: Added and removed some comments and todos.
- Update: Some goal prereqs to make it dependencies clearer
- Rename: goal 'sssom' to 'sssom-validate'
- Update: goal 'sssom' is now a different goal and there for convenience
@joeflack4 joeflack4 merged commit 3d1779c into main May 29, 2024
@joeflack4 joeflack4 deleted the build-errors branch May 29, 2024 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working medgen
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants