Bug: Synonym Sync: Duplicate rows #684

Open
joeflack4 opened this issue Nov 8, 2024 · 0 comments · May be fixed by #720
joeflack4 commented Nov 8, 2024

Overview

While examining Nico's Mondo PR for the confirmed cases template, I found duplicate rows entering the -confirmed template.

I do not know if this bug exists for the other ROBOT templates.

The bug does not have a negative consequence on processing; confirmed by checking the outputs for #8269.

The bug seems to manifest as multiple rows for the same synonym when there is a case difference between Mondo and the source. It does not happen every time.

Example case:

| mondo_id | mondo_label | synonym_scope | synonym | source_id | synonym_case_diff_mondo | synonym_case_diff_source |
| --- | --- | --- | --- | --- | --- | --- |
| MONDO:0000001 | disease | oio:hasExactSynonym | disease | NCIT:C2991 | | |
| MONDO:0000001 | disease | oio:hasExactSynonym | disease | NCIT:C2991 | disease | Disease |
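
One way to check whether other templates are affected is to group rows by the columns that identify a synonym and report any group with more than one row. The following is a minimal sketch, not the project's code: the function name `find_duplicate_keys` and the choice of `KEY_COLUMNS` as the row identity are assumptions; the sample rows are the two from the example case above.

```python
from collections import defaultdict

# Assumption: these four columns identify a row; the case-diff columns do not.
KEY_COLUMNS = ("mondo_id", "synonym_scope", "synonym", "source_id")


def find_duplicate_keys(rows):
    """Map each key that occurs in more than one row to the rows sharing it."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in KEY_COLUMNS)
        groups[key].append(row)
    return {key: group for key, group in groups.items() if len(group) > 1}


# The two rows from the example case in this issue.
rows = [
    {"mondo_id": "MONDO:0000001", "mondo_label": "disease",
     "synonym_scope": "oio:hasExactSynonym", "synonym": "disease",
     "source_id": "NCIT:C2991",
     "synonym_case_diff_mondo": "", "synonym_case_diff_source": ""},
    {"mondo_id": "MONDO:0000001", "mondo_label": "disease",
     "synonym_scope": "oio:hasExactSynonym", "synonym": "disease",
     "source_id": "NCIT:C2991",
     "synonym_case_diff_mondo": "disease", "synonym_case_diff_source": "Disease"},
]

duplicates = find_duplicate_keys(rows)
for key, group in duplicates.items():
    print(key, "appears in", len(group), "rows")
```

Running the same check against the other ROBOT templates would show whether the bug is limited to -confirmed.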
@joeflack4 joeflack4 self-assigned this Nov 8, 2024
@joeflack4 joeflack4 added the bug Something isn't working label Nov 8, 2024
@joeflack4 joeflack4 changed the title Bug: Synonym Sync: Duplicate rows (case diff) Bug: Synonym Sync: Duplicate rows Dec 6, 2024
joeflack4 added a commit that referenced this issue Dec 16, 2024
- Bug fix: Fixed a case in which a source can have multiple synonyms that vary only by capitalization, causing duplicate rows to appear in the results. We don't consider source capitalization authoritative, so these variations are only useful for analysis and should not show up as multiple rows to be processed. Thus, we now aggregate capitalization variations into the single column synonym_case_diff_source.
- Update: Added a warning in case there are multiple values for synonym_case_diff_mondo as well.
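
The aggregation described in this commit message could look roughly like the sketch below. It is an illustration under stated assumptions, not the committed code: the function name `aggregate_source_case`, the `|` delimiter, and the grouping key are all hypothetical. Rows that differ only by capitalization collapse into one row, source variations are joined into synonym_case_diff_source, and a warning is raised if synonym_case_diff_mondo ends up with more than one value.

```python
import warnings


def aggregate_source_case(rows, sep="|"):
    """Collapse rows that differ only by capitalization into a single row.

    Assumptions: rows are dicts keyed by column name, the grouping key is
    (mondo_id, synonym_scope, lowercased synonym, source_id), and `sep`
    joins multiple capitalization variations.
    """
    merged = {}
    for row in rows:
        key = (row["mondo_id"], row["synonym_scope"],
               row["synonym"].lower(), row["source_id"])
        if key not in merged:
            # Copy the first row seen and track variations in scratch lists.
            merged[key] = {**row, "_source_variants": [], "_mondo_variants": []}
        entry = merged[key]
        for col, bucket in (("synonym_case_diff_source", "_source_variants"),
                            ("synonym_case_diff_mondo", "_mondo_variants")):
            value = row.get(col, "")
            if value and value not in entry[bucket]:
                entry[bucket].append(value)

    result = []
    for entry in merged.values():
        source_variants = entry.pop("_source_variants")
        mondo_variants = entry.pop("_mondo_variants")
        if len(mondo_variants) > 1:
            warnings.warn(
                f"Multiple synonym_case_diff_mondo values for "
                f"{entry['mondo_id']}: {mondo_variants}"
            )
        entry["synonym_case_diff_source"] = sep.join(source_variants)
        entry["synonym_case_diff_mondo"] = sep.join(mondo_variants)
        result.append(entry)
    return result


# The two rows from the example case in this issue.
example_rows = [
    {"mondo_id": "MONDO:0000001", "mondo_label": "disease",
     "synonym_scope": "oio:hasExactSynonym", "synonym": "disease",
     "source_id": "NCIT:C2991",
     "synonym_case_diff_mondo": "", "synonym_case_diff_source": ""},
    {"mondo_id": "MONDO:0000001", "mondo_label": "disease",
     "synonym_scope": "oio:hasExactSynonym", "synonym": "disease",
     "source_id": "NCIT:C2991",
     "synonym_case_diff_mondo": "disease", "synonym_case_diff_source": "Disease"},
]

aggregated = aggregate_source_case(example_rows)
print(len(aggregated), aggregated[0]["synonym_case_diff_source"])
```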
@joeflack4 joeflack4 linked a pull request Dec 16, 2024 that will close this issue
9 tasks
joeflack4 added a commit that referenced this issue Dec 16, 2024
- Minor codestyle update: Removed accidentally added \ on a line from last commit.
joeflack4 added a commit that referenced this issue Dec 16, 2024
- Bug fix: Address issues related to mondo capitalization variation.
- Update: Ensure the following columns are now in the output and are together: synonym_case_mondo, synonym_case_diff_mondo, synonym_case_mondo_is_many, synonym_case_source, synonym_case_diff_source, synonym_case_source_is_many. Note that previously, we had removed synonym_case_mondo & synonym_case_source, opting instead for the 'diff' columns, because it was only valuable to show the original capitalizations if there was a difference between the two. But now that we can have multiple variations in capitalization on the same synonym, it is useful to see the original case by itself, as well as all the variations.
- Update: For synonym_case_source_is_many, ensure that all variations show up in synonym_case_source and synonym_case_diff_source columns. Note that when there are multiple capitalization variations at the source, we only need 1 row.
- Update: For all synonym_case_mondo_is_many, ensure that all variations show up in the synonym_case_diff_mondo column. But leave synonym_case_mondo as it is. We need to preserve the original case for that row, since unlike the source, we will retain multiple rows in the case that Mondo has multiple capitalization variations for a single synonym.
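
The asymmetry between the two sides in the updates above (one row when the source has several capitalizations, one row per capitalization when Mondo does) can be sketched as follows. This is a hedged illustration only: the function name `build_case_rows`, the `|` delimiter, and the rule for when the 'diff' columns are populated are assumptions, not taken from the commit.

```python
def build_case_rows(synonym, mondo_variants, source_variants, sep="|"):
    """Build output rows for one synonym from its capitalization variations.

    Assumptions: the diff columns are filled when the two sides differ or
    when that side has several variations; `sep` joins multiple values.
    """
    mondo_is_many = len(mondo_variants) > 1
    source_is_many = len(source_variants) > 1
    has_diff = set(mondo_variants) != set(source_variants)
    diff_mondo = sep.join(mondo_variants) if (has_diff or mondo_is_many) else ""
    diff_source = sep.join(source_variants) if (has_diff or source_is_many) else ""
    # One row per Mondo capitalization; source variations never add rows.
    return [
        {
            "synonym": synonym,
            "synonym_case_mondo": mondo_case,  # original case kept per row
            "synonym_case_diff_mondo": diff_mondo,
            "synonym_case_mondo_is_many": mondo_is_many,
            "synonym_case_source": sep.join(source_variants),
            "synonym_case_diff_source": diff_source,
            "synonym_case_source_is_many": source_is_many,
        }
        for mondo_case in mondo_variants
    ]


# Source has two capitalizations: still a single row.
source_many = build_case_rows("disease", ["disease"], ["disease", "Disease"])

# Mondo has two capitalizations (hypothetical input): one row per variation.
mondo_many = build_case_rows("disease", ["disease", "Disease"], ["disease"])

print(len(source_many), len(mondo_many))
```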
joeflack4 added a commit that referenced this issue Dec 16, 2024
- Delete: Some temporary, analytical code.
joeflack4 added a commit that referenced this issue Dec 17, 2024
- Bug fix: Sometimes the multi-value source synonym field would be erroneously set as 'synonym'.
joeflack4 added a commit that referenced this issue Dec 18, 2024
- Bug fix on last bug fix: Fixed a KeyError that occurs in the -added template.