Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mapping set inversion #555

Merged
merged 2 commits into from
Oct 24, 2024

Conversation

gouttegd
Copy link
Contributor

This is a putative fix to #554.

It implements the fix suggested in this comment, and adds a test case to check that inverting a set containing “imbalanced” columns (e.g., with a subject_label column but without a object_label column) does not trigger the #554 bug.

When inverting a data frame, the reordering of columns post-inversion
should not rely on the columns that existed in the original data frame
(pre-inversion), because it is not guaranted that all columns in the
original data frame will exist in the inverted data frame: if the
original frame contained, e.g., a `subject_label` column but did not
contain a `object_label` column, then the inverted frame will contain an
`object_label` column but will _not_ contain a `subject_label` column.

Instead, we use the dedicated sort_df_rows_columns function to perform
the reordering. We sort the columns only, as there should be no need at
this stage to sort the rows.
We add a new test file containing an example of an "asymmetric" or
"imbalanced" set: that is, a set where not all "subject_*" columns have
an equivalent "object_*" column and vice versa.

The set is derived from the "basic.tsv" set by: (1) removing most of the
set metadata (not needed for the test); (2) removing most of the rows,
leaving just a handful (more than enough); (3) removing most of the
columns. In the columns we keep, we specifically keep "subject_label"
but NOT "object_label" (thereby creating the "asymmetry"), and
conversely we keep "object_source" but NOT "subject_source" (another
asymmetry).

We then add a new test case in tests/test_utils.py in which we try
inverting the asymmetric test (checking that it does not trigger the
issue mapping-commons#554) and check that the inverted object labels correspond to the
original subject labels.
self.assertEqual(len(inverted_df), len(msdf.df))
original_subject_labels = msdf.df["subject_label"].values
inverted_object_labels = inverted_df["object_label"].values
self.assertNotIn(False, original_subject_labels == inverted_object_labels)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have tried to use something like assertSequenceEqual for this but I zero priority as the current looks correct

@matentzn matentzn merged commit a0d215b into mapping-commons:master Oct 24, 2024
6 checks passed
@gouttegd gouttegd deleted the fix-mapping-set-inversion branch October 24, 2024 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants