Remove flags from block_using_rules_sqls logic (_find_new_matches_mode and _compare_two_records_mode etc.)
#2129
Merged
RobinL changed the title from "(WIP) Remove flags from block_using_rules_sqls logic (_find_new_matches_mode and _compare_two_records_mode)" to "Remove flags from block_using_rules_sqls logic (_find_new_matches_mode and _compare_two_records_mode etc.)" on Apr 5, 2024
ADBond approved these changes on Apr 22, 2024
Love this! Feels like a big QoL improvement for following the logic
Summary
This PR removes the following flags from the codebase:
linker._analyse_blocking_mode
linker.__two_dataset_link_only
linker._find_new_matches_mode
linker._compare_two_records_mode
linker._train_u_using_random_sample_mode
linker._self_link_mode
linker._deterministic_link_mode
Instead, the required behaviour is achieved using function arguments, primarily to block_using_rules_sqls.
Details
At the moment, a variety of boolean flags are set on the linker that modify the behaviour of SQL-generating functions.
It's difficult to understand the consequences of these flags because rather than the calling function 'asking' for specific behaviour by explicitly setting function arguments, the behaviour occurs implicitly via the flag.
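To make the contrast concrete, here is a minimal sketch of the two styles. ToyLinker, block_sql_with_flag and block_sql_explicit are hypothetical names invented for this illustration and are not Splink's actual code; only the flag name _train_u_using_random_sample_mode is taken from this PR, and the SQL is deliberately simplified.

```python
from dataclasses import dataclass


@dataclass
class ToyLinker:
    """Hypothetical stand-in for a linker; not Splink's real class."""

    _train_u_using_random_sample_mode: bool = False


def block_sql_with_flag(linker: ToyLinker, blocking_rule: str) -> str:
    # Implicit style: the result depends on state that was set elsewhere,
    # so the caller cannot tell from the call site which join it will get.
    if linker._train_u_using_random_sample_mode:
        blocking_rule = "1=1"  # silently replaced with a cartesian join
    return f"select * from df as l inner join df as r on {blocking_rule}"


def block_sql_explicit(blocking_rule: str) -> str:
    # Explicit style: the caller asks for exactly the join it wants.
    return f"select * from df as l inner join df as r on {blocking_rule}"


linker = ToyLinker()
linker._train_u_using_random_sample_mode = True

# The rule passed in is ignored because of the flag.
implicit_sql = block_sql_with_flag(linker, "l.first_name = r.first_name")

# The same SQL, but the cartesian join is requested at the call site.
explicit_sql = block_sql_explicit("1=1")
```

Both calls produce the same SQL here, but only the second makes the cartesian join visible to someone reading the calling function.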
For example, in linker.estimate_u_using_random_sampling, two modifications are made to blocking (among them, estimate_u_using_random_sampling performs a cartesian join).
But this behaviour is achieved by setting:
training_linker._train_u_using_random_sample_mode = True
and then in blocking.block_using_rules_sqls we have code that runs conditional logic like if linker._train_u_using_random_sample_mode, e.g. here.
It would be much clearer if the estimate_u_using_random_sampling function explicitly took the samples, added the sqls to the pipeline, and then called blocking.block_using_rules_sqls, passing in the name of the sampled tables and a null (full cartesian join) blocking rule.
Another example is how table names are interpolated into block_using_rules_sqls, e.g. here:
splink/splink/blocking.py
Line 251 in 4402316
Which has lines like:
splink/splink/linker.py
Lines 334 to 345 in 4402316
and it is very confusing to work out what table name will be interpolated and why. Instead, within the linker.compare_two_records functions, it would be better to simply explicitly pass table names like __splink__compare_two_records_left_with_tf into block_using_rules_sqls.
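As a rough sketch of what explicitly passing table names could look like: the function below is a hypothetical simplification, not the real signature of block_using_rules_sqls. The parameter names, the selected columns, and the right-hand table name (assumed by analogy with the left-hand one mentioned above) are all assumptions made for illustration.

```python
def block_using_rules_sql_sketch(
    input_tablename_l: str,
    input_tablename_r: str,
    blocking_rule: str = "1=1",
) -> str:
    """Hypothetical simplification; the real block_using_rules_sqls differs.

    The caller names both input tables and the blocking rule, so nothing
    is inferred from flags on the linker. The default rule "1=1" stands in
    for a null (full cartesian join) blocking rule.
    """
    return (
        "select l.unique_id as unique_id_l, r.unique_id as unique_id_r\n"
        f"from {input_tablename_l} as l\n"
        f"inner join {input_tablename_r} as r\n"
        f"on {blocking_rule}"
    )


# A compare_two_records-style caller states exactly which tables are joined.
sql = block_using_rules_sql_sketch(
    input_tablename_l="__splink__compare_two_records_left_with_tf",
    input_tablename_r="__splink__compare_two_records_right_with_tf",  # assumed name
)
```

With this shape, the table names that end up in the SQL can be read directly from the call site rather than traced through properties on the linker.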