Refactor blocking to not need linker #2180
Merged
RobinL changed the title from "refactor blocking to not need linker" to "Refactor blocking to not need linker" on May 16, 2024
Summary
This PR redesigns the user-facing API for the analysis of blocking rules.
Existing functions
At the moment we have the following functions for analysing blocking rules:
1. linker.count_num_comparisons_from_blocking_rule
2. linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions
3. linker.cumulative_comparisons_from_blocking_rules_records
4. linker.cumulative_num_comparisons_from_blocking_rules_chart
5. linker.count_num_comparisons_from_blocking_rules_for_prediction
Functions 1 and 2 have the same objective, but they differ in the definition and speed of the computation and in the result, see here.
Functions 3, 4 and 5 all perform the same underlying analysis; they vary only in the presentation.
Proposal
Blocking analysis consolidated into:
And remove these from the linker
example
Discussion
There are two main complexities in analyzing blocking rules:
Interactions between blocking rules: When multiple blocking rules are applied, some may generate the same comparisons. Consequently, when analyzing multiple blocking rules, we are usually interested in the cumulative number of comparisons after deduplication.
Different methods for counting comparisons: We have both a 'fast/estimate' method (pre filter conditions) and a 'slow/precise' method (post filter conditions) for counting the number of comparisons generated by a single blocking rule. Only the 'slow/precise' method can detect duplicates across blocking rules and therefore can be used to compute cumulative comparisons.
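The two complexities above can be illustrated with a minimal, pure-Python sketch. This is not Splink's implementation: the `Rule` type, the helper names, and the toy records are all hypothetical, and the code assumes a single dataset being deduplicated. The 'fast/estimate' count looks only at block sizes for the equality key (pre filter conditions), while the 'slow/precise' count materialises the pairs and applies the filter (post filter conditions), which is what makes deduplication across rules possible.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, NamedTuple, Optional


class Rule(NamedTuple):
    """Hypothetical blocking rule: an equality key plus an optional filter."""
    key: str
    keep: Optional[Callable[[dict, dict], bool]] = None


# Toy records; field names are illustrative only.
records = [
    {"id": 1, "first_name": "ann", "surname": "lee", "birth_year": 1990},
    {"id": 2, "first_name": "ann", "surname": "lee", "birth_year": 1990},
    {"id": 3, "first_name": "ann", "surname": "ray", "birth_year": 1975},
    {"id": 4, "first_name": "bob", "surname": "ray", "birth_year": 1975},
]


def estimate_pairs(records, rule):
    """Fast/estimate: sum n*(n-1)/2 over block sizes, ignoring the filter."""
    sizes = Counter(r[rule.key] for r in records)
    return sum(n * (n - 1) // 2 for n in sizes.values())


def exact_pairs(records, rule):
    """Slow/precise: enumerate the pairs and apply the filter condition."""
    return {
        (a["id"], b["id"])
        for a, b in combinations(records, 2)
        if a[rule.key] == b[rule.key]
        and (rule.keep is None or rule.keep(a, b))
    }


def cumulative_pairs(records, rules):
    """Cumulative comparisons after removing pairs found by earlier rules."""
    seen = set()
    rows = []
    for rule in rules:
        new = exact_pairs(records, rule) - seen
        seen |= new
        rows.append(
            {"rule": rule.key, "new_pairs": len(new), "cumulative": len(seen)}
        )
    return rows


rule_a = Rule(
    "first_name",
    keep=lambda a, b: abs(a["birth_year"] - b["birth_year"]) <= 1,
)
rule_b = Rule("surname")

print(estimate_pairs(records, rule_a))    # upper bound, pre filter
print(len(exact_pairs(records, rule_a)))  # precise, post filter
print(cumulative_pairs(records, [rule_a, rule_b]))
```

On this toy data the estimate for the first rule is 3 pairs while the precise count is 1, and the second rule contributes only 1 new pair because the pair (1, 2) was already generated by the first rule. The estimate never enumerates pairs, so it stays cheap even when the precise computation would be infeasible.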
Fast/Estimate vs. Slow/Precise Methods
The 'fast/estimate' method is crucial for identifying blocking rules that generate an infeasible number of comparisons (e.g., block_on("first_name") could generate 1 trillion comparisons). Using the 'slow/precise' method to detect this would be impractical because the computation might never complete. Thus, it is generally desirable to test everything using the 'fast/estimate' method to ensure the user is not asking for a computation that will never finish.
TODO/Check
count_comparisons_from_blocking_rule and cumulative_ functions in their treatment of source_dataset
cumulative_comparisons_to_be_scored_from_blocking_rules_data when e.g. you have a single input dataframe with a preexisting column name
block_using_rules_sqls correctly and check arguments make sense
deterministic_link works correctly with two dataset link only
predict works correctly with two dataset link only
estimate_probability_two_random_records_match works
block_using_rules_sqls ensure_is_iterable to allow estimate_probability_two_random_records_match to take a single blocking rule and cumulative_comparisons_to_be_scored_from_blocking_rules_data
count_comparisons_from_blocking_rule should report join condition