Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove flags from block_using_rules_sqls logic ( _find_new_matches_mode and _compare_two_records_mode etc.) #2129

Merged
merged 19 commits into from
Apr 22, 2024

Conversation

RobinL
Copy link
Member

@RobinL RobinL commented Apr 3, 2024

Summary

This PR removes the following flags from the codebase:

  • linker._analyse_blocking_mode
  • linker.__two_dataset_link_only
  • linker._find_new_matches_mode
  • linker._compare_two_records_mode
  • linker._train_u_using_random_sample_mode
  • linker._self_link_mode
  • linker._deterministic_link_mode

Instead, the required behaviour is achieved using function arguments primarily to block_using_rules_sqls

Details

At the moment a variety of boolean flags are set on the linker that modify the behaviour of sql generating functions.

It's difficult to understand the consequences of these flags because rather than the calling function 'asking' for specific behaviour by explicitly setting function arguments, the behaviour occurs implicitly via the flag.

For example, in linker.estimate_u_using_random_sampling, two modifications are made to blocking:

  1. We need to take a sample of the input records
  2. We need to eliminate blocking rules (because estimate_u_using_random_sampling performs a cartesian join)

But this behaviour is achieved by setting:
training_linker._train_u_using_random_sample_mode = True

and then in blocking.block_using_rules_sqls we have code that runs conditional logic like if linker._train_u_using_random_sample_mode e.g. here.

It would be much clearer if the estimate_u_using_random_sampling function explicitly took the samples, added the sqls to the pipeline, and then called blocking.block_using_rules_sqls passing in the name of the sampled tables and a null (full cartesian join) blocking rule.

Another example is how table names are interpolated into block_using_rules_sqls

e.g. here:

from {linker._input_tablename_l} as l

Which has lines like:

splink/splink/linker.py

Lines 334 to 345 in 4402316

def _input_tablename_l(self):
if self._find_new_matches_mode:
return "__splink__df_concat_with_tf"
if self._self_link_mode:
return "__splink__df_concat_with_tf"
if self._compare_two_records_mode:
return "__splink__compare_two_records_left_with_tf"
if self._train_u_using_random_sample_mode:
if self._two_dataset_link_only:

and is very confusing to work out what table name will be interpolated and why. Instead, within the linker.compare_two_records functions, it would be better to simply explicitly pass tablenames like __splink__compare_two_records_left_with_tf into block_using_rules_sqls

@RobinL RobinL changed the base branch from master to splink4_dev April 3, 2024 14:01
@RobinL RobinL changed the title (WIP) Remove flags from block_using_rules_sqls logic ( _find_new_matches_mode and _compare_two_records_mode) Remove flags from block_using_rules_sqls logic ( _find_new_matches_mode and _compare_two_records_mode etc.) Apr 5, 2024
@RobinL RobinL requested a review from ADBond April 5, 2024 09:19
Copy link
Contributor

@ADBond ADBond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this! Feels like a big QoL improvement for following the logic

@RobinL RobinL merged commit 81f457f into splink4_dev Apr 22, 2024
15 checks passed
@RobinL RobinL deleted the remove_flags branch April 22, 2024 13:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants