
Refactor blocking to not need linker #2180

Merged: 59 commits from blocking_api_redesign_3 into splink4_dev on May 16, 2024

Conversation

RobinL (Member) commented May 14, 2024

Summary

This PR redesigns the user-facing API for the analysis of blocking rules.

  • It allows the analysis of blocking rules to be conducted without the user needing to have constructed a settings object.
  • It consolidates the five existing functions into three user-facing functions without reducing functionality.

Existing functions

At the moment we have the following functions for analysing blocking rules:

  1. linker.count_num_comparisons_from_blocking_rule
  2. linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions
  3. linker.cumulative_comparisons_from_blocking_rules_records
  4. linker.cumulative_num_comparisons_from_blocking_rules_chart
  5. linker.count_num_comparisons_from_blocking_rules_for_prediction

Functions 1 and 2 have the same objective, but they differ in the definition, speed and result of the computation; see here.

Functions 3, 4 and 5 all perform the same underlying analysis; they vary only in how the results are presented.

Proposal

Blocking rule analysis is consolidated into:

analyse_blocking.count_comparisons_from_blocking_rule
analyse_blocking.cumulative_comparisons_to_be_scored_from_blocking_rules_data
analyse_blocking.cumulative_comparisons_to_be_scored_from_blocking_rules_chart

and these functions are removed from the linker.

Example
import pandas as pd
from IPython.display import display

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on
from splink.analyse_blocking import (
    count_comparisons_from_blocking_rule,
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
    cumulative_comparisons_to_be_scored_from_blocking_rules_data,
)


# fmt: off
data = pd.DataFrame(
    [
        {"unique_id": "1", "source_dataset": "s", "first_name": "a", "surname": "x","postcode": ["1", "2"]},
        {"unique_id": "2", "source_dataset": "s", "first_name": "b", "surname": "y","postcode": ["2", "3"]},
        {"unique_id": "3", "source_dataset": "s", "first_name": "c", "surname": "z","postcode": ["3"]},
        {"unique_id": "4", "source_dataset": "s", "first_name": "d", "surname": "p","postcode": ["5"]},
        {"unique_id": "5", "source_dataset": "s", "first_name": "d", "surname": "p","postcode": ["6"]},
        {"unique_id": "6", "source_dataset": "s", "first_name": "e", "surname": "p","postcode": ["7"]},

    ]

)

# fmt: on
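# Blocking rules can be supplied as block_on(...) creators or as dicts
# that additionally specify arrays_to_explode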
blocking_rules_as_creators = [
    block_on("first_name"),
    block_on("substr(first_name,1,3)"),
    {
        "blocking_rule": "l.postcode = r.postcode",
        "arrays_to_explode": ["postcode"],
    },
    block_on("surname"),
]


# Count comparisons for a single rule: a fast pre-filter estimate, plus the
# slower post-filter count when compute_post_filter_count=True
result = count_comparisons_from_blocking_rule(
    table_or_tables=data,
    blocking_rule="l.first_name = r.first_name and len(l.surname) > 1",
    link_type="dedupe_only",
    db_api=DuckDBAPI(),
    compute_post_filter_count=True,
    unique_id_column_name="unique_id",
    max_rows_limit=3e9,
)
display(result)

# Cumulative, deduplicated counts of comparisons to be scored across a list of rules
result = cumulative_comparisons_to_be_scored_from_blocking_rules_data(
    table_or_tables=data,
    blocking_rule_creators=blocking_rules_as_creators,
    link_type="dedupe_only",
    db_api=DuckDBAPI(),
    unique_id_column_name="unique_id",
    source_dataset_column_name="source_dataset",
    max_rows_limit=3e9,
)
display(result)

# The same cumulative analysis, presented as a chart
result = cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=data,
    blocking_rule_creators=blocking_rules_as_creators,
    link_type="dedupe_only",
    db_api=DuckDBAPI(),
    unique_id_column_name="unique_id",
    source_dataset_column_name="source_dataset",
)
display(result)

# The same blocking rule definitions can be reused directly in a settings object
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=blocking_rules_as_creators,
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ArrayIntersectAtSizes("postcode"),
    ],
)

linker = Linker(data, settings=settings, database_api=DuckDBAPI())

# estimate_probability_two_random_records_match uses the same blocking machinery internally
linker.estimate_probability_two_random_records_match(
    deterministic_matching_rules=[
        block_on("first_name"),
        {
            "blocking_rule": "l.postcode = r.postcode",
            "arrays_to_explode": ["postcode"],
        },
        block_on("surname"),
    ],
    recall=0.7,
    max_rows_limit=3e9,
)
# Check downstream methods that rely on blocking still work
linker.deterministic_link().as_pandas_dataframe()
linker.predict().as_pandas_dataframe()

Discussion

There are two main complexities in analysing blocking rules:

  1. Interactions between blocking rules: When multiple blocking rules are applied, some may generate the same comparisons. Consequently, when analysing multiple blocking rules, we are usually interested in the cumulative number of comparisons after deduplication (see the sketch after this list).

  2. Different methods for counting comparisons: We have both a 'fast/estimate' method (pre-filter conditions) and a 'slow/precise' method (post-filter conditions) for counting the number of comparisons generated by a single blocking rule. Only the 'slow/precise' method can detect duplicates across blocking rules, and therefore only it can be used to compute cumulative comparisons.
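
To make point 1 concrete, here is a minimal sketch (plain pandas/itertools, for illustration only, not Splink's implementation) using the sample data from the example above:

import itertools

def pairs_agreeing_on(df, col):
    # All within-dataset pairs whose values match exactly on `col`
    return {
        frozenset((a.unique_id, b.unique_id))
        for a, b in itertools.combinations(df.itertuples(), 2)
        if getattr(a, col) == getattr(b, col)
    }

fn_pairs = pairs_agreeing_on(data, "first_name")  # one pair: (4, 5)
sn_pairs = pairs_agreeing_on(data, "surname")     # three pairs: (4, 5), (4, 6), (5, 6)

len(fn_pairs) + len(sn_pairs)  # 4: naively summing per-rule counts double-counts (4, 5)
len(fn_pairs | sn_pairs)       # 3: the cumulative figure counts each comparison once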

Fast/Estimate vs. Slow/Precise Methods

The 'fast/estimate' method is crucial for identifying blocking rules that generate an infeasible number of comparisons (e.g., block_on("first_name") could generate 1 trillion comparisons). Using the 'slow/precise' method to detect this would be impractical because the computation might never complete. Thus, it is generally desirable to test everything using the 'fast/estimate' method to ensure the user is not asking for a computation that will never finish.
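
For intuition on why the 'fast/estimate' method scales: for a simple equality rule, the pre-filter count can be derived from group sizes alone, with no pairs ever materialised. A sketch of the idea (not necessarily Splink's exact implementation):

# Estimated comparison count for block_on("first_name") on a dedupe_only job:
# a group of n rows sharing a value contributes n * (n - 1) / 2 pairs
group_sizes = data.groupby("first_name").size()
estimate = int((group_sizes * (group_sizes - 1) // 2).sum())  # 1 for the sample data
# The group-by stays cheap even when the implied pair count is in the trillions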

TODO/Check

  • consistency between count_comparisons_from_blocking_rule and the cumulative_ fns in their treatment of source_dataset
  • Add test that source_dataset_column_name works properly on cumulative_comparisons_to_be_scored_from_blocking_rules_data when e.g. you have a single input dataframe with a preexisting column name
  • Tidy up logic for making BlockingRule/BlockingRuleCreator uniform. Do type hinting
  • Move internal logic to splink.internals
  • Check both the pure function, and its use within 'estimate_probability_two_random_records_match', are consistent and correct in terms of the count of marginal rules, by comparing against Splink 3
  • Check it works with Spark and exploding blocking rules
  • Check it works with DuckDB irrespective of the type of input data (pandas df, duckdb tablename, etc.)
  • Type hint block_using_rules_sqls correctly and check arguments make sense
  • Check deterministic_link works correctly with two dataset link only
  • Check predict works correctly with two dataset link only
  • Check estimate_probability_two_random_records_match works
  • Find other uses of block_using_rules_sqls
  • Use ensure_is_iterable to allow estimate_probability_two_random_records_match and cumulative_comparisons_to_be_scored_from_blocking_rules_data to take a single blocking rule
  • count_comparisons_from_blocking_rule should report join condition

RobinL mentioned this pull request May 14, 2024
RobinL changed the title from 'refactor blocking to not need linker' to 'Refactor blocking to not need linker' May 16, 2024
RobinL merged commit 7d4a601 into splink4_dev May 16, 2024 (25 checks passed)
RobinL deleted the blocking_api_redesign_3 branch May 16, 2024 13:52