
Refactor blocking to not need linker #2180

Merged: 59 commits from blocking_api_redesign_3 into splink4_dev on May 16, 2024

Conversation

RobinL (Member) commented May 14, 2024

Summary

This PR redesigns the user-facing API for the analysis of blocking rules.

  • It allows the analysis of blocking rules to be conducted without the user needing to have constructed a settings object.
  • It consolidates the five existing functions into three user-facing functions without reducing functionality.

Existing functions

At the moment we have the following functions for analysing blocking rules:

  1. linker.count_num_comparisons_from_blocking_rule
  2. linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions
  3. linker.cumulative_comparisons_from_blocking_rules_records
  4. linker.cumulative_num_comparisons_from_blocking_rules_chart
  5. linker.count_num_comparisons_from_blocking_rules_for_prediction

Functions 1 and 2 have the same objective, but they differ in the definition, speed and result of the computation; see here.

Functions 3, 4 and 5 all perform the same underlying analysis; they vary only in how the results are presented.

Proposal

Blocking rule analysis is consolidated into:

analyse_blocking.count_comparisons_from_blocking_rule
analyse_blocking.cumulative_comparisons_to_be_scored_from_blocking_rules_data
analyse_blocking.cumulative_comparisons_to_be_scored_from_blocking_rules_chart

and these functions are removed from the linker.

Example
import pandas as pd
from IPython.display import display

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on
from splink.analyse_blocking import (
    count_comparisons_from_blocking_rule,
    cumulative_comparisons_to_be_scored_from_blocking_rules_chart,
    cumulative_comparisons_to_be_scored_from_blocking_rules_data,
)


# fmt: off
data = pd.DataFrame(
    [
        {"unique_id": "1", "source_dataset": "s", "first_name": "a", "surname": "x","postcode": ["1", "2"]},
        {"unique_id": "2", "source_dataset": "s", "first_name": "b", "surname": "y","postcode": ["2", "3"]},
        {"unique_id": "3", "source_dataset": "s", "first_name": "c", "surname": "z","postcode": ["3"]},
        {"unique_id": "4", "source_dataset": "s", "first_name": "d", "surname": "p","postcode": ["5"]},
        {"unique_id": "5", "source_dataset": "s", "first_name": "d", "surname": "p","postcode": ["6"]},
        {"unique_id": "6", "source_dataset": "s", "first_name": "e", "surname": "p","postcode": ["7"]},

    ]

)

# fmt: on
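# Blocking rules can be supplied as block_on(...) creators or as dicts
# that additionally specify arrays_to_explode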
blocking_rules_as_creators = [
    block_on("first_name"),
    block_on("substr(first_name,1,3)"),
    {
        "blocking_rule": "l.postcode = r.postcode",
        "arrays_to_explode": ["postcode"],
    },
    block_on("surname"),
]


# Count comparisons for a single rule: a fast pre-filter estimate, plus the
# slower post-filter count when compute_post_filter_count=True
result = count_comparisons_from_blocking_rule(
    table_or_tables=data,
    blocking_rule="l.first_name = r.first_name and len(l.surname) > 1",
    link_type="dedupe_only",
    db_api=DuckDBAPI(),
    compute_post_filter_count=True,
    unique_id_column_name="unique_id",
    max_rows_limit=3e9,
)
display(result)

# Cumulative, deduplicated counts of comparisons to be scored across a list of rules
result = cumulative_comparisons_to_be_scored_from_blocking_rules_data(
    table_or_tables=data,
    blocking_rule_creators=blocking_rules_as_creators,
    link_type="dedupe_only",
    db_api=DuckDBAPI(),
    unique_id_column_name="unique_id",
    source_dataset_column_name="source_dataset",
    max_rows_limit=3e9,
)
display(result)

# The same cumulative analysis, presented as a chart
result = cumulative_comparisons_to_be_scored_from_blocking_rules_chart(
    table_or_tables=data,
    blocking_rule_creators=blocking_rules_as_creators,
    link_type="dedupe_only",
    db_api=DuckDBAPI(),
    unique_id_column_name="unique_id",
    source_dataset_column_name="source_dataset",
)
display(result)

# The same blocking rule definitions can be reused directly in a settings object
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=blocking_rules_as_creators,
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ArrayIntersectAtSizes("postcode"),
    ],
)

linker = Linker(data, settings=settings, database_api=DuckDBAPI())

# estimate_probability_two_random_records_match uses the same blocking machinery internally
linker.estimate_probability_two_random_records_match(
    deterministic_matching_rules=[
        block_on("first_name"),
        {
            "blocking_rule": "l.postcode = r.postcode",
            "arrays_to_explode": ["postcode"],
        },
        block_on("surname"),
    ],
    recall=0.7,
    max_rows_limit=3e9,
)
# Check downstream methods that rely on blocking still work
linker.deterministic_link().as_pandas_dataframe()
linker.predict().as_pandas_dataframe()

Discussion

There are two main complexities in analysing blocking rules:

  1. Interactions between blocking rules: When multiple blocking rules are applied, some may generate the same comparisons. Consequently, when analysing multiple blocking rules, we are usually interested in the cumulative number of comparisons after deduplication (see the sketch after this list).

  2. Different methods for counting comparisons: We have both a 'fast/estimate' method (pre-filter conditions) and a 'slow/precise' method (post-filter conditions) for counting the number of comparisons generated by a single blocking rule. Only the 'slow/precise' method can detect duplicates across blocking rules, and therefore only it can be used to compute cumulative comparisons.
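
To make point 1 concrete, here is a minimal sketch (plain pandas/itertools, for illustration only, not Splink's implementation) using the sample data from the example above:

import itertools

def pairs_agreeing_on(df, col):
    # All within-dataset pairs whose values match exactly on `col`
    return {
        frozenset((a.unique_id, b.unique_id))
        for a, b in itertools.combinations(df.itertuples(), 2)
        if getattr(a, col) == getattr(b, col)
    }

fn_pairs = pairs_agreeing_on(data, "first_name")  # one pair: (4, 5)
sn_pairs = pairs_agreeing_on(data, "surname")     # three pairs: (4, 5), (4, 6), (5, 6)

len(fn_pairs) + len(sn_pairs)  # 4: naively summing per-rule counts double-counts (4, 5)
len(fn_pairs | sn_pairs)       # 3: the cumulative figure counts each comparison once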

Fast/Estimate vs. Slow/Precise Methods

The 'fast/estimate' method is crucial for identifying blocking rules that generate an infeasible number of comparisons (e.g., block_on("first_name") could generate 1 trillion comparisons). Using the 'slow/precise' method to detect this would be impractical because the computation might never complete. Thus, it is generally desirable to test everything using the 'fast/estimate' method to ensure the user is not asking for a computation that will never finish.
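
For intuition on why the 'fast/estimate' method scales: for a simple equality rule, the pre-filter count can be derived from group sizes alone, with no pairs ever materialised. A sketch of the idea (not necessarily Splink's exact implementation):

# Estimated comparison count for block_on("first_name") on a dedupe_only job:
# a group of n rows sharing a value contributes n * (n - 1) / 2 pairs
group_sizes = data.groupby("first_name").size()
estimate = int((group_sizes * (group_sizes - 1) // 2).sum())  # 1 for the sample data
# The group-by stays cheap even when the implied pair count is in the trillions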

TODO/Check

  • consistency between count_comparisons_from_blocking_rule and the cumulative_ fns in their treatment of source_dataset
  • Add test that source_dataset_column_name works properly on cumulative_comparisons_to_be_scored_from_blocking_rules_data when e.g. you have a single input dataframe with a preexisting column name
  • Tidy up logic for making BlockingRule/BlockingRuleCreator uniform. Do type hinting
  • Move internal logic to splink.internals
  • Check both the pure function, and its use within 'estimate_probability_two_random_records_match', are consistent and correct in terms of the count of marginal rules, by comparing against Splink 3
  • Check it works with Spark and exploding blocking rules
  • Check it works with DuckDB irrespective of the type of input data (pandas df, duckdb tablename, etc.)
  • Type hint block_using_rules_sqls correctly and check arguments make sense
  • Check deterministic_link works correctly with two dataset link only
  • Check predict works correctly with two dataset link only
  • Check estimate_probability_two_random_records_match works
  • Find other uses of block_using_rules_sqls
  • Use ensure_is_iterable to allow estimate_probability_two_random_records_match and cumulative_comparisons_to_be_scored_from_blocking_rules_data to take a single blocking rule
  • count_comparisons_from_blocking_rule should report join condition

RobinL mentioned this pull request May 14, 2024
RobinL changed the title from 'refactor blocking to not need linker' to 'Refactor blocking to not need linker' May 16, 2024
RobinL merged commit 7d4a601 into splink4_dev May 16, 2024 (25 checks passed)
RobinL deleted the blocking_api_redesign_3 branch May 16, 2024 13:52