Compare two records - allow dataframes to be registered #2493

RobinL · 2024-11-06T12:27:16Z

Here's a reprex showing that `compare_two_records` currently cannot be used when the input contains date-typed columns:

import datetime

import pandas as pd

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
from splink.exploratory import profile_columns

db_api = DuckDBAPI()
con = db_api._con

df = splink_datasets.fake_1000
df.loc[df.index[::2], "dob"] = None
profile_columns(df, db_api, column_expressions=["dob"])

sql = """
create table in_data
as
select
* exclude(dob), cast(dob as date) as dob
from df
"""
con.execute(sql)


settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=False),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    max_iterations=2,
    retain_intermediate_calculation_columns=True,
    retain_matching_columns=True,
)

linker = Linker("in_data", settings, db_api)

linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")], recall=0.7
)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
# Inspect the first two rows; the example records below are based on them
df.head(2).to_dict(orient="records")
r1 = {
    "unique_id": 0,
    "first_name": "Robert",
    "surname": "Alan",
    "dob": None,
    "city": None,
    "email": "[email protected]",
}

r2 = {
    "unique_id": 1,
    "first_name": "Robert",
    "surname": "Allen",
    "dob": datetime.date(1971, 5, 24),
    "city": None,
    "email": "[email protected]",
}


linker.inference.compare_two_records(r1, r2)

This PR allows calls like the following:


# The records can also be passed as DuckDBPyRelation objects
sql = """
select *
from in_data
limit 1
"""
r1 = con.sql(sql)

sql = """
select *
from in_data
limit 1 offset 1
"""
r2 = con.sql(sql)

linker.inference.compare_two_records(r1, r2)
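
Going by the PR title, a one-row pandas DataFrame should presumably be accepted in the same way. The following is an untested sketch of how the two records might be wrapped. The names `record_1`, `record_2`, `r1_df` and `r2_df` and the email values are placeholders, and the final call is left commented out because it assumes the trained `linker` from the reprex above.

```python
import datetime

import pandas as pd

# Hypothetical records mirroring the reprex; the email values are placeholders.
record_1 = {
    "unique_id": 0,
    "first_name": "Robert",
    "surname": "Alan",
    "dob": None,
    "city": None,
    "email": "a@example.com",
}
record_2 = {
    "unique_id": 1,
    "first_name": "Robert",
    "surname": "Allen",
    "dob": datetime.date(1971, 5, 24),
    "city": None,
    "email": "b@example.com",
}

# Wrap each record in a one-row DataFrame so it can be registered as a table.
r1_df = pd.DataFrame([record_1])
r2_df = pd.DataFrame([record_2])

# Assumption: with this PR applied and `linker` in scope, the frames can be
# passed directly instead of dicts:
# linker.inference.compare_two_records(r1_df, r2_df)
```

How the backend infers the type of the all-null `dob` column in the first frame is not checked here; casting in SQL, as in the DuckDBPyRelation example, keeps the date type explicit.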

See #2423 - a fix for the Spark issues is coming in #2426

allow any data to be registered

dbb4115

RobinL merged commit fd63f5b into master Nov 6, 2024
25 checks passed

RobinL deleted the find_two_records_accepts_any_registerable_data branch November 6, 2024 13:50
