Compare two records - allow dataframes to be registered #2493

Merged
merged 1 commit into from
Nov 6, 2024

RobinL
Member

@RobinL RobinL commented Nov 6, 2024

Here's a reprex showing that compare_two_records currently cannot be used when the records contain date types:

import datetime

import pandas as pd

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
from splink.exploratory import profile_columns

db_api = DuckDBAPI()
con = db_api._con

df = splink_datasets.fake_1000
df.loc[df.index[::2], "dob"] = None
profile_columns(df, db_api, column_expressions=["dob"])

sql = """
create table in_data
as
select
* exclude(dob), cast(dob as date) as dob
from df
"""
con.execute(sql)


settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=False),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
    ],
    max_iterations=2,
    retain_intermediate_calculation_columns=True,
    retain_matching_columns=True,
)

linker = Linker("in_data", settings, db_api)

linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")], recall=0.7
)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

linker.training.estimate_parameters_using_expectation_maximisation(block_on("dob"))
df.head(2).to_dict(orient="records")
r1 = {
    "unique_id": 0,
    "first_name": "Robert",
    "surname": "Alan",
    "dob": None,
    "city": None,
    "email": "[email protected]",
}

r2 = {
    "unique_id": 1,
    "first_name": "Robert",
    "surname": "Allen",
    "dob": datetime.date(1971, 5, 24),
    "city": None,
    "email": "[email protected]",
}


linker.inference.compare_two_records(r1, r2)

This PR allows the records to be passed as any data that can be registered, for example:


# What if the records are DuckDBPyRelation objects?
sql = """
select *
from in_data
limit 1
"""
r1 = con.sql(sql)

sql = """
select *
from in_data
limit 1 offset 1
"""
r2 = con.sql(sql)

linker.inference.compare_two_records(r1, r2)

See #2423 - a fix for the Spark issues is coming in #2426

@RobinL RobinL merged commit fd63f5b into master Nov 6, 2024
25 checks passed
@RobinL RobinL deleted the find_two_records_accepts_any_registerable_data branch November 6, 2024 13:50