Cluster without linker #2412

Open · wants to merge 9 commits into `master`
Conversation

**RobinL** (Member) commented Sep 18, 2024:

We've heard from several people who want to cluster without a linker. For instance, you may be combining predictions from multiple models and want to cluster the combined results (e.g. #2358).

This PR allows the clustering algorithm to be used without needing a linker, similar to exploratory analysis.
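Conceptually, clustering at a threshold is connected components over the edges whose `match_probability` clears the threshold. A minimal plain-Python sketch of that idea (illustration only, not Splink's implementation):

```python
# Connected components via union-find: nodes joined by any edge whose
# match_probability >= threshold end up in the same cluster.

def cluster_edges(node_ids, edges, threshold):
    # Each node starts as its own cluster root
    parent = {n: n for n in node_ids}

    def find(x):
        # Follow parent pointers to the root, compressing the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the endpoints of every edge that clears the threshold
    for left, right, prob in edges:
        if prob >= threshold:
            parent[find(left)] = find(right)

    # Group nodes by their root to form the clusters
    clusters = {}
    for n in node_ids:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())


result = cluster_edges(
    [1, 2, 3, 4, 5, 6],
    [(1, 2, 0.8), (3, 2, 0.9), (4, 5, 0.99)],
    threshold=0.5,
)
print(result)  # clusters {1, 2, 3}, {4, 5}, {6} (order may vary)
```

This mirrors the first example below: nodes 1–3 are connected through node 2, nodes 4 and 5 are linked directly, and node 6 has no qualifying edge so it forms a singleton cluster.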

### Example without linker

```python
from splink import DuckDBAPI
from splink.internals.clustering import cluster_pairwise_predictions_at_threshold

db_api = DuckDBAPI()

nodes = [
    {"my_id": 1},
    {"my_id": 2},
    {"my_id": 3},
    {"my_id": 4},
    {"my_id": 5},
    {"my_id": 6},
]

edges = [
    {"n_1": 1, "n_2": 2, "match_probability": 0.8},
    {"n_1": 3, "n_2": 2, "match_probability": 0.9},
    {"n_1": 4, "n_2": 5, "match_probability": 0.99},
]

cluster_pairwise_predictions_at_threshold(
    nodes,
    edges,
    node_id_column_name="my_id",
    edge_id_column_name_left="n_1",
    edge_id_column_name_right="n_2",
    db_api=db_api,
    threshold_match_probability=0.5,
).as_pandas_dataframe()

# If the edge id column names are not specified, they are inferred from
# the node id column name with "_l"/"_r" suffixes:
nodes = [
    {"abc": 1},
    {"abc": 2},
    {"abc": 3},
    {"abc": 4},
]

edges = [
    {"abc_l": 1, "abc_r": 2, "match_probability": 0.8},
    {"abc_l": 3, "abc_r": 2, "match_probability": 0.9},
]

cluster_pairwise_predictions_at_threshold(
    nodes,
    edges,
    node_id_column_name="abc",
    db_api=db_api,
    threshold_match_probability=0.5,
).as_pandas_dataframe()
```
### Example: standalone function on linker predictions

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
from splink.internals.clustering import cluster_pairwise_predictions_at_threshold

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

# Split df into two dfs with modulo 2
df_1 = df[df["unique_id"] % 2 == 0]
df_2 = df[df["unique_id"] % 2 == 1]

settings = SettingsCreator(
    link_type="link_and_dedupe",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
        block_on("dob"),
        block_on("city"),
        block_on("email"),
    ],
    max_iterations=2,
)

linker = Linker([df_1, df_2], settings, db_api, input_table_aliases=["a", "b"])
linker._settings_obj._get_source_dataset_column_name_is_required()
pairwise_predictions = linker.inference.predict(threshold_match_weight=-10)
pairwise_predictions.as_pandas_dataframe().sort_values(["unique_id_l", "unique_id_r"])

# Cluster via the linker API...
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise_predictions, 0.00001
)

# ...and via the standalone function, using the same predictions
cluster_pairwise_predictions_at_threshold(
    df,
    pairwise_predictions.physical_name,
    node_id_column_name="unique_id",
    db_api=db_api,
    threshold_match_probability=0.00001,
).as_pandas_dataframe()
```
### Also works for deterministic linking

```python
import pandas as pd

from splink import DuckDBAPI, Linker, SettingsCreator

# Load the data
df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

# Define blocking rules
br_for_predict = [
    "l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob",
    "l.surname = r.surname and l.dob = r.dob and l.email = r.email",
    "l.first_name = r.first_name and l.surname = r.surname and l.email = r.email",
]

# Create settings
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=br_for_predict,
    retain_matching_columns=True,
    retain_intermediate_calculation_columns=True,
)

# Initialize DuckDB API
db_api = DuckDBAPI()

# Create linker
linker = Linker(df, settings, db_api=db_api)

# Perform deterministic linking
df_predict = linker.inference.deterministic_link()

# Cluster predictions (no threshold needed for deterministic links)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict,
)

clusters.as_pandas_dataframe()
```
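Once clusters are materialised, a common follow-up is inspecting cluster sizes or joining labels back to the source data. A minimal pandas sketch; the `cluster_id` / `unique_id` column names and the toy values here are assumptions for illustration, not output captured from this PR:

```python
import pandas as pd

# Hypothetical clustering output, as might come from as_pandas_dataframe():
# one row per input record, labelled with the cluster it belongs to.
df_clusters = pd.DataFrame(
    {
        "unique_id": [1, 2, 3, 4, 5, 6],
        "cluster_id": [1, 1, 1, 4, 4, 6],
    }
)

# Cluster sizes, largest first
sizes = (
    df_clusters.groupby("cluster_id")
    .size()
    .sort_values(ascending=False)
)
print(sizes)
```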



Review comment on the new function signature:

```python
def cluster_pairwise_predictions_at_threshold(
    nodes: AcceptableInputTableType,
```
**RobinL** (Member Author):
This should probably eventually allow the input to also be a `SplinkDataFrame`, but I think that's for a wider PR which allows all public-API functions to accept `SplinkDataFrame`s.

**RobinL** (Member Author):

The `match_probability = 1` hack is no longer required due to this refactor.
