Cluster without linker #2412

Open · wants to merge 9 commits into `master`
Conversation

**RobinL** (Member) commented Sep 18, 2024:

We've heard from several people who want to cluster without a linker. For instance, you may be combining predictions from multiple models and want to cluster the combined results (e.g. #2358).

This PR allows the clustering algorithm to be used without needing a linker, similar to exploratory analysis.
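Conceptually, clustering at a threshold is connected components over the edges whose `match_probability` clears the threshold. A minimal plain-Python sketch of that idea (illustration only, not Splink's implementation):

```python
# Connected components via union-find: nodes joined by any edge whose
# match_probability >= threshold end up in the same cluster.

def cluster_edges(node_ids, edges, threshold):
    # Each node starts as its own cluster root
    parent = {n: n for n in node_ids}

    def find(x):
        # Follow parent pointers to the root, compressing the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the endpoints of every edge that clears the threshold
    for left, right, prob in edges:
        if prob >= threshold:
            parent[find(left)] = find(right)

    # Group nodes by their root to form the clusters
    clusters = {}
    for n in node_ids:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())


result = cluster_edges(
    [1, 2, 3, 4, 5, 6],
    [(1, 2, 0.8), (3, 2, 0.9), (4, 5, 0.99)],
    threshold=0.5,
)
print(result)  # clusters {1, 2, 3}, {4, 5}, {6} (order may vary)
```

This mirrors the first example below: nodes 1–3 are connected through node 2, nodes 4 and 5 are linked directly, and node 6 has no qualifying edge so it forms a singleton cluster.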

### Example without linker

```python
from splink import DuckDBAPI
from splink.internals.clustering import cluster_pairwise_predictions_at_threshold

db_api = DuckDBAPI()

nodes = [
    {"my_id": 1},
    {"my_id": 2},
    {"my_id": 3},
    {"my_id": 4},
    {"my_id": 5},
    {"my_id": 6},
]

edges = [
    {"n_1": 1, "n_2": 2, "match_probability": 0.8},
    {"n_1": 3, "n_2": 2, "match_probability": 0.9},
    {"n_1": 4, "n_2": 5, "match_probability": 0.99},
]

cluster_pairwise_predictions_at_threshold(
    nodes,
    edges,
    node_id_column_name="my_id",
    edge_id_column_name_left="n_1",
    edge_id_column_name_right="n_2",
    db_api=db_api,
    threshold_match_probability=0.5,
).as_pandas_dataframe()

# If the edge id column names are not specified, they are inferred from
# the node id column name with "_l"/"_r" suffixes:
nodes = [
    {"abc": 1},
    {"abc": 2},
    {"abc": 3},
    {"abc": 4},
]

edges = [
    {"abc_l": 1, "abc_r": 2, "match_probability": 0.8},
    {"abc_l": 3, "abc_r": 2, "match_probability": 0.9},
]

cluster_pairwise_predictions_at_threshold(
    nodes,
    edges,
    node_id_column_name="abc",
    db_api=db_api,
    threshold_match_probability=0.5,
).as_pandas_dataframe()
```
### Example: standalone function on linker predictions

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets
from splink.internals.clustering import cluster_pairwise_predictions_at_threshold

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

# Split df into two dfs with modulo 2
df_1 = df[df["unique_id"] % 2 == 0]
df_2 = df[df["unique_id"] % 2 == 1]

settings = SettingsCreator(
    link_type="link_and_dedupe",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name"),
        block_on("surname"),
        block_on("dob"),
        block_on("city"),
        block_on("email"),
    ],
    max_iterations=2,
)

linker = Linker([df_1, df_2], settings, db_api, input_table_aliases=["a", "b"])
linker._settings_obj._get_source_dataset_column_name_is_required()
pairwise_predictions = linker.inference.predict(threshold_match_weight=-10)
pairwise_predictions.as_pandas_dataframe().sort_values(["unique_id_l", "unique_id_r"])

# Cluster via the linker API...
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise_predictions, 0.00001
)

# ...and via the standalone function, using the same predictions
cluster_pairwise_predictions_at_threshold(
    df,
    pairwise_predictions.physical_name,
    node_id_column_name="unique_id",
    db_api=db_api,
    threshold_match_probability=0.00001,
).as_pandas_dataframe()
```
### Also works for deterministic linking

```python
import pandas as pd

from splink import DuckDBAPI, Linker, SettingsCreator

# Load the data
df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

# Define blocking rules
br_for_predict = [
    "l.first_name = r.first_name and l.surname = r.surname and l.dob = r.dob",
    "l.surname = r.surname and l.dob = r.dob and l.email = r.email",
    "l.first_name = r.first_name and l.surname = r.surname and l.email = r.email",
]

# Create settings
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=br_for_predict,
    retain_matching_columns=True,
    retain_intermediate_calculation_columns=True,
)

# Initialize DuckDB API
db_api = DuckDBAPI()

# Create linker
linker = Linker(df, settings, db_api=db_api)

# Perform deterministic linking
df_predict = linker.inference.deterministic_link()

# Cluster predictions (no threshold needed for deterministic links)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    df_predict,
)

clusters.as_pandas_dataframe()
```
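Once clusters are materialised, a common follow-up is inspecting cluster sizes or joining labels back to the source data. A minimal pandas sketch; the `cluster_id` / `unique_id` column names and the toy values here are assumptions for illustration, not output captured from this PR:

```python
import pandas as pd

# Hypothetical clustering output, as might come from as_pandas_dataframe():
# one row per input record, labelled with the cluster it belongs to.
df_clusters = pd.DataFrame(
    {
        "unique_id": [1, 2, 3, 4, 5, 6],
        "cluster_id": [1, 1, 1, 4, 4, 6],
    }
)

# Cluster sizes, largest first
sizes = (
    df_clusters.groupby("cluster_id")
    .size()
    .sort_values(ascending=False)
)
print(sizes)
```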



Review comment on the new function signature:

```python
def cluster_pairwise_predictions_at_threshold(
    nodes: AcceptableInputTableType,
```
**RobinL** (Member Author):
This should probably eventually allow the input to also be a `SplinkDataFrame`, but I think that's for a wider PR which allows all public-API functions to accept `SplinkDataFrame`s.

**RobinL** (Member Author):

The `match_probability = 1` hack is no longer required due to this refactor.
