
Clean up materialised tables that Splink creates when we're not using linker.py (profiling) #2058

Merged: RobinL merged 6 commits into splink4_dev from profile_columns_cleanup on Mar 14, 2024

Conversation

@RobinL RobinL (Member) commented Mar 14, 2024

This PR ensures that we delete tables created by the profiling code so the database doesn't get littered with lots of temp tables.

This does mean they won't exist in the cache, so if for some reason you call profile_columns twice, the second call won't be able to use the cache. But I can't see when that'd matter.

Example:
import logging

import duckdb

import splink.comparison_library as cl
from splink import (
    DuckDBAPI,
    SettingsCreator,
    block_on,
    splink_datasets,
)
from splink.profile_data import profile_columns

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city"),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[block_on("first_name")],
    retain_matching_columns=True,
    retain_intermediate_calculation_columns=True,
)

df = splink_datasets.fake_1000
df2 = splink_datasets.fake_1000.copy()
con = duckdb.connect()

con.sql("CREATE TABLE my_table AS SELECT * FROM df")


db_api = DuckDBAPI(connection=con)
db_api.register_multiple_tables(["my_table", df2])
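# List the tables currently in the database (DuckDB exposes a
# sqlite_master compatibility view)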
con.execute("SELECT name FROM sqlite_master WHERE type='table'").df()
# db_api.debug_mode = True
# logging.basicConfig(
#     format="%(message)s",
# )
# splink_logger = logging.getLogger("splink")
# logging.getLogger("splink").setLevel(1)

profile_columns(
    ["my_table", df2], column_expressions=["first_name", "surname"], db_api=db_api
)


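# After profiling, the temp tables Splink materialised should have been
# dropped again, so only the user's tables remain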
con.execute("SELECT name FROM sqlite_master WHERE type='table'").df()

@RobinL RobinL changed the base branch from master to splink4_dev March 14, 2024 09:52
-        self._execute_sql_against_backend(
-            f"CREATE TABLE {table_name} AS SELECT * FROM input"
-        )
+        self._con.register(table_name, input)
@RobinL RobinL (Member, Author) commented Mar 14, 2024
This change has two effects:

  1. It reduces memory usage; see the script below.
  2. It means the table doesn't need to be cleaned up. The CREATE TABLE version creates a materialised table in the database, but since this is not a SplinkDataFrame and is not registered in the intermediate table cache, it isn't cleaned up by delete_tables_created_by_splink_from_db(). That method iterates through the SplinkDataFrames that Splink knows about and allows any whose created_by_splink flag is set to True to be deleted (a simplified sketch follows).
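For point 2, a rough hypothetical sketch of that cleanup logic (only delete_tables_created_by_splink_from_db, _intermediate_table_cache and created_by_splink appear in this PR; the other names are illustrative, not the exact Splink implementation):

def delete_tables_created_by_splink_from_db(self):
    # Drop only the tables Splink itself materialised, leaving
    # user-supplied input tables alone (the drop method name below
    # is illustrative)
    for splink_df in list(self._intermediate_table_cache.values()):
        if splink_df.created_by_splink:
            splink_df.drop_table_from_database_and_remove_from_cache()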
Illustration of how register doesn't use extra memory but CREATE TABLE does:
import os

import duckdb
import pandas as pd
import psutil


def get_size():
    # Return this process's resident set size in human-readable units
    process = psutil.Process(os.getpid())
    num_bytes = process.memory_info().rss  # in bytes
    factor = 1024
    for unit in ["", "K", "M", "G", "T", "P"]:
        if num_bytes < factor:
            return f"{num_bytes:.2f}{unit}B"
        num_bytes /= factor


print("Initial memory usage:", get_size())

df = pd.read_parquet(
    "/Users/robinlinacre/Documents/data_linking/ohid_docker/synthetic_1m.parquet"
)

print("Memory usage after loading DataFrame:", get_size())

con = duckdb.connect()

con.register("registered_table", df)
print("Memory usage after registering DataFrame as a table in DuckDB:", get_size())

con.sql("CREATE TABLE my_table AS SELECT * FROM registered_table")

print("Memory usage after creating a new table from the registered table:", get_size())
Initial memory usage: 119.45MB
Memory usage after loading DataFrame: 865.04MB
Memory usage after registering DataFrame as a table in DuckDB: 867.31MB
Memory usage after creating a new table from the registered table: 1000.46MB
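So registering the 865 MB DataFrame adds only a couple of megabytes, whereas CREATE TABLE copies the data into DuckDB's own storage, adding roughly a further 133 MB.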

@RobinL RobinL (Member, Author):
In terms of what this does, see the DuckDB docs:

DuckDB also supports “registering” a DataFrame or Arrow object as a virtual table, comparable to a SQL VIEW. This is useful when querying a DataFrame/Arrow object that is stored in another way (as a class variable, or a value in a dictionary). Below is a Pandas example
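A minimal version of the Pandas example those docs refer to (illustrative, not verbatim from the DuckDB docs):

import duckdb
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
con = duckdb.connect()

# Register the DataFrame as a virtual table, comparable to a SQL VIEW over
# the in-memory object; no data is copied into DuckDB's own storage
con.register("df_view", df)
print(con.sql("SELECT SUM(a) FROM df_view").fetchall())  # [(6,)]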

Contributor commented:

Yeah, that makes sense. I originally switched to using tables when I implemented the DatabaseAPI, as at that time it was difficult getting things working with some things being tables and some being views (see note under 'DuckDB registration'), and that was easiest for the time being. But happy to go for the lighter approach if that works okay.

Contributor commented:

Ah, looks like there is still some sort of clash with views/tables. But yeah, if that can be resolved I'm all in favour of the reduced-memory option.

@@ -61,11 +61,7 @@ def _table_registration(self, input, table_name) -> None:
        elif isinstance(input, list):
            input = pd.DataFrame.from_records(input)

-        # Registration errors will automatically
@RobinL RobinL (Member, Author) commented Mar 14, 2024

Do you remember the details of this comment? Is it intended to contrast with the previous behaviour of register, or is it just a general comment pointing out that we don't need any error-checking logic?

I think you also get error checking when using register:

con = duckdb.connect()
con.register("registered_table", [{'a': 1}])
> InvalidInputException: Invalid Input Error: Python Object list not suitable to be registered as a view

Contributor commented:

I think this is just a general comment: it has been around since Splink 3 already, and seems to originate (with a slight expansion) from quite a while back.

@@ -339,3 +339,8 @@ def remove_splinkdataframe_from_cache(self, splink_dataframe: SplinkDataFrame):

for k in keys_to_delete:
del self._intermediate_table_cache[k]

def delete_tables_created_by_splink_from_db(self):
@RobinL RobinL (Member, Author) commented Mar 14, 2024

Moved this from linker.py to the database_api because otherwise it isn't accessible to methods like profile_columns, which use the db_api but not the linker.
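A hypothetical sketch of the resulting call pattern (the body of profile_columns here is illustrative, not the real implementation):

def profile_columns(table_or_tables, column_expressions, db_api):
    ...  # run the profiling SQL, materialising intermediate tables
    # Because the cleanup method now lives on the database API, this
    # linker-free entry point can drop the tables it created:
    db_api.delete_tables_created_by_splink_from_db()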

@RobinL RobinL changed the title [WIP] Profile columns cleanup [WIP] Clean up materialised tables that Splink creates when we're not using linker.py (profiling) Mar 14, 2024
@RobinL RobinL requested a review from ADBond March 14, 2024 11:02
@RobinL RobinL changed the title [WIP] Clean up materialised tables that Splink creates when we're not using linker.py (profiling) Clean up materialised tables that Splink creates when we're not using linker.py (profiling) Mar 14, 2024
@RobinL RobinL (Member, Author) commented Mar 14, 2024

@ADBond apologies! Tests passing now. Feel free to merge the mypy one first, and I can update this in case it introduces a mypy failure!

@ADBond ADBond (Contributor) left a comment

Great, all looks good to me 👍
Have merged the mypy branch, so happy for you to merge this if there's no clash there.

@RobinL RobinL merged commit a6ed0bd into splink4_dev Mar 14, 2024
11 checks passed
@RobinL RobinL deleted the profile_columns_cleanup branch March 14, 2024 15:49