
Clean up materialised tables that Splink creates when we're not using linker.py (profiling) #2058

Merged: RobinL merged 6 commits into splink4_dev from profile_columns_cleanup on Mar 14, 2024

Conversation

@RobinL RobinL (Member) commented Mar 14, 2024

This PR ensures that we delete tables created by the profiling code so the database doesn't get littered with lots of temp tables.

This does mean they won't exist in the cache, so if for some reason you call profile_columns twice, the second call won't be able to use the cache. But I can't see when that'd matter.

Example:
import logging

import duckdb

import splink.comparison_library as cl
from splink import (
    DuckDBAPI,
    SettingsCreator,
    block_on,
    splink_datasets,
)
from splink.profile_data import profile_columns

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.ExactMatch("first_name"),
        cl.ExactMatch("surname"),
        cl.ExactMatch("dob"),
        cl.ExactMatch("city"),
        cl.ExactMatch("email"),
    ],
    blocking_rules_to_generate_predictions=[block_on("first_name")],
    retain_matching_columns=True,
    retain_intermediate_calculation_columns=True,
)

df = splink_datasets.fake_1000
df2 = splink_datasets.fake_1000.copy()
con = duckdb.connect()

con.sql("CREATE TABLE my_table AS SELECT * FROM df")


db_api = DuckDBAPI(connection=con)
db_api.register_multiple_tables(["my_table", df2])
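# List the tables currently in the database (DuckDB exposes a
# sqlite_master compatibility view)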
con.execute("SELECT name FROM sqlite_master WHERE type='table'").df()
# db_api.debug_mode = True
# logging.basicConfig(
#     format="%(message)s",
# )
# splink_logger = logging.getLogger("splink")
# logging.getLogger("splink").setLevel(1)

profile_columns(
    ["my_table", df2], column_expressions=["first_name", "surname"], db_api=db_api
)


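# After profiling, the temp tables Splink materialised should have been
# dropped again, so only the user's tables remain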
con.execute("SELECT name FROM sqlite_master WHERE type='table'").df()

@RobinL RobinL changed the base branch from master to splink4_dev March 14, 2024 09:52
-        self._execute_sql_against_backend(
-            f"CREATE TABLE {table_name} AS SELECT * FROM input"
-        )
+        self._con.register(table_name, input)
@RobinL RobinL (Member, Author) commented Mar 14, 2024
This change has two effects:

  1. It reduces memory usage; see the script below.
  2. It means the table doesn't need to be cleaned up. The CREATE TABLE version creates a materialised table in the database, but since this is not a SplinkDataFrame and is not registered in the intermediate table cache, it isn't cleaned up by delete_tables_created_by_splink_from_db(). That method iterates through the SplinkDataFrames that Splink knows about and allows any whose created_by_splink flag is set to True to be deleted (a simplified sketch follows).
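For point 2, a rough hypothetical sketch of that cleanup logic (only delete_tables_created_by_splink_from_db, _intermediate_table_cache and created_by_splink appear in this PR; the other names are illustrative, not the exact Splink implementation):

def delete_tables_created_by_splink_from_db(self):
    # Drop only the tables Splink itself materialised, leaving
    # user-supplied input tables alone (the drop method name below
    # is illustrative)
    for splink_df in list(self._intermediate_table_cache.values()):
        if splink_df.created_by_splink:
            splink_df.drop_table_from_database_and_remove_from_cache()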
Illustration of how register doesn't use extra memory but CREATE TABLE does:
import os

import duckdb
import pandas as pd
import psutil


def get_size():
    # Return this process's resident set size in human-readable units
    process = psutil.Process(os.getpid())
    num_bytes = process.memory_info().rss  # in bytes
    factor = 1024
    for unit in ["", "K", "M", "G", "T", "P"]:
        if num_bytes < factor:
            return f"{num_bytes:.2f}{unit}B"
        num_bytes /= factor


print("Initial memory usage:", get_size())

df = pd.read_parquet(
    "/Users/robinlinacre/Documents/data_linking/ohid_docker/synthetic_1m.parquet"
)

print("Memory usage after loading DataFrame:", get_size())

con = duckdb.connect()

con.register("registered_table", df)
print("Memory usage after registering DataFrame as a table in DuckDB:", get_size())

con.sql("CREATE TABLE my_table AS SELECT * FROM registered_table")

print("Memory usage after creating a new table from the registered table:", get_size())
Initial memory usage: 119.45MB
Memory usage after loading DataFrame: 865.04MB
Memory usage after registering DataFrame as a table in DuckDB: 867.31MB
Memory usage after creating a new table from the registered table: 1000.46MB
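So registering the 865 MB DataFrame adds only a couple of megabytes, whereas CREATE TABLE copies the data into DuckDB's own storage, adding roughly a further 133 MB.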

@RobinL RobinL (Member, Author):
In terms of what this does, see the DuckDB docs:

DuckDB also supports “registering” a DataFrame or Arrow object as a virtual table, comparable to a SQL VIEW. This is useful when querying a DataFrame/Arrow object that is stored in another way (as a class variable, or a value in a dictionary). Below is a Pandas example
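A minimal version of the Pandas example those docs refer to (illustrative, not verbatim from the DuckDB docs):

import duckdb
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
con = duckdb.connect()

# Register the DataFrame as a virtual table, comparable to a SQL VIEW over
# the in-memory object; no data is copied into DuckDB's own storage
con.register("df_view", df)
print(con.sql("SELECT SUM(a) FROM df_view").fetchall())  # [(6,)]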

Contributor commented:

Yeah, that makes sense. I originally switched to using tables when I implemented the DatabaseAPI, as at that time it was difficult getting things working with some things being tables and some being views (see note under 'DuckDB registration'), and that was easiest for the time being. But happy to go for the lighter approach if that works okay.

Contributor commented:

Ah, looks like there is still some sort of clash with views/tables. But yeah, if that can be resolved I'm all in favour of the reduced-memory option.

@@ -61,11 +61,7 @@ def _table_registration(self, input, table_name) -> None:
        elif isinstance(input, list):
            input = pd.DataFrame.from_records(input)

-        # Registration errors will automatically
@RobinL RobinL (Member, Author) commented Mar 14, 2024

Do you remember the details of this comment? Is it intended to contrast with the previous behaviour of register, or is it just a general comment pointing out that we don't need any error-checking logic?

I think you also get error checking when using register:

con = duckdb.connect()
con.register("registered_table", [{'a': 1}])
> InvalidInputException: Invalid Input Error: Python Object list not suitable to be registered as a view

Contributor commented:

I think this is just a general comment: it has been around since Splink 3 already, and seems to originate (with a slight expansion) from quite a while back.

@@ -339,3 +339,8 @@ def remove_splinkdataframe_from_cache(self, splink_dataframe: SplinkDataFrame):

for k in keys_to_delete:
del self._intermediate_table_cache[k]

def delete_tables_created_by_splink_from_db(self):
@RobinL RobinL (Member, Author) commented Mar 14, 2024

Moved this from linker.py to the database_api because otherwise it isn't accessible to methods like profile_columns, which use the db_api but not the linker.
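A hypothetical sketch of the resulting call pattern (the body of profile_columns here is illustrative, not the real implementation):

def profile_columns(table_or_tables, column_expressions, db_api):
    ...  # run the profiling SQL, materialising intermediate tables
    # Because the cleanup method now lives on the database API, this
    # linker-free entry point can drop the tables it created:
    db_api.delete_tables_created_by_splink_from_db()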

@RobinL RobinL changed the title [WIP] Profile columns cleanup [WIP] Clean up materialised tables that Splink creates when we're not using linker.py (profiling) Mar 14, 2024
@RobinL RobinL requested a review from ADBond March 14, 2024 11:02
@RobinL RobinL changed the title [WIP] Clean up materialised tables that Splink creates when we're not using linker.py (profiling) Clean up materialised tables that Splink creates when we're not using linker.py (profiling) Mar 14, 2024
@RobinL RobinL (Member, Author) commented Mar 14, 2024

@ADBond apologies! Tests passing now. Feel free to merge the mypy one first, and I can update this in case it introduces a mypy failure!

@ADBond ADBond (Contributor) left a comment

Great, all looks good to me 👍
Have merged the mypy branch, so happy for you to merge this if there's no clash there.

@RobinL RobinL merged commit a6ed0bd into splink4_dev Mar 14, 2024
11 checks passed
@RobinL RobinL deleted the profile_columns_cleanup branch March 14, 2024 15:49