Clean up materialised tables that Splink creates when we're not using linker.py (profiling) #2058
Conversation
```diff
-        self._execute_sql_against_backend(
-            f"CREATE TABLE {table_name} AS SELECT * FROM input"
-        )
+        self._con.register(table_name, input)
```
This change has two effects:
- It reduces memory usage, see script below.
- It means the table doesn't need to be cleaned up. The `CREATE TABLE` version creates a materialised table in the database. But since this is not a SplinkDataFrame and is not registered in the intermediate table cache, it isn't cleaned up by `delete_tables_created_by_splink_from_db()`. That method iterates through the SplinkDataFrames that Splink knows about and deletes those whose `created_by_splink` attribute is set to True.
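A minimal sketch of that cleanup pattern (the class and attribute names below are simplified stand-ins for illustration, not Splink's actual implementation):

```python
# Hypothetical sketch of the cleanup pattern described above; names are
# simplified stand-ins, not Splink's real classes.

class SplinkDataFrame:
    def __init__(self, physical_name, created_by_splink):
        self.physical_name = physical_name
        self.created_by_splink = created_by_splink


class DatabaseAPI:
    def __init__(self):
        self._intermediate_table_cache = {}
        self.dropped = []  # stands in for issuing DROP TABLE statements

    def delete_tables_created_by_splink_from_db(self):
        # Only tables Splink itself materialised are dropped; tables the
        # user supplied (created_by_splink=False) are left alone.
        for df in list(self._intermediate_table_cache.values()):
            if df.created_by_splink:
                self.dropped.append(df.physical_name)


api = DatabaseAPI()
api._intermediate_table_cache = {
    "a": SplinkDataFrame("__splink__df_concat", created_by_splink=True),
    "b": SplinkDataFrame("user_input", created_by_splink=False),
}
api.delete_tables_created_by_splink_from_db()
```

The key point is that a table created with a raw `CREATE TABLE` never enters `_intermediate_table_cache`, so a loop like this can never reach it.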
Illustration of how `register` doesn't use extra memory but `CREATE TABLE` does
```python
import os

import duckdb
import pandas as pd
import psutil


def get_size():
    process = psutil.Process(os.getpid())
    bytes = process.memory_info().rss  # resident set size, in bytes
    factor = 1024
    for unit in ["", "K", "M", "G", "T", "P"]:
        if bytes < factor:
            return f"{bytes:.2f}{unit}B"
        bytes /= factor


print("Initial memory usage:", get_size())

df = pd.read_parquet(
    "/Users/robinlinacre/Documents/data_linking/ohid_docker/synthetic_1m.parquet"
)
print("Memory usage after loading DataFrame:", get_size())

con = duckdb.connect()
con.register("registered_table", df)
print("Memory usage after registering DataFrame as a table in DuckDB:", get_size())

con.sql("CREATE TABLE my_table AS SELECT * FROM registered_table")
print("Memory usage after creating a new table from the registered table:", get_size())
```

Output:

```
Initial memory usage: 119.45MB
Memory usage after loading DataFrame: 865.04MB
Memory usage after registering DataFrame as a table in DuckDB: 867.31MB
Memory usage after creating a new table from the registered table: 1000.46MB
```
In terms of what this does, see here:

> DuckDB also supports “registering” a DataFrame or Arrow object as a virtual table, comparable to a SQL VIEW. This is useful when querying a DataFrame/Arrow object that is stored in another way (as a class variable, or a value in a dictionary). Below is a Pandas example.
Yeah that makes sense. I originally switched to using tables when I implemented the DatabaseAPI, as at that time it was difficult getting things working with some things being tables and some being views (see note under 'DuckDB registration'), and that was easiest for the time being. But happy to go for the lighter approach if that works okay.
Ah, looks like there is still some sort of clash with views/tables. But yeah, if that can be resolved I'm all in favour of the reduced-memory option.
```diff
@@ -61,11 +61,7 @@ def _table_registration(self, input, table_name) -> None:
         elif isinstance(input, list):
             input = pd.DataFrame.from_records(input)

         # Registration errors will automatically
```
Do you remember the details of this comment - is it intended to contrast with the previous behaviour of `register`, or just a general comment to point out that we don't need any error-checking logic?

I think you also get error checking when using `register`:

```python
con = duckdb.connect()
con.register("registered_table", [{'a': 1}])
```

```
InvalidInputException: Invalid Input Error: Python Object list not suitable to be registered as a view
```
I think this is just a general comment - it is around in Splink 3 already, and seems to originate (with a slight expansion) from quite a while back
```diff
@@ -339,3 +339,8 @@ def remove_splinkdataframe_from_cache(self, splink_dataframe: SplinkDataFrame):

         for k in keys_to_delete:
             del self._intermediate_table_cache[k]

+    def delete_tables_created_by_splink_from_db(self):
```
Moved this from linker.py to the database_api, because otherwise it isn't accessible to methods like `profile_columns`, which use the dbapi but not the linker.

@ADBond apologies! Tests passing now. Feel free to merge the mypy one first, and I can update this in case it introduces a mypy failure!
Great - all looks good to me 👍
Have merged mypy branch, so happy for you to merge this if there's no clash there
This PR ensures that we delete tables created by the profiling code, so the database doesn't get littered with lots of temp tables.

This does mean they won't exist in the cache, meaning that if for some reason you want to call `profile_columns` twice, the second call won't be able to use the cache. But I can't see when that'd matter.
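A toy illustration of that trade-off (the cache structure and function name below are invented for illustration, not Splink's real API): once the materialised tables are deleted, a repeat profiling call misses the cache and recomputes.

```python
# Toy illustration of the trade-off described above; the cache layout and
# function name are invented, not Splink's real API.
cache = {}
compute_count = 0


def profile_columns_sketch(table_name):
    global compute_count
    key = f"profile::{table_name}"
    if key in cache:
        return cache[key]        # cache hit: reuse the earlier result
    compute_count += 1           # cache miss: the expensive SQL would run here
    result = {"table": table_name}
    cache[key] = result
    return result


profile_columns_sketch("input")   # first call: computes
cache.clear()                     # temp tables deleted -> cached entries gone
profile_columns_sketch("input")   # second call: must recompute
```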