-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up materialised tables that Splink creates when we're not using linker.py (profiling) #2058
Changes from all commits
babbc27
ad7773d
9380361
dfa6570
12884a1
ace09c4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -55,17 +55,24 @@ def __init__( | |
""" | ||
) | ||
|
||
def delete_table_from_database(self, name: str): | ||
# If the table is in fact a pandas dataframe that's been registered using | ||
# duckdb con.register() then DROP TABLE will fail with | ||
# Catalog Error: x is of type View | ||
try: | ||
drop_sql = f"DROP TABLE IF EXISTS {name}" | ||
self._execute_sql_against_backend(drop_sql) | ||
except duckdb.CatalogException: | ||
drop_sql = f"DROP VIEW IF EXISTS {name}" | ||
self._execute_sql_against_backend(drop_sql) | ||
|
||
def _table_registration(self, input, table_name) -> None: | ||
if isinstance(input, dict): | ||
input = pd.DataFrame(input) | ||
elif isinstance(input, list): | ||
input = pd.DataFrame.from_records(input) | ||
|
||
# Registration errors will automatically | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you remember the details of this comment - is it intended to distinguish between the previous behaviour of I think you also get error checking when using register:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this is just a general comment - it is around in Splink 3 already, and seems to originate (with a slight expansion) from quite a while back |
||
# occur if an invalid data type is passed as an argument | ||
self._execute_sql_against_backend( | ||
f"CREATE TABLE {table_name} AS SELECT * FROM input" | ||
) | ||
self._con.register(table_name, input) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This change has two effects:
Illustration of how register doesn't use memory but create table doesimport os
import duckdb
import pandas as pd
import psutil
def get_size():
process = psutil.Process(os.getpid())
bytes = process.memory_info().rss # in bytes
factor = 1024
for unit in ["", "K", "M", "G", "T", "P"]:
if bytes < factor:
return f"{bytes:.2f}{unit}B"
bytes /= factor
print("Initial memory usage:", get_size())
df = pd.read_parquet(
"/Users/robinlinacre/Documents/data_linking/ohid_docker/synthetic_1m.parquet"
)
print("Memory usage after loading DataFrame:", get_size())
con = duckdb.connect()
con.register("registered_table", df)
print("Memory usage after registering DataFrame as a table in DuckDB:", get_size())
con.sql("CREATE TABLE my_table AS SELECT * FROM registered_table")
print("Memory usage after creating a new table from the registered table:", get_size())
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In terms of what this does, see here
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah that makes sense. I originally switched to using tables when I implemented the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ah looks like there is still some sort of clash with views/tables. but yeah if that can be resolved i'm all in favour of the reduced-memory option |
||
|
||
def table_to_splink_dataframe( | ||
self, templated_name, physical_name | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved linker.py to the database_api because otherwise it isn't accessible to methods like
profile_columns
which are using the dbapi but not the linker