Fix issue with `_source_dataset_col` and `_source_dataset_input_column` #1731
Conversation
@ThomasHepworth I requested a review when the tests were passing and now they're not, so hold off until I fix!
```python
@property
def _verify_link_only_job(self):
    cache = self._intermediate_table_cache
    if "__splink__df_concat_with_tf" not in cache:
```
I think we should have a simplified version of this in either the linker or the validation.
Attempting a link-only job with a single input df now results in an obscure error whose cause isn't immediately obvious.
Example Code

```python
from splink.duckdb.duckdb_linker import DuckDBLinker
import pandas as pd

from tests.basic_settings import get_settings_dict

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

settings = get_settings_dict()
settings["link_type"] = "link_only"

linker = DuckDBLinker(df, settings)
linker.estimate_u_using_random_sampling(target_rows=1e6)
```
Perhaps we should raise an issue and I'll add it to the easy fixes list?
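A minimal sketch of the kind of simplified validation suggested above, failing fast with a clear message before any SQL is generated. The function name, message wording, and the decision to raise for fewer than two inputs are all assumptions for illustration, not Splink's actual validation logic:

```python
def validate_link_only_inputs(link_type, input_tables):
    """Raise an informative error instead of an obscure downstream one.

    Hypothetical helper, not Splink's real API; the check could be
    refined, e.g. to allow a single table with a source dataset column.
    """
    if link_type == "link_only" and len(input_tables) < 2:
        raise ValueError(
            f"link_type '{link_type}' requires at least two input tables, "
            f"but {len(input_tables)} was provided. "
            "Did you mean 'dedupe_only'?"
        )


# Usage: a single input table triggers a clear, early error
try:
    validate_link_only_inputs("link_only", ["df"])
except ValueError as e:
    print(e)
```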
Looks great, thanks!
Closes #1711, which contains 'lab notes' on the different things I tried while working out the problem.
Summary

- The `source_dataset` column is only created if it doesn't already exist
- Fixes the `two_dataset_link_only` case so that `__splink__df_concat_with_tf` is correctly split into a left and right table

Details
I went back to try and identify the root cause of the problem by undoing #1193 and #1204 and trying a different approach.

As a reminder:

- Splink will create a `source_dataset` column and populate it with a name, one for each input table
- The user can provide a `source_dataset_column_name` key in the settings object, to tell Splink not to create this new column
- One must be provided (a `source_dataset_column_name`) for a multi-link job with a single input table

There were several issues:
- Splink created a new `source_dataset` column in all cases, irrespective of whether `source_dataset_column_name` was set. This was particularly problematic when `source_dataset_column_name = "source_dataset"`, because the value in the original data got overwritten.
- There's a special case of `two_dataset_link_only`. The standard approach to blocking is to create a single concatenated table and join it to itself. In the link-only case, this results in within-dataset comparisons being created and then rejected. For the two-dataset case, there's a substantial efficiency gain from splitting the concatenated table into a `left` and a `right` table and doing a left join, because this prevents any within-dataset comparisons needing to be evaluated. The split is harder when a `source_dataset_column_name` is provided, because the concatenated table needs to be filtered on the values in `source_dataset_column_name`, but to know those values you have to look at the data. I've solved this with the sub-select queries `select min(source_dataset_column_name)` and `select max(source_dataset_column_name)`.
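The split-and-filter technique above can be sketched as follows. This is an illustrative example using Python's built-in sqlite3 rather than Splink's actual backends, and the table name `df_concat`, column `ds`, and sample data are invented for the demonstration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE df_concat (unique_id INTEGER, ds TEXT, name TEXT);
INSERT INTO df_concat VALUES
  (1, 'left_table',  'alice'),
  (2, 'left_table',  'bob'),
  (3, 'right_table', 'alice'),
  (4, 'right_table', 'carol');
""")

# The values of the source dataset column aren't known in advance, so
# min/max sub-selects discover them from the data itself
left_rows = con.execute(
    "SELECT * FROM df_concat WHERE ds = (SELECT min(ds) FROM df_concat)"
).fetchall()

# A join between the two halves then produces only between-dataset
# comparisons, never within-dataset ones
pairs = con.execute("""
SELECT l.unique_id, r.unique_id
FROM (SELECT * FROM df_concat
      WHERE ds = (SELECT min(ds) FROM df_concat)) l
JOIN (SELECT * FROM df_concat
      WHERE ds = (SELECT max(ds) FROM df_concat)) r
  ON l.name = r.name
""").fetchall()

print(left_rows)  # only the rows from 'left_table'
print(pairs)      # 'alice' compared across datasets, not within them
```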
Testing scripts
https://gist.github.com/RobinL/8aadeb37c84849276b88269e14bc9cbe