
Fix issue with _source_dataset_col and _source_dataset_input_column #1731

Conversation


@RobinL RobinL commented Nov 13, 2023

Closes #1711, which contains 'lab notes' on the different things I tried in order to work out the problem

Summary

  • Ensure a source_dataset column is only created if it doesn't already exist
  • Fix the two_dataset_link_only case so that the __splink__df_concat_with_tf is correctly split into a right and left table

Details

I went back to identify the root cause of the problem by undoing #1193 and #1204 and trying a different approach.

As a reminder:

  • Where multiple input datasets are provided, the default behaviour is for Splink to create a source_dataset column and populate it with a name for each input table
  • If this column already exists, the user has the option to set a source_dataset_column_name key in the settings object to tell Splink not to create this new column
  • It's also possible for the user to provide a single input table containing a source dataset column (identified by source_dataset_column_name), for a multi-link job with a single input table
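The single-table case in the last bullet can be sketched as follows (a minimal illustration; the data values are invented, and only the link_type and source_dataset_column_name settings keys are taken from the description above):

```python
# A single input table for a link-only job, where the table already
# carries a source dataset column. The rows are invented for illustration.
single_table = [
    {"unique_id": 1, "first_name": "amy", "source_dataset": "crm"},
    {"unique_id": 2, "first_name": "amy", "source_dataset": "web"},
]

settings = {
    "link_type": "link_only",
    # Tells Splink the column already exists, so Splink should not create
    # (and must not overwrite) a new source_dataset column:
    "source_dataset_column_name": "source_dataset",
}

# The distinct values of that column define the datasets to be linked:
datasets = sorted({row["source_dataset"] for row in single_table})
```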

There were several issues:

  • Splink created a new source_dataset column in all cases, irrespective of whether source_dataset_column_name was set. This was particularly problematic when source_dataset_column_name = "source_dataset" because the value in the original data got overwritten

  • There's a special case for two_dataset_link_only. The standard approach to blocking is to create a single concatenated table and join it to itself. In the link only case, this results in within-dataset comparisons being created and then rejected. For the two dataset case, there's a substantial efficiency gain from splitting the concatenated table into a left and a right table and doing a left join, because this prevents any within-dataset comparisons needing to be evaluated.

    • Splitting the table into two is difficult when source_dataset_column_name is provided, because the concatenated table needs to be filtered on the values in source_dataset_column_name, but to know those values you have to look in the data. I've solved this with sub-select queries: select min(source_dataset_column_name) and select max(source_dataset_column_name)
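The min/max sub-select trick above can be illustrated outside Splink (a sketch using sqlite3 in place of Splink's SQL backend; the table name and dataset values here are assumptions, not Splink internals):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df_concat (unique_id INTEGER, source_dataset TEXT)")
con.executemany(
    "INSERT INTO df_concat VALUES (?, ?)",
    [(1, "df_left"), (2, "df_left"), (3, "df_right"), (4, "df_right")],
)

# Split the concatenated table without knowing the two dataset names
# up front: filter on the min and max of the source dataset column.
left_rows = con.execute(
    "SELECT * FROM df_concat "
    "WHERE source_dataset = (SELECT min(source_dataset) FROM df_concat)"
).fetchall()
right_rows = con.execute(
    "SELECT * FROM df_concat "
    "WHERE source_dataset = (SELECT max(source_dataset) FROM df_concat)"
).fetchall()
```

Because there are exactly two datasets, min and max are guaranteed to pick out one dataset each, so no prior knowledge of the values is needed.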

Testing scripts

https://gist.github.com/RobinL/8aadeb37c84849276b88269e14bc9cbe

@RobinL RobinL changed the title (WIP) Experiment with sorting out _source_dataset_col and _source_dataset_input_column Fix issue with _source_dataset_col and _source_dataset_input_column Nov 15, 2023

RobinL commented Nov 15, 2023

@ThomasHepworth I requested a review when the tests were passing and now they're not, so hold off until I fix!!

Comment on lines -417 to -420
@property
def _verify_link_only_job(self):
cache = self._intermediate_table_cache
if "__splink__df_concat_with_tf" not in cache:

I think we should have a simplified version of this in either the linker or the validation.

Attempting a link only job with a single input df now results in a random error that isn't immediately obvious.

Example Code
from splink.duckdb.duckdb_linker import DuckDBLinker
import pandas as pd

from tests.basic_settings import get_settings_dict

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

settings = get_settings_dict()
settings["link_type"] = "link_only"

linker = DuckDBLinker(
    df,
    settings
)

linker.estimate_u_using_random_sampling(target_rows=1e6)

Perhaps we should raise an issue and I'll add it to the easy fixes list?
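A simplified check along the lines this comment suggests might look like the following (a hypothetical sketch: the function name, signature, and error wording are invented, not Splink's actual validation code):

```python
def validate_link_only_input(link_type, input_tables, source_dataset_column_name=None):
    """Fail fast when a link-only job cannot possibly produce links.

    A single input table is only valid for link_only if the user has
    pointed Splink at an existing source dataset column via
    source_dataset_column_name (hypothetical helper, for illustration).
    """
    if (
        link_type == "link_only"
        and len(input_tables) < 2
        and source_dataset_column_name is None
    ):
        raise ValueError(
            "link_type 'link_only' requires two or more input tables, "
            "or a single table with source_dataset_column_name set"
        )
```

Raising early like this would replace the "random error that isn't immediately obvious" with a clear message at linker construction time.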


@ThomasHepworth ThomasHepworth left a comment


Looks great, thanks!

@RobinL RobinL merged commit 94779ba into master Nov 15, 2023
8 checks passed
@RobinL RobinL deleted the 1711-bug-settings_obj_source_dataset_col-and-settings_obj_source_dataset_input_column branch November 15, 2023 16:26
Development

Successfully merging this pull request may close these issues.

[BUG] settings_obj._source_dataset_col and settings_obj._source_dataset_input_column