
Fix issue with _source_dataset_col and _source_dataset_input_column #1731

Conversation


@RobinL RobinL commented Nov 13, 2023

Closes #1711, which contains 'lab notes' on the different things I tried in order to work out the problem

Summary

  • Ensure a source_dataset column is only created if it doesn't already exist
  • Fix the two_dataset_link_only case so that the __splink__df_concat_with_tf is correctly split into a right and left table

Details

I went back to identify the root cause of the problem by undoing #1193 and #1204 and trying a different approach.

As a reminder:

  • Where multiple input datasets are provided, the default behaviour is for Splink to create a source_dataset column and populate it with a name for each input table
  • If this column already exists, the user has the option to set a source_dataset_column_name key in the settings object to tell Splink not to create this new column
  • It's also possible for the user to provide a single input table containing a source dataset column (identified by source_dataset_column_name), for a multi-link job with a single input table
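The single-table case in the last bullet can be sketched as follows (a minimal illustration; the data values are invented, and only the link_type and source_dataset_column_name settings keys are taken from the description above):

```python
# A single input table for a link-only job, where the table already
# carries a source dataset column. The rows are invented for illustration.
single_table = [
    {"unique_id": 1, "first_name": "amy", "source_dataset": "crm"},
    {"unique_id": 2, "first_name": "amy", "source_dataset": "web"},
]

settings = {
    "link_type": "link_only",
    # Tells Splink the column already exists, so Splink should not create
    # (and must not overwrite) a new source_dataset column:
    "source_dataset_column_name": "source_dataset",
}

# The distinct values of that column define the datasets to be linked:
datasets = sorted({row["source_dataset"] for row in single_table})
```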

There were several issues:

  • Splink created a new source_dataset column in all cases, irrespective of whether source_dataset_column_name was set. This was particularly problematic when source_dataset_column_name = "source_dataset" because the value in the original data got overwritten

  • There's a special case for two_dataset_link_only. The standard approach to blocking is to create a single concatenated table and join it to itself. In the link only case, this results in within-dataset comparisons being created and then rejected. For the two dataset case, there's a substantial efficiency gain from splitting the concatenated table into a left and a right table and doing a left join, because this prevents any within-dataset comparisons needing to be evaluated.

    • Splitting the table into two is difficult when source_dataset_column_name is provided, because the concatenated table needs to be filtered on the values in source_dataset_column_name, but to know those values you have to look in the data. I've solved this with sub-select queries: select min(source_dataset_column_name) and select max(source_dataset_column_name)
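The min/max sub-select trick above can be illustrated outside Splink (a sketch using sqlite3 in place of Splink's SQL backend; the table name and dataset values here are assumptions, not Splink internals):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df_concat (unique_id INTEGER, source_dataset TEXT)")
con.executemany(
    "INSERT INTO df_concat VALUES (?, ?)",
    [(1, "df_left"), (2, "df_left"), (3, "df_right"), (4, "df_right")],
)

# Split the concatenated table without knowing the two dataset names
# up front: filter on the min and max of the source dataset column.
left_rows = con.execute(
    "SELECT * FROM df_concat "
    "WHERE source_dataset = (SELECT min(source_dataset) FROM df_concat)"
).fetchall()
right_rows = con.execute(
    "SELECT * FROM df_concat "
    "WHERE source_dataset = (SELECT max(source_dataset) FROM df_concat)"
).fetchall()
```

Because there are exactly two datasets, min and max are guaranteed to pick out one dataset each, so no prior knowledge of the values is needed.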

Testing scripts

https://gist.github.com/RobinL/8aadeb37c84849276b88269e14bc9cbe

@RobinL RobinL changed the title (WIP) Experiment with sorting out _source_dataset_col and _source_dataset_input_column Fix issue with _source_dataset_col and _source_dataset_input_column Nov 15, 2023

RobinL commented Nov 15, 2023

@ThomasHepworth I requested a review when the tests were passing and now they're not, so hold off until I fix!!

Comment on lines -417 to -420
@property
def _verify_link_only_job(self):
cache = self._intermediate_table_cache
if "__splink__df_concat_with_tf" not in cache:

I think we should have a simplified version of this in either the linker or the validation.

Attempting a link only job with a single input df now results in a random error that isn't immediately obvious.

Example Code
from splink.duckdb.duckdb_linker import DuckDBLinker
import pandas as pd

from tests.basic_settings import get_settings_dict

df = pd.read_csv("./tests/datasets/fake_1000_from_splink_demos.csv")

settings = get_settings_dict()
settings["link_type"] = "link_only"

linker = DuckDBLinker(
    df,
    settings
)

linker.estimate_u_using_random_sampling(target_rows=1e6)

Perhaps we should raise an issue and I'll add it to the easy fixes list?
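A simplified check along the lines this comment suggests might look like the following (a hypothetical sketch: the function name, signature, and error wording are invented, not Splink's actual validation code):

```python
def validate_link_only_input(link_type, input_tables, source_dataset_column_name=None):
    """Fail fast when a link-only job cannot possibly produce links.

    A single input table is only valid for link_only if the user has
    pointed Splink at an existing source dataset column via
    source_dataset_column_name (hypothetical helper, for illustration).
    """
    if (
        link_type == "link_only"
        and len(input_tables) < 2
        and source_dataset_column_name is None
    ):
        raise ValueError(
            "link_type 'link_only' requires two or more input tables, "
            "or a single table with source_dataset_column_name set"
        )
```

Raising early like this would replace the "random error that isn't immediately obvious" with a clear message at linker construction time.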


@ThomasHepworth ThomasHepworth left a comment


Looks great, thanks!

@RobinL RobinL merged commit 94779ba into master Nov 15, 2023
8 checks passed
@RobinL RobinL deleted the 1711-bug-settings_obj_source_dataset_col-and-settings_obj_source_dataset_input_column branch November 15, 2023 16:26
Development

Successfully merging this pull request may close these issues.

[BUG] settings_obj._source_dataset_col and settings_obj._source_dataset_input_column