Refactor blocking to not need linker #2180

Merged May 16, 2024

59 commits
6c0c1b7
refactor blocking to not need linker
RobinL May 14, 2024
c6fdb0a
refactor estimate u to use new block_using_rules_sqls
RobinL May 14, 2024
a38033c
refactor estimate u to use new block_using_rules_sqls
RobinL May 14, 2024
7eebdcc
test fixes
RobinL May 14, 2024
9220f3c
fix rr tests
RobinL May 14, 2024
196b1fc
fix
RobinL May 14, 2024
3460266
deal with case of no matches
RobinL May 14, 2024
fe93bd4
fix deterministic link test
RobinL May 14, 2024
e8e9d2b
fix unlinkables
RobinL May 14, 2024
36e5eb7
fix tests of efficient join types
RobinL May 14, 2024
309e02d
unlinkables
RobinL May 14, 2024
9d485e0
find matches to new records
RobinL May 14, 2024
a553b5c
m training
RobinL May 15, 2024
4a45484
fix test new db api test
RobinL May 15, 2024
3b0056f
duckdb tests pass
RobinL May 15, 2024
7ddf4e2
start to fix analyse blocking tests
RobinL May 15, 2024
5a35dc3
br tests
RobinL May 15, 2024
701390a
fix test total comparison count
RobinL May 15, 2024
c6ae3ce
more fixes to test analyse blocking
RobinL May 15, 2024
55a1db4
test edge cases
RobinL May 15, 2024
71c9fee
final fixes to test analyse blocking
RobinL May 15, 2024
6b78f9b
fix full example postgres
RobinL May 15, 2024
4468b9d
fix postgres test
RobinL May 15, 2024
004ed96
fix case of no matches returned
RobinL May 15, 2024
2343c18
fix autofixable
RobinL May 15, 2024
6318189
formatting
RobinL May 15, 2024
505ad99
fix mypy errors
RobinL May 15, 2024
96af276
more mypy
RobinL May 15, 2024
84c19cb
fix array explode test
RobinL May 15, 2024
7b28cb0
fix link type options
RobinL May 15, 2024
8c8589f
fix more mypy errors
RobinL May 15, 2024
9d2f82b
fix more mypy stuff
RobinL May 15, 2024
59bdc38
rename arg
RobinL May 15, 2024
bacf875
fix tests
RobinL May 15, 2024
16c8c2c
fix compare two records
RobinL May 15, 2024
a952943
fix linker mypy
RobinL May 15, 2024
1ec0695
all mypy except auto blocking
RobinL May 15, 2024
efa6435
final mypy errors
RobinL May 15, 2024
8a1ff88
rename module to blocking analysis
RobinL May 15, 2024
2a896e6
move files
RobinL May 15, 2024
afe1d9c
aliases for public api
RobinL May 15, 2024
b288849
update blocking notebook
RobinL May 15, 2024
5eb1f5d
tests pass again
RobinL May 15, 2024
75bbe66
deterministic dedupe example
RobinL May 15, 2024
ea1a264
convert more notebooks
RobinL May 15, 2024
f393bd5
fix more notebooks
RobinL May 15, 2024
4b1d009
check source dataset works as intended
RobinL May 15, 2024
1899563
deal with different input types, including single table with source …
RobinL May 16, 2024
95a92ff
fix bugs introduced by none inputcolumn
RobinL May 16, 2024
01b8d6e
add further tests
RobinL May 16, 2024
ef7fdd8
fix tests
RobinL May 16, 2024
acd5074
return type is tuple
RobinL May 16, 2024
062b0f0
mypy passes
RobinL May 16, 2024
4861c7e
update notebooks
RobinL May 16, 2024
4910795
ensure is iterable
RobinL May 16, 2024
4afc9e2
add where filter condition to output
RobinL May 16, 2024
b362b76
fix tests
RobinL May 16, 2024
8290dfb
rename api
RobinL May 16, 2024
a6b3176
rename api
RobinL May 16, 2024

516 changes: 232 additions & 284 deletions docs/demos/examples/duckdb/deduplicate_50k_synthetic.ipynb

Large diffs are not rendered by default.

382 changes: 196 additions & 186 deletions docs/demos/examples/duckdb/deterministic_dedupe.ipynb

Large diffs are not rendered by default.

351 changes: 166 additions & 185 deletions docs/demos/examples/duckdb/febrl3.ipynb

Large diffs are not rendered by default.

8,038 changes: 4,006 additions & 4,032 deletions docs/demos/examples/duckdb/febrl4.ipynb

Large diffs are not rendered by default.

378 changes: 149 additions & 229 deletions docs/demos/examples/duckdb/transactions.ipynb

Large diffs are not rendered by default.

595 changes: 285 additions & 310 deletions docs/demos/examples/sqlite/deduplicate_50k_synthetic.ipynb

Large diffs are not rendered by default.

982 changes: 501 additions & 481 deletions docs/demos/tutorials/03_Blocking.ipynb

Large diffs are not rendered by default.

15 changes: 13 additions & 2 deletions scripts/reduce_notebook_runtime.py
@@ -19,20 +19,31 @@ def modify_notebook(file_path):
         data["cells"] = data["cells"][:19]
         changed = True

+    if "sqlite" in file_path:
+        max_pairs = 3e5
+        head_num = 800
+    else:
+        max_pairs = 1e5
+        head_num = 400
+
     for cell in data["cells"]:
         if cell["cell_type"] == "code":
             source = cell["source"]
             new_source = []
             for line in source:
                 if "splink_datasets" in line and "=" in line:
                     parts = line.split("=")
-                    parts[1] = parts[1].strip() + ".head(400)"
+                    parts[1] = parts[1].strip() + f".head({head_num})"
                     new_line = " = ".join(parts) + "\n"
                     new_source.append(new_line)
                     changed = True
                 elif "estimate_u_using_random_sampling(" in line:
                     new_line = (
-                        re.sub(r"max_pairs=\d+(\.\d+)?[eE]\d+", "max_pairs=1e5", line)
+                        re.sub(
+                            r"max_pairs=\d+(\.\d+)?[eE]\d+",
+                            f"max_pairs={max_pairs}",
+                            line,
+                        )
                         + "\n"
                     )
                     new_source.append(new_line)
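
For readers who want to try the per-backend rewrite from this hunk in isolation, here is a minimal standalone sketch. It is not code from the PR: the helper names `settings_for` and `rewrite_line` are hypothetical, and only the regex, the limits (3e5/800 for paths containing "sqlite", 1e5/400 otherwise) and the two line checks are taken from the diff above.

```python
import re

# Pattern copied from the diff: matches max_pairs values written in
# scientific notation, e.g. max_pairs=1e6 or max_pairs=2.5e7.
MAX_PAIRS_PATTERN = r"max_pairs=\d+(\.\d+)?[eE]\d+"


def settings_for(file_path):
    """Hypothetical helper mirroring the branch added in the diff."""
    if "sqlite" in file_path:
        return 3e5, 800  # (max_pairs, head_num) for SQLite notebooks
    return 1e5, 400  # limits for every other notebook


def rewrite_line(line, file_path):
    """Hypothetical helper applying the two rewrites to one source line."""
    max_pairs, head_num = settings_for(file_path)
    if "splink_datasets" in line and "=" in line:
        # Truncate the loaded dataset by appending .head(N) to the right-hand side.
        parts = line.split("=")
        parts[1] = parts[1].strip() + f".head({head_num})"
        return " = ".join(parts) + "\n"
    if "estimate_u_using_random_sampling(" in line:
        # Replace the scientific-notation literal with the backend-specific limit.
        return re.sub(MAX_PAIRS_PATTERN, f"max_pairs={max_pairs}", line) + "\n"
    return line  # every other line passes through unchanged


if __name__ == "__main__":
    path = "docs/demos/examples/sqlite/deduplicate_50k_synthetic.ipynb"
    print(rewrite_line("linker.estimate_u_using_random_sampling(max_pairs=1e6)", path))
```

One thing this sketch makes visible: the f-string formats the float `3e5` as `300000.0`, so the rewritten call reads `max_pairs=300000.0` rather than `max_pairs=3e5`. The pattern only matches scientific notation, so a second pass over an already rewritten line leaves that value untouched.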
294 changes: 0 additions & 294 deletions splink/analyse_blocking.py

This file was deleted.
