feat: Adds MERGE INTO transform #109

jsibbison-square · 2024-06-27T07:04:06Z

This commit adds the ability to convert snowflakes MERGE INTO functionality into a functional equivalent implementation in duckdb. To do this we need to break apart the WHEN [NOT] MATCHED syntax into separate statements to be executed independently.

This commit only adds the transform, there is more refactoring required in fakes.py in order to handle a transform that transforms a single expression into multiple expressions so it cannot be used yet. I have a change baking for adding it into fakes but wanted to separate PRs.

Let me know if you need more tests or anything as its quite a complicated feature.

This commit adds the ability to convert snowflakes [MERGE INTO](https://docs.snowflake.com/en/sql-reference/sql/merge) functionality into a functional equivalent implementation in duckdb. To do this we need to break apart the WHEN [NOT] MATCHED syntax into separate statements to be executed indepedently. This commit only adds the transform, there is more refactoring required in fakes.py in order to handle a transform that transforms a single expression into multiple expressions.

tekumara · 2024-06-29T13:11:54Z

Thanks for starting this! It's a tricky feature.

I hope you don't mind but I've pushed 40ada94 as a test case to cover. It comes from the Snowflake MERGE docs. It provides good coverage with a mix of delete, update and insert branches with some conditionals:

MERGE INTO t1 USING t2 ON t1.t1Key = t2.t2Key
    WHEN MATCHED AND t2.marked = 1 THEN DELETE
    WHEN MATCHED AND t2.isNewStatus = 1 THEN UPDATE SET val = t2.newVal, status = t2.newStatus
    WHEN MATCHED THEN UPDATE SET val = t2.newVal
    WHEN NOT MATCHED THEN INSERT (t1Key, val, status) VALUES (t2.t2Key, t2.newVal, t2.newStatus);

It also demonstrate that MERGE needs to:

return a result set like this
have branch conditions that are isolated and don't interact. Specifically in this example the last INSERT branch should not re-add rows removed by the DELETE branch.

One way to solve these is to calculate the merge operations up front using the original table and store then as a MERGE_OP column in a temporary table. This ensures the branch conditions work with the state of the original table, not the table after it has already been modified by a DML statement run for a previous branch.

So something like this for the test case above:

CREATE OR REPLACE TEMPORARY TABLE merge_candidates AS
SELECT
    t1.t1Key AS t1_t1Key,
    t1.val AS t1_val,
    t1.status AS t1_status,
    t2.t2Key AS t2_t2Key,
    t2.marked AS t2_marked,
    t2.isNewStatus AS t2_isNewStatus,
    t2.newVal AS t2_newVal,
    t2.newStatus AS t2_newStatus,
    CASE
        WHEN t1.t1Key = t2.t2Key AND t2.marked = 1 THEN 1           -- DELETE
        WHEN t1.t1Key = t2.t2Key and t2.isNewStatus = 1 THEN 2      -- UPDATE with conditional
        WHEN t1.t1Key = t2.t2Key THEN 3                             -- UPDATE the rest
        WHEN t1.t1Key IS NULL THEN 4                                -- INSERT remaining matches
        ELSE 0                                                      -- IGNORE
    END AS MERGE_OP
    FROM t1
FULL OUTER JOIN t2 ON t1.t1Key = t2.t2Key
WHERE MERGE_OP > 0;

This can then be used to drive the DML statements for each branch, and provide the result set, like this:

DELETE FROM t1
USING merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1key
  AND mc.merge_op = 1;

UPDATE t1
SET val = mc.t2_newVal,
    status = mc.t2_newStatus
FROM merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1Key
AND mc.merge_op = 2;

UPDATE t1
SET val = mc.t2_newVal
FROM merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1Key
AND mc.merge_op = 3;

INSERT INTO t1 (t1Key, val, status)
SELECT mc.t2_t2Key, mc.t2_newVal, mc.t2_newStatus
FROM merge_candidates AS mc
WHERE mc.merge_op = 4;

SELECT COUNT_IF(merge_op in (4)) AS "number of rows inserted", COUNT_IF(merge_op in (2,3)) AS "number of rows updated", COUNT_IF(merge_op in(1)) AS "number of rows deleted"
from merge_candidates;

Let me know what you think, and if this makes sense!

jsibbison-square · 2024-06-30T03:47:06Z

Thanks for the detailed feedback, seems merge is a lot more intricate than my usecase of upsert 😂 . I'm on holiday next week so it might be awhile before you see more progress on this. Thanks for the testcase.

* main: chore(main): release 0.9.20 (tekumara#115) fix: $$ not considered a variable fix: concurrent connection write-write conflict refactor: extract test_connect.py feat: SHOW PRIMARY KEYS for table (tekumara#114) chore(main): release 0.9.19 (tekumara#113) feat: Implements basic snowflake session variables via SET/UNSET (tekumara#111) chore(deps-dev): bump pyright from 1.1.366 to 1.1.369 (tekumara#112) chore(main): release 0.9.18 (tekumara#110)

Got basic case for merge working with a merge2 function, needs to be cleaned up, deleting original merge, fix result statement.

jsibbison-square · 2024-07-11T09:49:29Z

Marked PR as WIP while I work through getting this functionality fully running. I've got the new test case running with a second attempt but I've got lots to clean up before its ready for review.

I'll mention you when its ready for review again.

* main: fix: Only set variables for SetItem expressions (tekumara#116)

* main: feat: alter table cluster by chore: cruft update chore: bump sqlglot 25.5.1

tekumara · 2024-08-23T13:13:20Z

How did you go @jsibbison-square? Do you need any help with this?

jazzarati · 2024-08-25T09:02:02Z

How did you go @jsibbison-square? Do you need any help with this?

Hi @tekumara sorry I've run short on time at the moment. This was working as far as I was concerned but I haven't had time to refactor as its a bit messy. Happy for it to be merged as is and refactor one day or you can take over if you want the feature sooner.

* main: (38 commits) feat(server): support time & timestamp types ci(server): description - cover more types feat(server): handle snowflake ProgrammingError refactor: use rowtype for sf metadata in arrow schema feat(server): support empty result set chore(main): release 0.9.24 (tekumara#128) fix: don't require pandas at import time chore(main): release 0.9.23 (tekumara#126) feat: support conn.is_closed() refactor: extract conn, cursor, pandas_tools refactor: describe_as_rowtype uses sf_type refactor: describe_as_rowtype refactor: extract describe_as_result_metadata feat(server): support cur.description feat(server): support bool, int, float types chore(main): release 0.9.22 (tekumara#124) wip(server): add /queries/v1/abort-request fix: fetchmany supports irregular sizes fix: column types for DESCRIBE feat: describe view information_schema.* ...

* main: chore(deps): update ruff requirement from ~=0.5.1 to ~=0.6.3 (tekumara#130) chore(deps-dev): bump pyright from 1.1.374 to 1.1.378 (tekumara#133)

jsibbison-square · 2024-09-05T01:58:00Z

@tekumara I've now got a passing build, isolated merge logic into its own files and added some documentation so I can hope to recall how it works in the future. Please review 🙏

This reverts commit dfa68e1. because it doesn't work when the alias is for a select expression eg: tekumara#24 (comment)

tekumara · 2024-09-08T10:55:47Z

Thanks so much @jsibbison-square for this significant piece of work. I've pushed some refactoring to use pure functions rather than a class, as there's no permanent state to capture, and turned your comment documenting the sql output into a test.

I'm also considering whether it's worth refactoring this to have a single merge_candidates table and statement to generate the merge operations as the comment above, rather than the two tables and statement per when clause you have, to simplify the logic and the generated sql, unless you suspect it needs to be as is? 🤔

The main thing blocking merging this though, is it doesn't yet solve for a USING clause with a SELECT expression. See the example here. I'm not expecting you to solve this btw, I'm happy to noodle on it.

tekumara · 2024-09-15T12:18:16Z

Thanks again @jsibbison-square - I've merged this and will address the above in a follow-up PR.

jazzarati · 2024-09-15T12:35:06Z

I'm also considering whether it's worth refactoring this to have a single merge_candidates table and statement to generate the merge operations as the #109 (comment), rather than the two tables and statement per when clause you have, to simplify the logic and the generated sql, unless you suspect it needs to be as is? 🤔

I did start with one table but then went with two, I don't recall exactly why sadly.

Thanks for merging in its partial support form, we have a lot of usecases that are the most basic of merges so it'll be great to start using.

tekumara · 2024-09-16T10:56:08Z

Glad to hear you'll be using this soon. I've raised #136

…136) Supports MERGE with - multiple join columns, that may have the same name - when the source is a subquery To do this, I've refactored the way merge works to align with the [approach suggested here](#109 (comment)), ie: - use a single `merge_candidates` temporary table to map rows to mutation operations (delete, update, insert) - separate candidate table creation, mutation operations (delete, update, insert), and counts into separate functions, rather than co-mingling this functionality, for better readability - use join columns rather than rowids, and avoid using a transaction because duckdb doesn't support nested transactions so the current implementation will error if an existing transaction is active resolves #24

🤖 I have created a release *beep* *boop* --- ## [0.9.25](v0.9.24...v0.9.25) (2024-09-16) ### Features * Adds MERGE INTO transform ([#109](#109)) ([d5e14a7](d5e14a7)) * close duckdb connection ([223f8e2](223f8e2)) * **server:** handle snowflake ProgrammingError ([9455a43](9455a43)) * **server:** support empty result set ([b967b69](b967b69)) * **server:** support FAKESNOW_DB_PATH ([af79f77](af79f77)) * **server:** support time & timestamp types ([1606a3e](1606a3e)) * support MERGE with multiple join columns and source subqueries ([#136](#136)) ([9b5a7a0](9b5a7a0)), closes [#24](#24) ### Chores * **deps-dev:** bump pyright from 1.1.374 to 1.1.378 ([#133](#133)) ([593a420](593a420)) * **deps:** update ruff requirement from ~=0.5.1 to ~=0.6.3 ([#130](#130)) ([6b37d8b](6b37d8b)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: potatobot-prime[bot] <132267321+potatobot-prime[bot]@users.noreply.github.com>

jsibbison-square force-pushed the jsibbison-20240627-merge-into branch 3 times, most recently from b7b996b to 3d2ce14 Compare June 27, 2024 10:39

jsibbison-square force-pushed the jsibbison-20240627-merge-into branch from 3d2ce14 to c842900 Compare June 27, 2024 11:06

tekumara added 2 commits June 29, 2024 10:44

Merge branch 'main' into jsibbison-20240627-merge-into

ab86bf1

add test_merge using test case from snowflake docs

40ada94

jsibbison-square added 2 commits July 11, 2024 13:06

WIP: working on merge

2eecaea

Got basic case for merge working with a merge2 function, needs to be cleaned up, deleting original merge, fix result statement.

jsibbison-square changed the title ~~feat: Adds MERGE INTO transform~~ WIP: feat: Adds MERGE INTO transform Jul 11, 2024

jsibbison-square added 5 commits July 12, 2024 07:37

Merge branch 'main' into jsibbison-20240627-merge-into

d5d72a2

* main: fix: Only set variables for SetItem expressions (tekumara#116)

implements the return values for merge statement

a580635

Removes old merge implementation

cd38095

Merge branch 'main' into jsibbison-20240627-merge-into

0cbeb50

* main: feat: alter table cluster by chore: cruft update chore: bump sqlglot 25.5.1

Adds test for NOT MATCHED INSERT with an AND condition

4310802

tekumara force-pushed the main branch from ffb47fa to 1606a3e Compare August 26, 2024 01:20

jsibbison-square added 9 commits August 28, 2024 20:01

Ruff fixes

2ded345

ruff format

71361fd

fix typos

dd8923c

Fix pyright in cursor.py

533ca68

Extract merge transform to separate file for further refactoring

8f19190

Extract some outer expressions

d5c3e88

extract remove_table_alias helper

de6385b

Extract temp merge insertion func

d85a7b2

jsibbison-square added 8 commits September 5, 2024 09:28

Merge branch 'main' into jsibbison-20240627-merge-into

b39e2e5

* main: chore(deps): update ruff requirement from ~=0.5.1 to ~=0.6.3 (tekumara#130) chore(deps-dev): bump pyright from 1.1.374 to 1.1.378 (tekumara#133)

Extract merge tests from test_fakes

4cd831c

Add documentation to the contents of the merge transformation

71457e8

some ruff fixes

6905ac3

more ruff fixes

7cf899a

pyright fixes

ac83fed

ruff format fix

97f6e2d

remove todo

6763398

Remove debugging print

0c71782

jsibbison-square changed the title ~~WIP: feat: Adds MERGE INTO transform~~ feat: Adds MERGE INTO transform Sep 5, 2024

tekumara added 4 commits September 7, 2024 21:30

add transform test

3e3e9c3

refactor class to pure functions

63948c1

unalias

dfa68e1

Revert "unalias"

458b545

This reverts commit dfa68e1. because it doesn't work when the alias is for a select expression eg: tekumara#24 (comment)

tekumara merged commit d5e14a7 into tekumara:main Sep 15, 2024
1 check passed

potatobot-prime bot mentioned this pull request Sep 7, 2024

chore(main): release 0.9.25 #129

Merged

tekumara mentioned this pull request Sep 16, 2024

feat: support MERGE with multiple join columns and source subqueries #136

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Adds MERGE INTO transform #109

feat: Adds MERGE INTO transform #109

jsibbison-square commented Jun 27, 2024

tekumara commented Jun 29, 2024 •

edited

Loading

jsibbison-square commented Jun 30, 2024

jsibbison-square commented Jul 11, 2024

tekumara commented Aug 23, 2024

jazzarati commented Aug 25, 2024

jsibbison-square commented Sep 5, 2024

tekumara commented Sep 8, 2024 •

edited

Loading

tekumara commented Sep 15, 2024

jazzarati commented Sep 15, 2024

tekumara commented Sep 16, 2024

feat: Adds MERGE INTO transform #109

feat: Adds MERGE INTO transform #109

Conversation

jsibbison-square commented Jun 27, 2024

tekumara commented Jun 29, 2024 • edited Loading

jsibbison-square commented Jun 30, 2024

jsibbison-square commented Jul 11, 2024

tekumara commented Aug 23, 2024

jazzarati commented Aug 25, 2024

jsibbison-square commented Sep 5, 2024

tekumara commented Sep 8, 2024 • edited Loading

tekumara commented Sep 15, 2024

jazzarati commented Sep 15, 2024

tekumara commented Sep 16, 2024

tekumara commented Jun 29, 2024 •

edited

Loading

tekumara commented Sep 8, 2024 •

edited

Loading