Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Adds MERGE INTO transform #109

Merged
merged 32 commits into from
Sep 15, 2024

Conversation

jsibbison-square
Copy link
Contributor

This commit adds the ability to convert snowflakes MERGE INTO functionality into a functional equivalent implementation in duckdb. To do this we need to break apart the WHEN [NOT] MATCHED syntax into separate statements to be executed independently.

This commit only adds the transform, there is more refactoring required in fakes.py in order to handle a transform that transforms a single expression into multiple expressions so it cannot be used yet. I have a change baking for adding it into fakes but wanted to separate PRs.

Let me know if you need more tests or anything as its quite a complicated feature.

@jsibbison-square jsibbison-square force-pushed the jsibbison-20240627-merge-into branch 3 times, most recently from b7b996b to 3d2ce14 Compare June 27, 2024 10:39
This commit adds the ability to convert snowflakes [MERGE INTO](https://docs.snowflake.com/en/sql-reference/sql/merge) functionality into a functional equivalent implementation in duckdb.
To do this we need to break apart the WHEN [NOT] MATCHED syntax into separate statements to be executed indepedently.

This commit only adds the transform, there is more refactoring required in fakes.py in order to handle a transform
that transforms a single expression into multiple expressions.
@tekumara
Copy link
Owner

tekumara commented Jun 29, 2024

Thanks for starting this! It's a tricky feature.

I hope you don't mind but I've pushed 40ada94 as a test case to cover. It comes from the Snowflake MERGE docs. It provides good coverage with a mix of delete, update and insert branches with some conditionals:

MERGE INTO t1 USING t2 ON t1.t1Key = t2.t2Key
    WHEN MATCHED AND t2.marked = 1 THEN DELETE
    WHEN MATCHED AND t2.isNewStatus = 1 THEN UPDATE SET val = t2.newVal, status = t2.newStatus
    WHEN MATCHED THEN UPDATE SET val = t2.newVal
    WHEN NOT MATCHED THEN INSERT (t1Key, val, status) VALUES (t2.t2Key, t2.newVal, t2.newStatus);

It also demonstrate that MERGE needs to:

  1. return a result set like this
  2. have branch conditions that are isolated and don't interact. Specifically in this example the last INSERT branch should not re-add rows removed by the DELETE branch.

One way to solve these is to calculate the merge operations up front using the original table and store then as a MERGE_OP column in a temporary table. This ensures the branch conditions work with the state of the original table, not the table after it has already been modified by a DML statement run for a previous branch.

So something like this for the test case above:

CREATE OR REPLACE TEMPORARY TABLE merge_candidates AS
SELECT
    t1.t1Key AS t1_t1Key,
    t1.val AS t1_val,
    t1.status AS t1_status,
    t2.t2Key AS t2_t2Key,
    t2.marked AS t2_marked,
    t2.isNewStatus AS t2_isNewStatus,
    t2.newVal AS t2_newVal,
    t2.newStatus AS t2_newStatus,
    CASE
        WHEN t1.t1Key = t2.t2Key AND t2.marked = 1 THEN 1           -- DELETE
        WHEN t1.t1Key = t2.t2Key and t2.isNewStatus = 1 THEN 2      -- UPDATE with conditional
        WHEN t1.t1Key = t2.t2Key THEN 3                             -- UPDATE the rest
        WHEN t1.t1Key IS NULL THEN 4                                -- INSERT remaining matches
        ELSE 0                                                      -- IGNORE
    END AS MERGE_OP
    FROM t1
FULL OUTER JOIN t2 ON t1.t1Key = t2.t2Key
WHERE MERGE_OP > 0;

This can then be used to drive the DML statements for each branch, and provide the result set, like this:

DELETE FROM t1
USING merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1key
  AND mc.merge_op = 1;

UPDATE t1
SET val = mc.t2_newVal,
    status = mc.t2_newStatus
FROM merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1Key
AND mc.merge_op = 2;

UPDATE t1
SET val = mc.t2_newVal
FROM merge_candidates AS mc
WHERE t1.t1Key = mc.t1_t1Key
AND mc.merge_op = 3;

INSERT INTO t1 (t1Key, val, status)
SELECT mc.t2_t2Key, mc.t2_newVal, mc.t2_newStatus
FROM merge_candidates AS mc
WHERE mc.merge_op = 4;

SELECT COUNT_IF(merge_op in (4)) AS "number of rows inserted", COUNT_IF(merge_op in (2,3)) AS "number of rows updated", COUNT_IF(merge_op in(1)) AS "number of rows deleted"
from merge_candidates;

Let me know what you think, and if this makes sense!

@jsibbison-square
Copy link
Contributor Author

Thanks for the detailed feedback, seems merge is a lot more intricate than my usecase of upsert 😂 . I'm on holiday next week so it might be awhile before you see more progress on this. Thanks for the testcase.

* main:
  chore(main): release 0.9.20 (tekumara#115)
  fix: $$  not considered a variable
  fix: concurrent connection write-write conflict
  refactor: extract test_connect.py
  feat: SHOW PRIMARY KEYS for table (tekumara#114)
  chore(main): release 0.9.19 (tekumara#113)
  feat: Implements basic snowflake session variables via SET/UNSET (tekumara#111)
  chore(deps-dev): bump pyright from 1.1.366 to 1.1.369 (tekumara#112)
  chore(main): release 0.9.18 (tekumara#110)
Got basic case for merge working with a merge2 function, needs to be cleaned up, deleting original merge, fix result statement.
@jsibbison-square jsibbison-square changed the title feat: Adds MERGE INTO transform WIP: feat: Adds MERGE INTO transform Jul 11, 2024
@jsibbison-square
Copy link
Contributor Author

Marked PR as WIP while I work through getting this functionality fully running. I've got the new test case running with a second attempt but I've got lots to clean up before its ready for review.

I'll mention you when its ready for review again.

@tekumara
Copy link
Owner

How did you go @jsibbison-square? Do you need any help with this?

@jazzarati
Copy link

How did you go @jsibbison-square? Do you need any help with this?

Hi @tekumara sorry I've run short on time at the moment. This was working as far as I was concerned but I haven't had time to refactor as its a bit messy. Happy for it to be merged as is and refactor one day or you can take over if you want the feature sooner.

* main: (38 commits)
  feat(server): support time & timestamp types
  ci(server): description - cover more types
  feat(server): handle snowflake ProgrammingError
  refactor: use rowtype for sf metadata in arrow schema
  feat(server): support empty result set
  chore(main): release 0.9.24 (tekumara#128)
  fix: don't require pandas at import time
  chore(main): release 0.9.23 (tekumara#126)
  feat: support conn.is_closed()
  refactor: extract conn, cursor, pandas_tools
  refactor: describe_as_rowtype uses sf_type
  refactor: describe_as_rowtype
  refactor: extract describe_as_result_metadata
  feat(server): support cur.description
  feat(server): support bool, int, float types
  chore(main): release 0.9.22 (tekumara#124)
  wip(server): add /queries/v1/abort-request
  fix: fetchmany supports irregular sizes
  fix: column types for DESCRIBE
  feat: describe view information_schema.*
  ...
@jsibbison-square
Copy link
Contributor Author

@tekumara I've now got a passing build, isolated merge logic into its own files and added some documentation so I can hope to recall how it works in the future. Please review 🙏

@jsibbison-square jsibbison-square changed the title WIP: feat: Adds MERGE INTO transform feat: Adds MERGE INTO transform Sep 5, 2024
This reverts commit dfa68e1.

because it doesn't work when the alias is for a select expression

eg: tekumara#24 (comment)
@tekumara
Copy link
Owner

tekumara commented Sep 8, 2024

Thanks so much @jsibbison-square for this significant piece of work. I've pushed some refactoring to use pure functions rather than a class, as there's no permanent state to capture, and turned your comment documenting the sql output into a test.

I'm also considering whether it's worth refactoring this to have a single merge_candidates table and statement to generate the merge operations as the comment above, rather than the two tables and statement per when clause you have, to simplify the logic and the generated sql, unless you suspect it needs to be as is? 🤔

The main thing blocking merging this though, is it doesn't yet solve for a USING clause with a SELECT expression. See the example here. I'm not expecting you to solve this btw, I'm happy to noodle on it.

@tekumara tekumara merged commit d5e14a7 into tekumara:main Sep 15, 2024
1 check passed
@tekumara
Copy link
Owner

Thanks again @jsibbison-square - I've merged this and will address the above in a follow-up PR.

@jazzarati
Copy link

I'm also considering whether it's worth refactoring this to have a single merge_candidates table and statement to generate the merge operations as the #109 (comment), rather than the two tables and statement per when clause you have, to simplify the logic and the generated sql, unless you suspect it needs to be as is? 🤔

I did start with one table but then went with two, I don't recall exactly why sadly.

Thanks for merging in its partial support form, we have a lot of usecases that are the most basic of merges so it'll be great to start using.

@tekumara
Copy link
Owner

Glad to hear you'll be using this soon. I've raised #136

tekumara added a commit that referenced this pull request Sep 16, 2024
…136)

Supports MERGE with
- multiple join columns, that may have the same name
- when the source is a subquery

To do this, I've refactored the way merge works to align with the
[approach suggested
here](#109 (comment)),
ie:

- use a single `merge_candidates` temporary table to map rows to
mutation operations (delete, update, insert)
- separate candidate table creation, mutation operations (delete,
update, insert), and counts into separate functions, rather than
co-mingling this functionality, for better readability
- use join columns rather than rowids, and avoid using a transaction
because duckdb doesn't support nested transactions so the current
implementation will error if an existing transaction is active

resolves #24
tekumara pushed a commit that referenced this pull request Sep 16, 2024
🤖 I have created a release *beep* *boop*
---


##
[0.9.25](v0.9.24...v0.9.25)
(2024-09-16)


### Features

* Adds MERGE INTO transform
([#109](#109))
([d5e14a7](d5e14a7))
* close duckdb connection
([223f8e2](223f8e2))
* **server:** handle snowflake ProgrammingError
([9455a43](9455a43))
* **server:** support empty result set
([b967b69](b967b69))
* **server:** support FAKESNOW_DB_PATH
([af79f77](af79f77))
* **server:** support time & timestamp types
([1606a3e](1606a3e))
* support MERGE with multiple join columns and source subqueries
([#136](#136))
([9b5a7a0](9b5a7a0)),
closes [#24](#24)


### Chores

* **deps-dev:** bump pyright from 1.1.374 to 1.1.378
([#133](#133))
([593a420](593a420))
* **deps:** update ruff requirement from ~=0.5.1 to ~=0.6.3
([#130](#130))
([6b37d8b](6b37d8b))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: potatobot-prime[bot] <132267321+potatobot-prime[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants