SNOW-1060064: Eliminate repeated subquery using CTE (part 1, with original query generation framework) #1274

sfc-gh-jdu · 2024-02-26T23:14:05Z

Please answer these questions before submitting your pull requests. Thanks!

What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

Fixes SNOW-1060064
Fill out the following pre-review checklist:
- I am adding a new automated test(s) to verify correctness of my new code
- I am adding new logging messages
- I am adding a new telemetry message
- I am adding new credentials
- I am adding a new dependency
Please describe how your code solves the related issue.

This PR adds an optimization to eliminate repeated subquery using CTE (see details in https://docs.google.com/document/d/1vVUYqLeD_nQRVaH3SX2c4jDNJoBGHJlg8PqiJ8qxgVM/edit), with original query generation framework.

The basic idea is to convert repeated subqueries to a CTE, e.g.,
( SELECT * FROM ( SELECT * FROM ( SELECT "A", "B" FROM ( SELECT $1 AS "A", $2 AS "B" FROM VALUES (1 :: INT, 2 :: INT), (3 :: INT, 4 :: INT))) WHERE ("A" = 1 :: INT))) UNION ALL ( SELECT * FROM ( SELECT * FROM ( SELECT "A", "B" FROM ( SELECT $1 AS "A", $2 AS "B" FROM VALUES (1 :: INT, 2 :: INT), (3 :: INT, 4 :: INT))) WHERE ("A" = 1 :: INT)))
will be converted to
WITH SNOWPARK_TEMP_TABLE_SD2LMA9GKT AS ( SELECT * FROM ( SELECT * FROM ( SELECT "A", "B" FROM ( SELECT $1 AS "A", $2 AS "B" FROM VALUES (1 :: INT, 2 :: INT), (3 :: INT, 4 :: INT))) WHERE ("A" = 1 :: INT))) ( SELECT * FROM (SNOWPARK_TEMP_TABLE_SD2LMA9GKT)) UNION ALL ( SELECT * FROM (SNOWPARK_TEMP_TABLE_SD2LMA9GKT))

src/snowflake/snowpark/_internal/analyzer/snowflake_plan.py

src/snowflake/snowpark/_internal/analyzer/snowflake_plan_node.py

src/snowflake/snowpark/session.py

sfc-gh-aalam · 2024-03-01T18:55:50Z

src/snowflake/snowpark/_internal/analyzer/snowflake_plan.py

+        with_stmt = cte_statement(
+            list(duplicate_plan_to_cte_map.values()),
+            list(duplicate_plan_to_table_name_map.values()),
+        )
+        final_query = with_stmt + SPACE + plan_to_query_map[self]


since we are finding the topmost duplicate node, we don't need to concern ourselves with finding the correct order, right. This is cool

Yeah, because we build duplicate_plan_to_cte_map in post order, where children node is ahead of parent node. It satisfies the requirement of with clause:

A WITH clause can refer recursively to itself, and to other CTEs that appear earlier in the same clause. For instance, cte_name2 can refer to cte_name1 and itself, while cte_name1 can refer to itself, but not to cte_name2.

sfc-gh-yixie · 2024-03-04T17:43:12Z

src/snowflake/snowpark/_internal/analyzer/analyzer_utils.py

+
+def cte_statement(queries: List[str], table_names: List[str]) -> str:
+    return WITH + COMMA.join(
+        table_name + AS + LEFT_PARENTHESIS + query + RIGHT_PARENTHESIS


Better use string format

src/snowflake/snowpark/_internal/analyzer/snowflake_plan.py

sfc-gh-yixie · 2024-03-04T17:59:35Z

src/snowflake/snowpark/_internal/analyzer/snowflake_plan.py

+            This function is used to only include nodes that should be converted to CTEs.
+            """
+            node_count_map = defaultdict(int)
+            node_parents_map = {root: set()}


I feel a defaultdict is good for this case.

sfc-gh-yixie · 2024-03-04T18:02:20Z

src/snowflake/snowpark/_internal/analyzer/snowflake_plan.py

+            def traverse(node: SnowflakePlan) -> None:
+                node_count_map[node] += 1
+                if node.source_plan and node.source_plan.children:
+                    for child in node.source_plan.children:
+                        if child not in node_parents_map:
+                            node_parents_map[child] = {node}
+                        else:
+                            node_parents_map[child].add(node)
+                        traverse(child)
+
+            def is_duplicate_subtree(node: SnowflakePlan) -> bool:
+                is_duplicate_node = node_count_map[node] > 1
+                if is_duplicate_node:
+                    is_any_parent_unique_node = any(
+                        node_count_map[n] == 1 for n in node_parents_map[node]
+                    )
+                    if is_any_parent_unique_node:
+                        return True
+                    else:
+                        has_multi_parents = len(node_parents_map[node]) > 1
+                        if has_multi_parents:
+                            return True
+                return False


How about adding some unit test code?

Yes, very good point, added

src/snowflake/snowpark/_internal/analyzer/snowflake_plan.py

sfc-gh-yixie · 2024-03-04T18:04:48Z

src/snowflake/snowpark/_internal/analyzer/snowflake_plan.py

+        duplicate_plan_to_cte_map = {}
+        duplicate_plan_to_table_name_map = {}
+
+        def build_plan_to_query_map_in_post_order(node: SnowflakePlan) -> None:


Same here. Unit tests.

I found it's difficult/not worthwhile to write unit test for this function because what we're going to test is the generate sql, and I need to create many fake queries in unit tests. It's already tested in integration tests. But I moved them to a util file, which is more clear to review the code

sfc-gh-yixie · 2024-03-04T18:05:36Z

src/snowflake/snowpark/_internal/analyzer/snowflake_plan.py

@@ -272,15 +416,16 @@ def output_dict(self) -> Dict[str, Any]:

    def __copy__(self) -> "SnowflakePlan":
        return SnowflakePlan(
-            self.queries.copy() if self.queries else [],
+            copy.deepcopy(self.queries) if self.queries else [],


Why is deepcopy required here?

because self.queries contains Query object, instead of query string, which can be used in original query plan object

sfc-gh-jdu requested review from sfc-gh-yixie and sfc-gh-aalam February 26, 2024 23:14

sfc-gh-jdu requested a review from a team as a code owner February 26, 2024 23:14

sfc-gh-jdu requested a review from sfc-gh-jrose February 26, 2024 23:14

sfc-gh-jdu commented Feb 26, 2024

View reviewed changes

src/snowflake/snowpark/_internal/analyzer/snowflake_plan.py Outdated Show resolved Hide resolved

sfc-gh-aalam reviewed Feb 27, 2024

View reviewed changes

src/snowflake/snowpark/_internal/analyzer/snowflake_plan_node.py Outdated Show resolved Hide resolved

sfc-gh-jdu force-pushed the jdu-SNOW-1060064-cte-original branch from c6fc3de to ba5220f Compare February 28, 2024 23:30

sfc-gh-jdu requested a review from sfc-gh-aalam February 28, 2024 23:31

sfc-gh-jdu added the NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md label Feb 28, 2024

sfc-gh-aalam reviewed Feb 29, 2024

View reviewed changes

src/snowflake/snowpark/session.py Show resolved Hide resolved

sfc-gh-aalam reviewed Mar 1, 2024

View reviewed changes

sfc-gh-aalam approved these changes Mar 1, 2024

View reviewed changes

sfc-gh-yixie reviewed Mar 4, 2024

View reviewed changes

sfc-gh-yixie approved these changes Mar 5, 2024

View reviewed changes

sfc-gh-jdu added 6 commits March 5, 2024 13:46

init

16def9e

fix bind query

85c5466

another case

c3832a7

fix

73650b7

address comment

e97a05a

copy

6fcf71f

sfc-gh-jdu force-pushed the jdu-SNOW-1060064-cte-original branch from 2b60dee to 6fcf71f Compare March 5, 2024 21:46

sfc-gh-jdu enabled auto-merge (squash) March 6, 2024 00:31

sfc-gh-jdu merged commit 3e34c73 into main Mar 6, 2024
54 checks passed

sfc-gh-jdu deleted the jdu-SNOW-1060064-cte-original branch March 6, 2024 00:33

github-actions bot locked and limited conversation to collaborators Mar 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SNOW-1060064: Eliminate repeated subquery using CTE (part 1, with original query generation framework) #1274

SNOW-1060064: Eliminate repeated subquery using CTE (part 1, with original query generation framework) #1274

sfc-gh-jdu commented Feb 26, 2024 •

edited

Loading

sfc-gh-aalam Mar 1, 2024

sfc-gh-jdu Mar 1, 2024

sfc-gh-yixie Mar 4, 2024

sfc-gh-jdu Mar 5, 2024

sfc-gh-yixie Mar 4, 2024

sfc-gh-jdu Mar 5, 2024

sfc-gh-yixie Mar 4, 2024

sfc-gh-jdu Mar 5, 2024

sfc-gh-yixie Mar 4, 2024

sfc-gh-jdu Mar 5, 2024

sfc-gh-yixie Mar 4, 2024

sfc-gh-jdu Mar 4, 2024

SNOW-1060064: Eliminate repeated subquery using CTE (part 1, with original query generation framework) #1274

SNOW-1060064: Eliminate repeated subquery using CTE (part 1, with original query generation framework) #1274

Conversation

sfc-gh-jdu commented Feb 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sfc-gh-jdu commented Feb 26, 2024 •

edited

Loading