Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1060064: Eliminate repeated subquery using CTE (part 1, with original query generation framework) #1274

Merged
merged 6 commits into from
Mar 6, 2024

Conversation

sfc-gh-jdu
Copy link
Collaborator

@sfc-gh-jdu sfc-gh-jdu commented Feb 26, 2024

Please answer these questions before submitting your pull requests. Thanks!

  1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1060064

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
  3. Please describe how your code solves the related issue.

    This PR adds an optimization to eliminate repeated subquery using CTE (see details in https://docs.google.com/document/d/1vVUYqLeD_nQRVaH3SX2c4jDNJoBGHJlg8PqiJ8qxgVM/edit), with original query generation framework.

The basic idea is to convert repeated subqueries to a CTE, e.g.,
( SELECT * FROM ( SELECT * FROM ( SELECT "A", "B" FROM ( SELECT $1 AS "A", $2 AS "B" FROM VALUES (1 :: INT, 2 :: INT), (3 :: INT, 4 :: INT))) WHERE ("A" = 1 :: INT))) UNION ALL ( SELECT * FROM ( SELECT * FROM ( SELECT "A", "B" FROM ( SELECT $1 AS "A", $2 AS "B" FROM VALUES (1 :: INT, 2 :: INT), (3 :: INT, 4 :: INT))) WHERE ("A" = 1 :: INT)))
will be converted to
WITH SNOWPARK_TEMP_TABLE_SD2LMA9GKT AS ( SELECT * FROM ( SELECT * FROM ( SELECT "A", "B" FROM ( SELECT $1 AS "A", $2 AS "B" FROM VALUES (1 :: INT, 2 :: INT), (3 :: INT, 4 :: INT))) WHERE ("A" = 1 :: INT))) ( SELECT * FROM (SNOWPARK_TEMP_TABLE_SD2LMA9GKT)) UNION ALL ( SELECT * FROM (SNOWPARK_TEMP_TABLE_SD2LMA9GKT))

@sfc-gh-jdu sfc-gh-jdu requested a review from a team as a code owner February 26, 2024 23:14
@sfc-gh-jdu sfc-gh-jdu force-pushed the jdu-SNOW-1060064-cte-original branch from c6fc3de to ba5220f Compare February 28, 2024 23:30
@sfc-gh-jdu sfc-gh-jdu added the NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md label Feb 28, 2024
Comment on lines 357 to 361
with_stmt = cte_statement(
list(duplicate_plan_to_cte_map.values()),
list(duplicate_plan_to_table_name_map.values()),
)
final_query = with_stmt + SPACE + plan_to_query_map[self]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are finding the topmost duplicate node, we don't need to concern ourselves with finding the correct order, right. This is cool

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, because we build duplicate_plan_to_cte_map in post order, where children node is ahead of parent node. It satisfies the requirement of with clause:

A WITH clause can refer recursively to itself, and to other CTEs that appear earlier in the same clause. For instance, cte_name2 can refer to cte_name1 and itself, while cte_name1 can refer to itself, but not to cte_name2.


def cte_statement(queries: List[str], table_names: List[str]) -> str:
return WITH + COMMA.join(
table_name + AS + LEFT_PARENTHESIS + query + RIGHT_PARENTHESIS
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better use string format

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

This function is used to only include nodes that should be converted to CTEs.
"""
node_count_map = defaultdict(int)
node_parents_map = {root: set()}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel a defaultdict is good for this case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 285 to 307
def traverse(node: SnowflakePlan) -> None:
node_count_map[node] += 1
if node.source_plan and node.source_plan.children:
for child in node.source_plan.children:
if child not in node_parents_map:
node_parents_map[child] = {node}
else:
node_parents_map[child].add(node)
traverse(child)

def is_duplicate_subtree(node: SnowflakePlan) -> bool:
is_duplicate_node = node_count_map[node] > 1
if is_duplicate_node:
is_any_parent_unique_node = any(
node_count_map[n] == 1 for n in node_parents_map[node]
)
if is_any_parent_unique_node:
return True
else:
has_multi_parents = len(node_parents_map[node]) > 1
if has_multi_parents:
return True
return False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding some unit test code?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, very good point, added

duplicate_plan_to_cte_map = {}
duplicate_plan_to_table_name_map = {}

def build_plan_to_query_map_in_post_order(node: SnowflakePlan) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Unit tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it's difficult/not worthwhile to write unit test for this function because what we're going to test is the generate sql, and I need to create many fake queries in unit tests. It's already tested in integration tests. But I moved them to a util file, which is more clear to review the code

@@ -272,15 +416,16 @@ def output_dict(self) -> Dict[str, Any]:

def __copy__(self) -> "SnowflakePlan":
return SnowflakePlan(
self.queries.copy() if self.queries else [],
copy.deepcopy(self.queries) if self.queries else [],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is deepcopy required here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because self.queries contains Query object, instead of query string, which can be used in original query plan object

@sfc-gh-jdu sfc-gh-jdu force-pushed the jdu-SNOW-1060064-cte-original branch from 2b60dee to 6fcf71f Compare March 5, 2024 21:46
@sfc-gh-jdu sfc-gh-jdu enabled auto-merge (squash) March 6, 2024 00:31
@sfc-gh-jdu sfc-gh-jdu merged commit 3e34c73 into main Mar 6, 2024
54 checks passed
@sfc-gh-jdu sfc-gh-jdu deleted the jdu-SNOW-1060064-cte-original branch March 6, 2024 00:33
@github-actions github-actions bot locked and limited conversation to collaborators Mar 6, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants