Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CH] Duplicated expression evaluation in project before aggregate operator due to PullOutPreProject Rule #8183

Open
taiyang-li opened this issue Dec 9, 2024 · 8 comments · May be fixed by #8295
Assignees
Labels
bug Something isn't working triage

Comments

@taiyang-li
Copy link
Contributor

taiyang-li commented Dec 9, 2024

Backend

CH (ClickHouse)

Bug description

In below query, expr if(id % 2 = 0, id+3, id+4) was evaluated twice both in n3 and _pre_5. I find that _pre_5 is introduced by PullOutPreProject in #4213. cc @PHILO-HE @zhztheplayer can you find similar issue in Velox?

explain extended 
select max(n1), max(n2), sum(if(n1+n2+n3 % 2 = 0, 1, 0)) from (select id+1 as n1, id+2 as n2, if(id % 2 = 0, id+3, id+4) as n3 from range(10)) 
== Optimized Logical Plan ==
Aggregate [max(n1#375L) AS max(n1)#386L, max(n2#376L) AS max(n2)#387L, sum(if ((((n1#375L + n2#376L) + (n3#377L % 2)) = 0)) 1 else 0) AS sum((IF((((n1 + n2) + (n3 % 2)) = 0), 1, 0)))#389L]
+- Project [(id#383L + 1) AS n1#375L, (id#383L + 2) AS n2#376L, if (((id#383L % 2) = 0)) (id#383L + 3) else (id#383L + 4) AS n3#377L]
   +- Range (0, 10, step=1, splits=None)

== Physical Plan ==
CHNativeColumnarToRow
+- ^(9) HashAggregateTransformer(keys=[], functions=[max(n1#375L), max(n2#376L), sum(_pre_5#396)], isStreamingAgg=false, output=[max(n1)#386L, max(n2)#387L, sum((IF((((n1 + n2) + (n3 % 2)) = 0), 1, 0)))#389L])
   +- ^(9) ProjectExecTransformer [(id#383L + 1) AS n1#375L, (id#383L + 2) AS n2#376L, if (((id#383L % 2) = 0)) (id#383L + 3) else (id#383L + 4) AS n3#377L, if (((((id#383L + 1) + (id#383L + 2)) + (if (((id#383L % 2) = 0)) (id#383L + 3) else (id#383L + 4) % 2)) = 0)) 1 else 0 AS _pre_5#396]
      +- ^(9) InputIteratorTransformer[id#383L]
         +- RowToCHNativeColumnar
            +- *(1) Range (0, 10, step=1, splits=1)

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

@taiyang-li taiyang-li added bug Something isn't working triage labels Dec 9, 2024
@taiyang-li taiyang-li changed the title [CH] Duplicated expression evaluation caused by PullOutPreProject Rule [CH] Duplicated expression evaluation in project before aggregate operator caused by PullOutPreProject Rule Dec 9, 2024
@taiyang-li taiyang-li changed the title [CH] Duplicated expression evaluation in project before aggregate operator caused by PullOutPreProject Rule [CH] Duplicated expression evaluation in project before aggregate operator due to PullOutPreProject Rule Dec 9, 2024
@PHILO-HE
Copy link
Contributor

@taiyang-li, just tested Velox backend. It has the same issue.

(5) ProjectExecTransformer
Output [4]: [(id#3036L + cast(1 as bigint)) AS n1#3033L, (id#3036L + cast(2 as bigint)) AS n2#3034L, if (((id#3036L % cast(2 as bigint)) = cast(0 as bigint))) (id#3036L + cast(3 as bigint)) else (id#3036L + cast(4 as bigint)) AS n3#3035L, if (((((id#3036L + cast(1 as bigint)) + (id#3036L + cast(2 as bigint))) + (if (((id#3036L % cast(2 as bigint)) = cast(0 as bigint))) (id#3036L + cast(3 as bigint)) else (id#3036L + cast(4 as bigint)) % cast(2 as bigint))) = cast(0 as bigint))) 1 else 0 AS _pre_0#3065]
Input [1]: [id#3036L]

@PHILO-HE
Copy link
Contributor

@liujiayi771, could you help take a look?

@liujiayi771
Copy link
Contributor

I will take a look.

@taiyang-li
Copy link
Contributor Author

taiyang-li commented Dec 20, 2024

@liujiayi771 in gluten we already had a rule CommonSubexpressionEliminateRule to eliminate common subexpressions, but it is applied to logical plan. Is there any possibility that PullOutPreProject rule applys to logical plan instead of physical plan.

@liujiayi771
Copy link
Contributor

@taiyang-li This rule was discussed a long time ago to be implemented in the logical plan, and we did indeed implement a version based on the logical plan before. There are many cases that are not supported if we only rely on the logical plan, for example, the aggregation generated by Bloom filters can only be seen in the physical plan. I remember there are quite a few such cases. Theoretically, duplicates should not appear here because there is logic to eliminate duplicates; I need to debug and take a look at this.

@liujiayi771
Copy link
Contributor

liujiayi771 commented Dec 20, 2024

@PHILO-HE @taiyang-li The reason for the issue here is that after preProject is extracted, when it is merged with the previous project in the CollapseProjectExecTransformer rule, n3 must be replaced with the actual computed expression; otherwise, it will not be able to bindReference.

To eliminate duplicate executions, in this case, the pre-project that is pulled out should not be merged within the CollapseProjectExecTransformer, which would result in an additional project operator being created. However, this would prevent the expression from being computed multiple times. What do you think?

image

cc @ulysses-you Do you have any suggestions?

@taiyang-li
Copy link
Contributor Author

@liujiayi771 it is ok for clickhouse backend, because CH plan optimization would merge adjacent project operators. Besides, whether they are merged or not in CH doesn't have more impacts than duplicated expression evaluations.

@liujiayi771
Copy link
Contributor

@taiyang-li I had a discussion with @ulysses-you offline, and in this case, we need to implement a more refined collapsing strategy. We will remove the column for n3 in the child project and keep only the calculation of n3 from the pre-project. I will work on this part of the development next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants