
HybridParquetScan: Refine filter push down to avoid double evaluation #12000

Open · wants to merge 17 commits into base: branch-25.02
Conversation

@thirtiseven (Collaborator) commented on Jan 22, 2025:

Closes #11892

In the current code, a HybridParquetScan followed by a Filter pushes all filter conditions down to the CPU while also leaving them in the Filter, so every condition is evaluated twice. The second evaluation is usually fast, so on its own this is not a big problem. But if some conditions are not supported by the CPU or the GPU, it can cause real problems.

This PR adds a rule that checks each condition in the FilterExec before overriding (a sketch follows the list):

  • If a filter condition is supported by neither the CPU nor the GPU, it falls back to the CPU in the FilterExec and is not pushed down to the scan.
  • If a filter condition is only supported by the CPU, this PR pushes it down to the scan and removes it from the FilterExec.
  • If a filter condition is only supported by the GPU, this PR keeps it in the filter.
  • If all conditions are pushed down to the scan, the FilterExec is removed.
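A minimal sketch of this split, with the support check passed in as a predicate (all names here are illustrative stand-ins, not the PR's actual helpers):

import org.apache.spark.sql.catalyst.expressions.{And, Expression}

// Break a conjunctive condition (a AND b AND c) into its top-level parts.
def splitConjunctivePredicates(cond: Expression): Seq[Expression] = cond match {
  case And(left, right) =>
    splitConjunctivePredicates(left) ++ splitConjunctivePredicates(right)
  case other => Seq(other)
}

// Decide where each condition lives. `supportedByHybrid` stands in for the
// PR's real check against the Gluten support list.
def splitFilterCondition(
    cond: Expression,
    supportedByHybrid: Expression => Boolean): (Seq[Expression], Seq[Expression]) = {
  val parts = splitConjunctivePredicates(cond)
  // `pushed` goes into the hybrid (CPU) scan and leaves FilterExec;
  // `kept` stays in FilterExec (on the GPU if supported, else CPU fallback).
  // If `kept` is empty, the FilterExec itself can be removed.
  parts.partition(supportedByHybrid)
}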

The supportedByHybridFilters list is taken from velox-backend-support-progress in Gluten. The script used to extract the CPU-supported expressions is in a gist.

For example:

scala> val df = spark.read.parquet("parse_url_protocol")
df: org.apache.spark.sql.DataFrame = [url: string, pr: string ... 2 more fields]


scala> df.filter("startswith(pr, 'h') == False and ascii(url) >= 16").show()
25/01/23 16:32:31 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(count#2L as string) AS count#32 will run on GPU
      *Expression <Cast> cast(count#2L as string) will run on GPU
    *Exec <FilterExec> will run on GPU
      *Expression <Not> NOT StartsWith(pr#1, h) will run on GPU
        *Expression <StartsWith> StartsWith(pr#1, h) will run on GPU
      *Exec <FileSourceScanExec> will run on GPU

startswith is not supported on the CPU, so it stays on the GPU, while ascii is pushed down to the CPU. The check is recursive.
[Screenshot 2025-01-23 at 16:38:42]
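Conceptually, the recursive check walks the whole expression tree; a hedged sketch (the whitelist here is illustrative, standing in for supportedByHybridFilters):

import org.apache.spark.sql.catalyst.expressions.Expression

// Illustrative recursive check: an expression is only eligible for pushdown
// if it and every one of its sub-expressions are on the hybrid whitelist.
def isExprSupportedByHybridScan(e: Expression, whitelist: Set[String]): Boolean =
  whitelist.contains(e.prettyName) &&
    e.children.forall(isExprSupportedByHybridScan(_, whitelist))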

And a second example:

scala> df.filter("url >= 'http' and ascii(url) >= 16").show()
25/01/23 16:37:45 WARN GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> cast(count#2L as string) AS count#66 will run on GPU
      *Expression <Cast> cast(count#2L as string) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU

If all filter conditions are supported, the FilterExec is removed entirely.
[Screenshot 2025-01-23 at 16:38:29]

@thirtiseven thirtiseven self-assigned this Jan 22, 2025
@thirtiseven thirtiseven marked this pull request as ready for review January 22, 2025 16:56
}

val supportedByHybridFilters = {
// Only fully supported functions are listed here
Collaborator: Have a link to the supporting list in the comments.

Collaborator Author: done

}
}

def recursivelySupportsHybridFilters(condition: Expression): Boolean = {
Collaborator: nit: would isExprSupportedByHybridScan be a better name?

Collaborator Author: done

}
}

def recursivelySupportsHybridFilters(condition: Expression): Boolean = {
Collaborator: There is a way to register UDFs in Gluten as well (link). Hmm, probably we can have a whitelist configuration allowing pre-registered functions into the pushed-down filters.

Collaborator Author: Added a whitelist config, as the Gluten doc seems to be out of date for some expressions. Maybe this will also allow UDFs to be pushed down to the CPU, but I haven't tested it; will file a follow-up for this.
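A rough sketch of how such a whitelist could be read and merged with the built-in list (the config key name is hypothetical, and the built-in set is trimmed for illustration):

import org.apache.spark.sql.internal.SQLConf

// Trimmed illustration of the built-in list of hybrid-supported expressions.
val supportedByHybridFilters: Set[String] = Set("ascii", "substring")

// Hypothetical key; the PR's actual config name may differ.
val whitelistKey = "spark.rapids.sql.hybrid.whitelistExprs"

// Comma-separated user-provided expression names, merged with the built-in
// list before the pushdown check runs.
val userWhitelist: Set[String] = SQLConf.get
  .getConfString(whitelistKey, "")
  .split(",").map(_.trim).filter(_.nonEmpty).toSet

val effectiveWhitelist: Set[String] = supportedByHybridFilters ++ userWhitelist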

sperlingxx previously approved these changes Jan 23, 2025

@sperlingxx: LGTM. But I have not worked on the GpuOverrides for a long time...

@sperlingxx: build

with_cpu_session(
lambda spark: gen_df(spark, [('a', StringGen(pattern='[0-9]{1,5}'))]).write.parquet(data_path),
conf=rebase_write_corrected_conf)
# filter conditions should remain on the GPU
@res-life (Jan 24, 2025): How to verify the executed plan?

Collaborator Author: I assumed that startsWith was not supported on the CPU, but it actually is, just not in their doc. If we pushed an unsupported operator down to the CPU, the test should fail. I will try to find an expression that Hybrid does not support for this test.

Do you think it is necessary to write some UTs to verify the executed plan?

Collaborator: It would be straightforward to check the execution plan in a UT. IMO we could use a UT instead of an IT. @sperlingxx what do you think?

Collaborator Author: Changed this one to a pandas_udf for now. I think some ITs are still necessary; will try to write some UTs later.

Collaborator Author: Updated the integration tests to verify the executed plan, PTAL.
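For reference, a plan assertion in a Scala UT might look roughly like this (harness details omitted; the helper name is illustrative):

import org.apache.spark.sql.execution.{FilterExec, SparkPlan}

// Illustrative check: once all conditions are pushed into the hybrid scan,
// no FilterExec should survive in the executed plan.
def assertFilterRemoved(plan: SparkPlan): Unit = {
  val filters = plan.collect { case f: FilterExec => f }
  assert(filters.isEmpty, s"expected FilterExec to be removed, found: $filters")
}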

lambda spark: gen_df(spark, [('a', StringGen(pattern='[0-9]{1,5}'))]).write.parquet(data_path),
conf=rebase_write_corrected_conf)
# filter conditions should be pushed down to the CPU, so ascii will not fall back to the CPU in the FilterExec
assert_gpu_and_cpu_are_equal_collect(
@res-life (Jan 24, 2025): How to verify the executed plan?

@thirtiseven (Jan 24, 2025): ascii is not supported on the GPU, so if it isn't pushed down the test will fail.


with_cpu_session(lambda spark: spark.udf.register("udf_fallback", udf_fallback))

assert_gpu_and_cpu_are_equal_collect(
Collaborator: How to verify the executed plan?

Collaborator Author: It is a UDF and not supported on the CPU, so if it were pushed down the test would fail.

* support it. After that we can remove the condition from one side to avoid duplicate execution
* or unnecessary fallback/crash.
*/
def applyHybridScanRules(plan: SparkPlan, conf: RapidsConf): SparkPlan = {
Collaborator: Rename to tryToApplyHybridScanRules? And at the first line of this function, check whether the Hybrid feature is enabled, to avoid executing the following code.

Collaborator Author: done
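The suggested rename plus early exit might look like this (a sketch only; `isHybridEnabled` is a stand-in for the plugin's real config check on RapidsConf):

// Bail out immediately when the Hybrid feature is off, so the rewrite
// logic below never runs.
def isHybridEnabled(conf: RapidsConf): Boolean = false // stand-in

def tryToApplyHybridScanRules(plan: SparkPlan, conf: RapidsConf): SparkPlan = {
  if (!isHybridEnabled(conf)) {
    return plan
  }
  // ... rewrite Filter + HybridParquetScan pairs here ...
  plan
}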


def canBePushedToHybrid(child: SparkPlan, conf: RapidsConf): String = {
  child match {
    case fsse: FileSourceScanExec if HybridFileSourceScanExecMeta.useHybridScan(conf, fsse) =>
Collaborator: Better to move HybridFileSourceScanExecMeta.useHybridScan to an outer function.

Collaborator Author: How about moving the function useHybridScan to HybridExecutionUtils? And maybe all the other functions in object HybridFileSourceScanExecMeta could be moved there too, what do you think?

Collaborator:

> How about moving the function useHybridScan to HybridExecutionUtils?

Yes, good idea.

Collaborator Author: done

@res-life: Did an NDS test; total time improved from 709s to 704s.

@thirtiseven: build

1 similar comment

@thirtiseven: build

sperlingxx previously approved these changes Jan 27, 2025

@sperlingxx: LGTM! Good job!

res-life previously approved these changes Jan 27, 2025

@res-life: LGTM

@thirtiseven thirtiseven dismissed stale reviews from res-life and sperlingxx via 0c24b7a January 27, 2025 03:59
@thirtiseven: build

@thirtiseven: Addressed some comments from @res-life in an offline sync, please take another look.

res-life previously approved these changes Jan 27, 2025
@thirtiseven: build

@thirtiseven: build

Successfully merging this pull request may close these issues.

[FEA] [FOLLOW-UP] [Hybrid/C2C] Validate predicate push down and filtering