Casting from Binary --> Utf8 to evaluate LIKE slows down some ClickBench queries #12509

Open
Tracked by #11752
alamb opened this issue Sep 17, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Sep 17, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

While working on enabling StringView by default in #12092 I noticed that some of the clickbench queries got 10% slower and looked into it.

The plan looks like this:

DataFusion CLI v41.0.0
+---------------+----------------------------------------------------------------------------------------------------$
| plan_type     | plan                                                                                               $
+---------------+----------------------------------------------------------------------------------------------------$
| physical_plan | AggregateExec: mode=Final, gby=[], aggr=[count(*)]                                                 $
|               |   CoalescePartitionsExec                                                                           $
|               |     AggregateExec: mode=Partial, gby=[], aggr=[count(*)]                                           $
|               |       ProjectionExec: expr=[]                                                                      $
|               |         CoalesceBatchesExec: target_batch_size=8192                                                $
|               |           FilterExec: CAST(URL@0 AS Utf8View) LIKE %google%                                        $
|               |             ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Software/datafusion/benchmarks/$
+---------------+----------------------------------------------------------------------------------------------------$
2 row(s) fetched.
Elapsed 0.065 seconds.

When looking at the flamegraphs, you can see the CAST spends a huge amount of time validating utf8 (more time than evaluating the LIKE predicate itself):
Screenshot 2024-09-17 at 11 06 34 AM

Here are the full flamegraphs for comparison:
q20-flamegraph-main
q20-flamegraph-stringview

I believe the issue is here:

|               |           FilterExec: CAST(URL@0 AS Utf8View) LIKE %google%

This filter first CASTs the URL column to Utf8View and then evaluates LIKE.

Converting BinaryArray --> StringArray, as is done without StringView, is relatively fast because it is done with a single large function call.

However, converting BinaryViewArray --> StringViewArray is not, as it makes many small function calls. The parquet reader has a special optimization for this, as described in "Section 2.1: From binary to strings" of the Using StringView / German Style Strings to Make Queries Faster: Part 1 - Reading Parquet from @XiangpengHao
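
To make the difference in call pattern concrete, here is a minimal standard-library sketch. It is a simplified model, not the arrow-rs implementation: the function names and the `offsets` / `values` parameters are made up for illustration. The contiguous layout validates the whole buffer in one call and then checks that each value boundary lands on a character boundary, while the view-style layout has to validate each value separately.

```rust
/// Contiguous layout (modelled on `BinaryArray`): all values share one
/// buffer, so UTF-8 validation is a single large call over that buffer,
/// followed by a cheap check that every value boundary is a char boundary.
/// `offsets` is a hypothetical list of value start/end positions.
fn validate_contiguous(data: &[u8], offsets: &[usize]) -> bool {
    match std::str::from_utf8(data) {
        // One large validation call, then one cheap check per offset.
        Ok(text) => offsets.iter().all(|&o| text.is_char_boundary(o)),
        Err(_) => false,
    }
}

/// View-style layout (modelled on `BinaryViewArray`): each value is
/// reached on its own, so validation is one small call per value.
fn validate_per_value(values: &[&[u8]]) -> bool {
    values.iter().all(|v| std::str::from_utf8(v).is_ok())
}
```

Both functions accept and reject the same inputs; the sketch only shows why the second one pays a per-value call overhead that the first one avoids.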

Describe the solution you'd like
I would like this query to go as fast or faster with Utf8View / BinaryView enabled.

Bonus points if it went faster even without Utf8View enabled

Describe alternatives you've considered

Option 1: LIKE for binary

One option is to skip validating UTF8 entirely and evaluate LIKE directly on binary. This would mean that if the column is read as binary we could cast the argument '%google%' to binary and then evaluate LIKE directly on the binary column. This would skip validating utf8 completely.
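
As a rough illustration of what evaluating `LIKE '%google%'` directly on bytes could look like, here is a standard-library sketch. The helper names `bytes_contains` and `filter_like_contains` are hypothetical and are not part of arrow or DataFusion. Because UTF-8 is self-synchronizing, a byte-wise match of a valid UTF-8 needle inside a valid UTF-8 value agrees with a string-level match, so no validation pass is needed for the contains case.

```rust
/// Returns true if `needle` occurs anywhere in `haystack`, comparing raw
/// bytes only. No UTF-8 validation is performed on either argument.
fn bytes_contains(haystack: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() {
        // `LIKE '%%'` matches everything; also avoids `windows(0)`, which panics.
        return true;
    }
    haystack.windows(needle.len()).any(|w| w == needle)
}

/// Hypothetical filter over a binary column: one boolean per row.
fn filter_like_contains(column: &[&[u8]], needle: &str) -> Vec<bool> {
    column
        .iter()
        .map(|value| bytes_contains(value, needle.as_bytes()))
        .collect()
}
```

One behavioral difference to keep in mind: a row holding invalid UTF-8 is simply searched here, whereas the CAST-based plan would have to reject it.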

Unfortunately, it appears that the like kernel is only implemented for StringArray and StringViewArray at the moment, not BinaryArray: https://docs.rs/arrow-string/53.0.0/src/arrow_string/like.rs.html#110-149

Another related option would be to special case the LIKE rewrite for just prefix / contains / suffix patterns: in this case, rewrite <binary> LIKE <const that starts and ends with '%'> --> <binary> CONTAINS <string>
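
A sketch of how such a special case might recognize the simple patterns is below. The `LikeRewrite` enum and `classify_like` function are hypothetical names for illustration, and the check is deliberately conservative: any pattern whose middle part contains `%`, `_`, or a backslash escape is left alone.

```rust
#[derive(Debug, PartialEq)]
enum LikeRewrite<'a> {
    Contains(&'a str),   // '%abc%'
    StartsWith(&'a str), // 'abc%'
    EndsWith(&'a str),   // '%abc'
    Unsupported,         // anything else keeps the general LIKE path
}

/// Classifies a constant LIKE pattern into one of the simple forms.
fn classify_like(pattern: &str) -> LikeRewrite<'_> {
    // Peel off at most one leading and one trailing '%'.
    let (lead, rest) = match pattern.strip_prefix('%') {
        Some(r) => (true, r),
        None => (false, pattern),
    };
    let (trail, inner) = match rest.strip_suffix('%') {
        Some(r) => (true, r),
        None => (false, rest),
    };
    // Remaining wildcards or escapes mean the pattern is not a plain literal.
    let is_special = |c: char| matches!(c, '%' | '_' | '\\');
    if inner.is_empty() || inner.contains(is_special) {
        return LikeRewrite::Unsupported;
    }
    match (lead, trail) {
        (true, true) => LikeRewrite::Contains(inner),
        (false, true) => LikeRewrite::StartsWith(inner),
        (true, false) => LikeRewrite::EndsWith(inner),
        // No wildcard at all is plain equality, which this sketch does not cover.
        (false, false) => LikeRewrite::Unsupported,
    }
}
```

For the query in this issue, `classify_like("%google%")` yields `Contains("google")`, which could then be evaluated as a byte-level substring search.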

Option 2: resolve the column as Utf8 rather than Binary

For some reason the schema of hits.parquet (the single file from ClickBench) has the URL column (and others) as Utf8 (strings) but the hits_partitioned file resolves it as Binary.

We could change the schema resolution logic to resolve the column as a String instead.

This option is probably slower than option 1, but I think it is more in line with the intended semantics (these columns contain logical strings), the parquet reader includes the fast read path for such strings, and it would be more general.

Filed #12510 to track this idea

Additional context

@alamb
Contributor Author

alamb commented Sep 22, 2024

I have been thinking about this, and I came up with a third option which is to "push the casting into the scan"

Consider this plan for q28:

+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------$
| plan_type     | plan                                                                                                                                                                                                                                                                                                             $
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------$
| logical_plan  | Sort: l DESC NULLS FIRST, fetch=25                                                                                                                                                                                                                                                                               $
|               |   Projection: regexp_replace(hits_partitioned.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1")) AS k, avg(character_length(hits_partitioned.Referer)) AS l, count(*) AS c, min(hits_partitioned.Referer)                                                                                               $
|               |     Filter: count(*) > Int64(100000)                                                                                                                                                                                                                                                                             $
|               |       Aggregate: groupBy=[[regexp_replace(__common_expr_1 AS hits_partitioned.Referer, Utf8("^https?://(?:www\.)?([^/]+)/.*$"), Utf8("\1"))]], aggr=[[avg(CAST(character_length(__common_expr_1 AS hits_partitioned.Referer) AS Float64)), count(Int64(1)) AS count(*), min(hits_partitioned.Referer)]]          $
|               |         Projection: CAST(hits_partitioned.Referer AS Utf8) AS __common_expr_1, hits_partitioned.Referer                                                                                                                                                                                                          $
|               |           Filter: hits_partitioned.Referer != BinaryView("")                                                                                                                                                                                                                                                     $
|               |             TableScan: hits_partitioned projection=[Referer], partial_filters=[hits_partitioned.Referer != BinaryView("")]  

The projection

 Projection: CAST(hits_partitioned.Referer AS Utf8) AS __common_expr_1, hits_partitioned.Referer

is what is causing a non-trivial slowdown.

The issue is that hits_partitioned.Referer is read as a BinaryView.

The problem is that BinaryView --> Utf8View conversion is much slower than reading Utf8View directly out of the parquet file due to the Utf8 optimization described by @XiangpengHao in "Section 2.1: From binary to strings" of the string view blog.

Option 3: Implement push down casting (maybe as an Analyzer rule??)

The theory here is that some readers (such as the parquet reader) can produce the data more efficiently in a particular format than by creating it first in one format before DataFusion casts it to another.

So the plan above would basically push the cast down so the parquet reader reads the hits_partitioned.Referer as a Utf8View to begin with
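
The idea can be sketched with a toy plan model. Everything here is hypothetical: the `Plan` and `DataType` enums and `push_down_cast` are illustration only and do not mirror DataFusion's plan or expression types (in DataFusion the CAST is an expression inside a node, not a node of its own). The rewrite looks for a cast to Utf8View sitting directly on a scan that produces BinaryView, and replaces the pair with a scan that produces Utf8View.

```rust
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    BinaryView,
    Utf8View,
}

/// Toy plan: casts are modelled as nodes to keep the example small.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { column: String, output_type: DataType },
    Cast { input: Box<Plan>, to: DataType },
    Filter { input: Box<Plan>, predicate: String },
}

/// Pushes `CAST(.. AS Utf8View)` into a BinaryView scan directly below it.
fn push_down_cast(plan: Plan) -> Plan {
    match plan {
        Plan::Cast { input, to } => match push_down_cast(*input) {
            // The scan is asked to produce the target type itself.
            Plan::Scan { column, output_type: DataType::BinaryView }
                if to == DataType::Utf8View =>
            {
                Plan::Scan { column, output_type: DataType::Utf8View }
            }
            // Any other input keeps its explicit cast.
            other => Plan::Cast { input: Box::new(other), to },
        },
        Plan::Filter { input, predicate } => Plan::Filter {
            input: Box::new(push_down_cast(*input)),
            predicate,
        },
        scan => scan,
    }
}
```

A real rule would also have to decide what happens when other expressions still reference the column in its original type, as `min(hits_partitioned.Referer)` does in the q28 plan above.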
