
Update ClickBench benchmarks with DataFusion 44.0.0 #13983

Open
alamb opened this issue Jan 2, 2025 · 14 comments
Labels: enhancement (New feature or request)

Comments


alamb commented Jan 2, 2025

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Now that DataFusion 44.0.0 is released, it would be great to update ClickBench (https://benchmark.clickhouse.com/) with the newest version.

ClickBench is a benchmark heavy on filtering and aggregation that we have used as an optimization target for the last several releases.

Describe alternatives you've considered

Additional context

I am especially interested to see the improvements from the vectorized comparison work by @Rachelint, @jayzhan211, @Dandandan, and others in DataFusion's aggregate code.
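For readers unfamiliar with the idea, here is a toy sketch (in Python, not DataFusion's actual Rust code) of what "vectorized comparison" in hash aggregation means: instead of comparing each input row's group key against its candidate group one row at a time, the comparison is done one key *column* at a time across the whole batch, which is much friendlier to the CPU:

```python
# Toy illustration of vectorized group-key comparison in hash aggregation.
# This is NOT DataFusion's implementation, just the shape of the idea.

def rowwise_matches(batch_keys, group_keys, candidate_idx):
    # Row-at-a-time: one whole-tuple comparison per input row.
    return [batch_keys[i] == group_keys[candidate_idx[i]]
            for i in range(len(batch_keys))]

def vectorized_matches(batch_cols, group_cols, candidate_idx):
    # Column-at-a-time: start all-true, then AND in one column
    # comparison per key column, over the entire batch at once.
    matches = [True] * len(candidate_idx)
    for bcol, gcol in zip(batch_cols, group_cols):
        matches = [m and bcol[i] == gcol[candidate_idx[i]]
                   for i, m in enumerate(matches)]
    return matches

# Keys ("a",1), ("b",2), ("a",3) probed against groups ("a",1), ("b",3):
print(vectorized_matches([["a", "b", "a"], [1, 2, 3]],
                         [["a", "b"], [1, 3]],
                         [0, 1, 1]))
# → [True, False, False]
```

In a columnar engine the inner column comparison becomes a single kernel over contiguous arrays, which is where the speedup comes from.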

@alamb added the enhancement label Jan 2, 2025

alamb commented Jan 23, 2025

Someone pointed out to me the other day that DataFusion 43 is no longer on top of the ClickBench Parquet Leaderboard

[Image: ClickBench Parquet leaderboard, with DataFusion 43 no longer on top]

(actually it was one of the people who has spent substantial time optimizing Hyper...)

Thus I think it is that much more valuable to get some DataFusion 44 numbers on the board


alamb commented Jan 23, 2025

I also filed a ticket to track running ClickBench on DataFusion 45 once it is released in a few weeks.


Rachelint commented Jan 24, 2025

I think Q8, Q16–Q18, and Q35 can be closer to Hyper in 44.0; they were improved in #12996.
And Q35 can be even faster once #13617 is merged (unfortunately, it can only be released in 46.0 due to my recent delay...).

But Q23 is unbelievably fast in Hyper... I think we may need to profile it and think about how we can improve.


alamb commented Jan 24, 2025

I think Q8, Q16–Q18, and Q35 can be closer to Hyper in 44.0; they were improved in #12996. And Q35 can be even faster once #13617 is merged (unfortunately, it can only be released in 46.0 due to my recent delay...).

But Q23 is unbelievably fast in Hyper... I think we may need to profile it and think about how we can improve.

I agree -- in case anyone else wants to see it: Hyper is reported 5x faster than DataFusion and 6x faster than DuckDB.

[Image: ClickBench per-query results showing Hyper ~5x faster than DataFusion on this query]

I think this is Q23

SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

Profiling it like this:

$ datafusion-cli -c "SELECT \"SearchPhrase\", MIN(\"URL\"), MIN(\"Title\"), COUNT(*) AS c, COUNT(DISTINCT \"UserID\") FROM hits_partitioned WHERE \"Title\" LIKE '%Google%' AND \"URL\" NOT LIKE '%.google.%' AND \"SearchPhrase\" <> '' GROUP BY \"SearchPhrase\" ORDER BY c DESC LIMIT 10;"

26% of the time goes to snappy decompression and 40% of the time to utf8 validation:

[Image: flamegraph excerpt showing snappy decompression and utf8 validation hotspots]

Here is the full flamegraph.svg

So by my calculations the snappy decompression time alone in DataFusion (0.26 * 10.28s = 2.6s) takes longer than the hyper reported time of 1.8s 😕
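The arithmetic is easy to check; all of the inputs (the 26% and 40% profile shares, the 10.28 s DataFusion wall time, and Hyper's reported 1.8 s) come from the profile and leaderboard above:

```python
# Back-of-the-envelope check: the snappy decompression share of the
# DataFusion run alone exceeds Hyper's reported total for this query.
total_s = 10.28        # DataFusion wall time from the profile
snappy_share = 0.26    # fraction of samples in snappy decompression
utf8_share = 0.40      # fraction of samples in utf8 validation
hyper_s = 1.8          # Hyper's reported time on the leaderboard

snappy_s = snappy_share * total_s
utf8_s = utf8_share * total_s
print(f"snappy: {snappy_s:.2f} s, utf8: {utf8_s:.2f} s, hyper total: {hyper_s} s")
# → snappy: 2.67 s, utf8: 4.11 s, hyper total: 1.8 s
```

So even with decompression counted generously, DataFusion would need to eliminate essentially all decompression and validation time to match Hyper here.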


alamb commented Jan 24, 2025

If we wanted to juice our numbers we could turn off utf8 validation too, but I feel like that would be cheating (as most systems would never run without validation on).

@Dandandan

Q23 might be improved if it can utilize filter pushdown? I think a >5x improvement might come from that.

@alamb
Copy link
Contributor Author

alamb commented Jan 24, 2025

Q23 might be improved if it can utilize filter pushdown? I think a >5x improvement might come from that.

Running without filter pushdown (the default):

set datafusion.execution.parquet.pushdown_filters = false;

SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits_partitioned WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

I get:

Elapsed 2.232 seconds.
Elapsed 2.252 seconds.
Elapsed 2.236 seconds.

When I enable filter pushdown it goes roughly 15% faster.

set datafusion.execution.parquet.pushdown_filters = true;

SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits_partitioned WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

I get:
Elapsed 1.981 seconds.
Elapsed 1.953 seconds.
Elapsed 1.966 seconds.

Still not 5x though 🤔
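For the record, averaging the three Elapsed times in each configuration gives about a 1.14x speedup, consistent with the ~15% figure quoted above:

```python
# Speedup from enabling datafusion.execution.parquet.pushdown_filters,
# computed from the Elapsed times reported in this comment.
no_pushdown = [2.232, 2.252, 2.236]
pushdown = [1.981, 1.953, 1.966]

avg_no = sum(no_pushdown) / len(no_pushdown)
avg_yes = sum(pushdown) / len(pushdown)
speedup = avg_no / avg_yes
print(f"{avg_no:.3f} s -> {avg_yes:.3f} s, {speedup:.2f}x")
# → 2.240 s -> 1.967 s, 1.14x
```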

Though it gives me new motivation to help @XiangpengHao get the pushdown improvements over the line in

apache/arrow-rs#6921


Rachelint commented Jan 24, 2025

@alamb 🤔 Q23 seems to be SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10 ?


alamb commented Jan 24, 2025

@alamb 🤔 Q23 seems to be SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10 ?

🤔 you are right indeed 🤦 -- sorry about that (I went in the wrong direction)

SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;

I will profile that and report back


alamb commented Jan 24, 2025

And in this case enabling predicate pushdown results in a 2x speedup

set datafusion.execution.parquet.pushdown_filters = false;
SELECT * FROM hits_partitioned WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;

Elapsed 4.108 seconds.
Elapsed 5.430 seconds.
Elapsed 4.659 seconds.

set datafusion.execution.parquet.pushdown_filters = true;
SELECT * FROM hits_partitioned WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;

Elapsed 2.415 seconds.
Elapsed 2.070 seconds.
Elapsed 2.279 seconds.
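As a quick check on that 2x figure, here is the same comparison using the fastest of the three runs in each configuration as a simple summary statistic:

```python
# Speedup from enabling pushdown_filters on the correct Q23,
# using the fastest run from each set of Elapsed times above.
no_pushdown = [4.108, 5.430, 4.659]
pushdown = [2.415, 2.070, 2.279]

speedup = min(no_pushdown) / min(pushdown)
print(f"{min(no_pushdown)} s -> {min(pushdown)} s, {speedup:.2f}x")
# → 4.108 s -> 2.07 s, 1.98x
```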

Here is the flamegraph for no pushdown:

[Image: flamegraph for this query without filter pushdown]

It would be cool to test with @XiangpengHao 's change to the parquet decoder here:

@Dandandan

It seems it could also benefit from some further utf8 validation speedup; filed it here: apache/arrow-rs#7014

@Rachelint

Rachelint commented Jan 24, 2025

@alamb Excited to see further optimization around late materialization; it is a really important feature, I think!
I tried to use it in HoraeDB last year and ran into the same problem mentioned in #6921, which was frustrating...

I will profile again with datafusion.execution.parquet.pushdown_filters = true; and see what optimizations we can do in DataFusion.


alamb commented Jan 24, 2025

@alamb Excited to see further optimization around late materialization; it is a really important feature, I think! I tried to use it in HoraeDB last year and ran into the same problem mentioned in #6921, which was frustrating...

I will profile again with datafusion.execution.parquet.pushdown_filters = true; and see what optimizations we can do in DataFusion.

Thanks @Rachelint

For this case I believe the core change needs to happen in the Parquet reader. The background as I understand it is described here

@XiangpengHao has a prototype in the following PR

A good next step would be to measure how much faster DataFusion is with that PR -- in the previous measurements we had a few other optimizations mixed in.

@Rachelint

Here is the flamegraph with pushdown enabled; over 36% of the time is spent in decompression.

[Image: profile flamegraph with pushdown enabled]
