-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update ClickBench benchmarks with DataFusion 44.0.0
#13983
Comments
Someone pointed out to me the other day that DataFusion 43 is no longer on top of the ClickBench Parquet Leaderboard (actually it was one of the people who has spent substntial time optimizing Hyper...) Thus I think it is that much more valuable to get some DataFusion 44 numbers on the board |
I also filed a ticket to track running clickbench on DataFusion 45 once that is released in a few weeks |
I think Q8, Q16~18, Q35 can be closer to But Q23 is unbelievalbely fast in hyper... I think we may need to profile and think how can we improve it. |
I agree -- in case anyone else wants to see hyper reported 5x faster than DataFusion and 6x faster than DuckDB I think this is Q23
SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10; Profiling it like this: $ datafusion-cli -c "SELECT \"SearchPhrase\", MIN(\"URL\"), MIN(\"Title\"), COUNT(*) AS c, COUNT(DISTINCT \"UserID\") FROM hits_partitioned WHERE \"Title\" LIKE '%Google%' AND \"URL\" NOT LIKE '%.google.%' AND \"SearchPhrase\" <> '' GROUP BY \"SearchPhrase\" ORDER BY c DESC LIMIT 10;" 26% of the time goes to snappy decompression and 40% of the time to utf8 validation: Here is the full flamegraph.svg So by my calculations the snappy decompression time alone in DataFusion (0.26 * 10.28s = 2.6s) takes longer than the hyper reported time of 1.8s 😕 |
If we wanted to juice our numbers we could turn off ut8 validation too but I feel like that would be cheating (as most/many systems would never run without validtion on) |
Q23 might be improved if it can utilize filter pushdown? I think a >5x improvement might come from that. |
Running without filter pushdown (the default) set datafusion.execution.parquet.pushdown_filters = false;
SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits_partitioned WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10; I get: Elapsed 2.232 seconds. When I enabled filter pushdown it goes 15% faster. set datafusion.execution.parquet.pushdown_filters = true;
SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits_partitioned WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10; I get: Still not 5x though 🤔 Though it gives me new motivation tohelp @XiangpengHao get the pushdown improvements over the line in |
@alamb 🤔 Q23 seems to be |
🤔 you are right indeed 🤦 -- sorry about that (I went the wrong direction)
I will profile that and report back |
And in this case enabling predicate pushdown results in a 2x speedup set datafusion.execution.parquet.pushdown_filters = false;
SELECT * FROM hits_partitioned WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10; Elapsed 4.108 seconds. set datafusion.execution.parquet.pushdown_filters = true;
SELECT * FROM hits_partitioned WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10; Elapsed 2.415 seconds. Here is the flamegraph for no pushdown: It would be cool to test with @XiangpengHao 's change to the parquet decoder here: |
It seems it could also benefit from some further utf8 validation speed up, filed it here apache/arrow-rs#7014 |
@alamb Excited to see further optmization about I will profile again with setting |
Thanks @Rachelint For this case I believe the core change needs to happen in the Parquet reader. The background as I understand it is described here @XiangpengHao has a prototype in the following PR A good next step would be to measure how much faster DataFusion is with that PR -- the previous measurements we had a few other optimizations mixed in. |
Is your feature request related to a problem or challenge?
43.0.0
#1309944.0.0
#13334Describe the solution you'd like
Now that DataFusion 44.0.0 is released, It would be great to update ClickBench https://benchmark.clickhouse.com/ with the newest version
ClickBench is a benchmark heavy on filtering and aggregation that we have used as an optimization target for the last several releases.
Describe alternatives you've considered
Additional context
I am especially interested to see the improvements after the vectorized comparison from @Rachelint @jayzhan211 @Dandandan and others in DataFusion aggregate code
The text was updated successfully, but these errors were encountered: