
Update ClickBench benchmarks with DataFusion 44.0.0 #13983

Open
alamb opened this issue Jan 2, 2025 · 14 comments
Labels: enhancement (New feature or request)

Comments


alamb commented Jan 2, 2025

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Now that DataFusion 44.0.0 is released, it would be great to update ClickBench (https://benchmark.clickhouse.com/) with the newest version.

ClickBench is a benchmark heavy on filtering and aggregation that we have used as an optimization target for the last several releases.

Describe alternatives you've considered

Additional context

I am especially interested to see the improvements from the vectorized comparison work by @Rachelint, @jayzhan211, @Dandandan, and others in DataFusion's aggregate code.
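For readers unfamiliar with the idea, here is a toy sketch (in Python, not DataFusion's actual Rust code) of what "vectorized comparison" in hash aggregation means: instead of comparing each input row's group key against its candidate group one row at a time, the comparison is done one key *column* at a time across the whole batch, which is much friendlier to the CPU:

```python
# Toy illustration of vectorized group-key comparison in hash aggregation.
# This is NOT DataFusion's implementation, just the shape of the idea.

def rowwise_matches(batch_keys, group_keys, candidate_idx):
    # Row-at-a-time: one whole-tuple comparison per input row.
    return [batch_keys[i] == group_keys[candidate_idx[i]]
            for i in range(len(batch_keys))]

def vectorized_matches(batch_cols, group_cols, candidate_idx):
    # Column-at-a-time: start all-true, then AND in one column
    # comparison per key column, over the entire batch at once.
    matches = [True] * len(candidate_idx)
    for bcol, gcol in zip(batch_cols, group_cols):
        matches = [m and bcol[i] == gcol[candidate_idx[i]]
                   for i, m in enumerate(matches)]
    return matches

# Keys ("a",1), ("b",2), ("a",3) probed against groups ("a",1), ("b",3):
print(vectorized_matches([["a", "b", "a"], [1, 2, 3]],
                         [["a", "b"], [1, 3]],
                         [0, 1, 1]))
# → [True, False, False]
```

In a columnar engine the inner column comparison becomes a single kernel over contiguous arrays, which is where the speedup comes from.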

@alamb added the enhancement label Jan 2, 2025

alamb commented Jan 23, 2025

Someone pointed out to me the other day that DataFusion 43 is no longer on top of the ClickBench Parquet Leaderboard

[Image: ClickBench Parquet leaderboard, with DataFusion 43 no longer on top]

(actually it was one of the people who has spent substantial time optimizing Hyper...)

Thus I think it is that much more valuable to get some DataFusion 44 numbers on the board


alamb commented Jan 23, 2025

I also filed a ticket to track running ClickBench on DataFusion 45 once it is released in a few weeks.


Rachelint commented Jan 24, 2025

I think Q8, Q16–Q18, and Q35 can be closer to Hyper in 44.0; they were improved in #12996.
And Q35 can be even faster once #13617 is merged (unfortunately, it can only be released in 46.0 due to my recent delay...).

But Q23 is unbelievably fast in Hyper... I think we may need to profile it and think about how we can improve.


alamb commented Jan 24, 2025

I think Q8, Q16–Q18, and Q35 can be closer to Hyper in 44.0; they were improved in #12996. And Q35 can be even faster once #13617 is merged (unfortunately, it can only be released in 46.0 due to my recent delay...).

But Q23 is unbelievably fast in Hyper... I think we may need to profile it and think about how we can improve.

I agree -- in case anyone else wants to see it: Hyper is reported 5x faster than DataFusion and 6x faster than DuckDB.

[Image: ClickBench per-query results showing Hyper ~5x faster than DataFusion on this query]

I think this is Q23

SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

Profiling it like this:

$ datafusion-cli -c "SELECT \"SearchPhrase\", MIN(\"URL\"), MIN(\"Title\"), COUNT(*) AS c, COUNT(DISTINCT \"UserID\") FROM hits_partitioned WHERE \"Title\" LIKE '%Google%' AND \"URL\" NOT LIKE '%.google.%' AND \"SearchPhrase\" <> '' GROUP BY \"SearchPhrase\" ORDER BY c DESC LIMIT 10;"

26% of the time goes to snappy decompression and 40% of the time to utf8 validation:

[Image: flamegraph excerpt showing snappy decompression and utf8 validation hotspots]

Here is the full flamegraph.svg

So by my calculations the snappy decompression time alone in DataFusion (0.26 * 10.28s = 2.6s) takes longer than the hyper reported time of 1.8s 😕
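The arithmetic is easy to check; all of the inputs (the 26% and 40% profile shares, the 10.28 s DataFusion wall time, and Hyper's reported 1.8 s) come from the profile and leaderboard above:

```python
# Back-of-the-envelope check: the snappy decompression share of the
# DataFusion run alone exceeds Hyper's reported total for this query.
total_s = 10.28        # DataFusion wall time from the profile
snappy_share = 0.26    # fraction of samples in snappy decompression
utf8_share = 0.40      # fraction of samples in utf8 validation
hyper_s = 1.8          # Hyper's reported time on the leaderboard

snappy_s = snappy_share * total_s
utf8_s = utf8_share * total_s
print(f"snappy: {snappy_s:.2f} s, utf8: {utf8_s:.2f} s, hyper total: {hyper_s} s")
# → snappy: 2.67 s, utf8: 4.11 s, hyper total: 1.8 s
```

So even with decompression counted generously, DataFusion would need to eliminate essentially all decompression and validation time to match Hyper here.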


alamb commented Jan 24, 2025

If we wanted to juice our numbers we could turn off utf8 validation too, but I feel like that would be cheating (as most systems would never run without validation on).

@Dandandan

Q23 might be improved if it can utilize filter pushdown? I think a >5x improvement might come from that.

@alamb
Copy link
Contributor Author

alamb commented Jan 24, 2025

Q23 might be improved if it can utilize filter pushdown? I think a >5x improvement might come from that.

Running without filter pushdown (the default):

set datafusion.execution.parquet.pushdown_filters = false;

SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits_partitioned WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

I get:

Elapsed 2.232 seconds.
Elapsed 2.252 seconds.
Elapsed 2.236 seconds.

When I enable filter pushdown it goes roughly 15% faster.

set datafusion.execution.parquet.pushdown_filters = true;

SELECT "SearchPhrase", MIN("URL"), MIN("Title"), COUNT(*) AS c, COUNT(DISTINCT "UserID") FROM hits_partitioned WHERE "Title" LIKE '%Google%' AND "URL" NOT LIKE '%.google.%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

I get:
Elapsed 1.981 seconds.
Elapsed 1.953 seconds.
Elapsed 1.966 seconds.

Still not 5x though 🤔
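For the record, averaging the three Elapsed times in each configuration gives about a 1.14x speedup, consistent with the ~15% figure quoted above:

```python
# Speedup from enabling datafusion.execution.parquet.pushdown_filters,
# computed from the Elapsed times reported in this comment.
no_pushdown = [2.232, 2.252, 2.236]
pushdown = [1.981, 1.953, 1.966]

avg_no = sum(no_pushdown) / len(no_pushdown)
avg_yes = sum(pushdown) / len(pushdown)
speedup = avg_no / avg_yes
print(f"{avg_no:.3f} s -> {avg_yes:.3f} s, {speedup:.2f}x")
# → 2.240 s -> 1.967 s, 1.14x
```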

Though it gives me new motivation to help @XiangpengHao get the pushdown improvements over the line in

apache/arrow-rs#6921


Rachelint commented Jan 24, 2025

@alamb 🤔 Q23 seems to be SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10 ?


alamb commented Jan 24, 2025

@alamb 🤔 Q23 seems to be SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10 ?

🤔 you are right indeed 🤦 -- sorry about that (I went in the wrong direction)

SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;

I will profile that and report back


alamb commented Jan 24, 2025

And in this case enabling predicate pushdown results in a 2x speedup

set datafusion.execution.parquet.pushdown_filters = false;
SELECT * FROM hits_partitioned WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;

Elapsed 4.108 seconds.
Elapsed 5.430 seconds.
Elapsed 4.659 seconds.

set datafusion.execution.parquet.pushdown_filters = true;
SELECT * FROM hits_partitioned WHERE "URL" LIKE '%google%' ORDER BY to_timestamp_seconds("EventTime") LIMIT 10;

Elapsed 2.415 seconds.
Elapsed 2.070 seconds.
Elapsed 2.279 seconds.
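As a quick check on that 2x figure, here is the same comparison using the fastest of the three runs in each configuration as a simple summary statistic:

```python
# Speedup from enabling pushdown_filters on the correct Q23,
# using the fastest run from each set of Elapsed times above.
no_pushdown = [4.108, 5.430, 4.659]
pushdown = [2.415, 2.070, 2.279]

speedup = min(no_pushdown) / min(pushdown)
print(f"{min(no_pushdown)} s -> {min(pushdown)} s, {speedup:.2f}x")
# → 4.108 s -> 2.07 s, 1.98x
```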

Here is the flamegraph for no pushdown:

[Image: flamegraph for this query without filter pushdown]

It would be cool to test with @XiangpengHao 's change to the parquet decoder here:

@Dandandan

It seems it could also benefit from some further utf8 validation speedup; filed it here: apache/arrow-rs#7014

@Rachelint

Rachelint commented Jan 24, 2025

@alamb Excited to see further optimization around late materialization; it is a really important feature, I think!
I tried to use it in HoraeDB last year and ran into the same problem mentioned in #6921, which was frustrating...

I will profile again with datafusion.execution.parquet.pushdown_filters = true; and see what optimizations we can do in DataFusion.


alamb commented Jan 24, 2025

@alamb Excited to see further optimization around late materialization; it is a really important feature, I think! I tried to use it in HoraeDB last year and ran into the same problem mentioned in #6921, which was frustrating...

I will profile again with datafusion.execution.parquet.pushdown_filters = true; and see what optimizations we can do in DataFusion.

Thanks @Rachelint

For this case I believe the core change needs to happen in the Parquet reader. The background as I understand it is described here

@XiangpengHao has a prototype in the following PR

A good next step would be to measure how much faster DataFusion is with that PR -- in the previous measurements we had a few other optimizations mixed in.

@Rachelint

Here is the flamegraph with pushdown enabled; over 36% of the time is spent in decompression.

[Image: profile flamegraph with pushdown enabled]
