Performance issue on TPCH comparing with sparksql #423
Just in case it might be helpful, here is a log snippet related to the serialization overhead of spark execution with wayang I mentioned above:
Thanks for all the info @wangxiaoying. I can take a look at it next week. In the meantime, can you confirm that the operations executed in Postgres and in Spark are the same in SparkSQL and in Wayang? In Wayang you can actually force where each operator is executed with the .withPlatform() method, so that you can make sure the plans are the same.
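For illustration, pinning operators to platforms might look like the sketch below in Wayang's Scala API. The comment above names .withPlatform(); the call shown here is withTargetPlatforms, as found in the Scala API, and the plan itself is a hypothetical placeholder, so treat names and signatures as approximate:

```scala
import org.apache.wayang.api.PlanBuilder
import org.apache.wayang.core.api.WayangContext
import org.apache.wayang.java.Java
import org.apache.wayang.spark.Spark

// Hypothetical two-operator plan -- the TPC-H Q3 plan in wayang-benchmark is
// much larger, but the pinning call is the same per operator.
val context = new WayangContext()
  .withPlugin(Java.basicPlugin())
  .withPlugin(Spark.basicPlugin())

val result = new PlanBuilder(context)
  .readTextFile("file:///path/to/lineitem.tbl") // placeholder input
  .map(_.split('|'))
  .withTargetPlatforms(Spark.platform())        // pin this operator to Spark
  .collect()
```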
Thank you @zkaoudi for the quick response!
Yes. I checked the log on the Postgres side. Here is the query-fetching log when using Wayang:
And this is the log when using sparksql:
In general, Postgres does similar computation under the two setups. It seems like SparkSQL generates additional "IS NOT NULL" filters, but these won't actually filter out any data, since the TPC-H dataset contains no NULL values. In addition, it didn't push down the
P.S. I think it is due to the racing on the
This can be one of the reasons for the performance difference, but I think the execution difference later inside the Spark platform is more significant for the whole query.
Hi @wangxiaoying, before digging into the details and just to make sure we are comparing the same installations, I was wondering whether you are using the same Spark version for both runs.
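For a quick check, the runtime version can be printed from the session on the SparkSQL side; for the Wayang run, it is whichever Spark the wayang-submit classpath bundles. A minimal sketch, assuming spark is an active SparkSession:

```scala
// Print the Spark version this session actually runs on.
println(spark.version)
```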
Hi again, I would suggest two things to check:
The performance difference also stems from the current implementation of the Postgres-to-Spark channel conversion (see line 46 at commit e891913).
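For context, here is a plausible mechanism for the oversized tasks reported in this thread, assuming (and this is an assumption about the linked conversion code, not a confirmed reading of it) that the conversion materializes the JDBC result on the driver and redistributes it with parallelize: Spark embeds the partitions of a parallelized local collection in the serialized tasks themselves, so task size grows with the data volume. A minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("large-task-demo").getOrCreate()
val sc = spark.sparkContext

// Hypothetical stand-in for rows fetched from Postgres onto the driver.
val rowsOnDriver: Seq[String] = (1 to 1000000).map(i => s"row-$i")

// parallelize() ships the collection's partitions inside the serialized tasks,
// so each ShuffleMapTask carries a slice of the data itself; with enough rows,
// Spark warns about "a task of very large size".
val rdd = sc.parallelize(rowsOnDriver, numSlices = 64)
rdd.map(r => (r.length, 1)).reduceByKey(_ + _).count()
```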
Sorry for the late reply, I was out last week. Yes, I can confirm this.
Yes, I think the join algorithms executed by the two approaches are different. Below is the default physical plan generated by Spark:
I tried to add config:
Spark uses lazy evaluation, so the view creation does not take much time (only some metadata is fetched). And as I have shown above in the Postgres log, Spark does fetch the three tables (with projection and filter pushdown) at runtime just like Wayang does. I still think one key difference is task serialization: Wayang creates much larger Spark tasks (>40MB), which makes the serialization overhead no longer negligible, but I'm not sure why such big tasks are created.
Thanks @wangxiaoying. I guess the broadcast join reduces the amount of data shuffled for this specific dataset/query. Could you disable the broadcast join in Spark to check whether the difference comes from the join only?
Hi @zkaoudi, I set the config of
The performance does not change much (still ~40s).
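For reference, broadcast joins in Spark are typically disabled by setting spark.sql.autoBroadcastJoinThreshold to -1; this is a documented Spark SQL setting, though I'm assuming it is the config referred to in the truncated comment above:

```scala
// Disable broadcast hash joins so Spark falls back to a shuffle-based join
// (typically sort-merge join).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```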
To get around the potential performance issue of the Postgres-to-Spark channel conversion, you could add the Java platform and see what you get.
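Presumably this means adding java to the platform list of the submit command from the description, e.g.:

```
./wayang-1.0.0-SNAPSHOT/bin/wayang-submit org.apache.wayang.apps.tpch.TpcH exp\(123\) java,spark,postgres file:///path/to/wayang.properties Q3
```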
Thanks @zkaoudi for the suggestion. Adding Java improves the overall performance (since the final plan no longer involves Spark). It results in 44 seconds, similar to but still slightly slower than SparkSQL (40 seconds).
Description
I'm trying to run TPC-H Q3 and compare the performance between Wayang and SparkSQL under the following setup:
I try to keep the Spark settings the same for both runs. For Q3, Wayang took around 3 minutes while SparkSQL took only 40 seconds.
To reproduce
To run Wayang, I compile the project locally (using tag 1.0.0) and use the benchmark code under wayang-benchmark directly:

```
./wayang-1.0.0-SNAPSHOT/bin/wayang-submit org.apache.wayang.apps.tpch.TpcH exp\(123\) spark,postgres file:///path/to/wayang.properties Q3
```
The wayang.properties file is like the following:
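For illustration only, a minimal wayang.properties for a Postgres-plus-Spark setup might look like the sketch below; the keys follow Wayang's documented JDBC settings and every value is a placeholder, not the configuration actually used here:

```properties
# Postgres connection used by Wayang's JDBC operators (placeholder values).
wayang.postgres.jdbc.url = jdbc:postgresql://localhost:5432/tpch
wayang.postgres.jdbc.user = postgres
wayang.postgres.jdbc.password = postgres

# Spark settings are passed through to the Spark platform.
spark.master = local[4]
spark.driver.memory = 8g
```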
To run Spark, I use the following code:
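For illustration, a SparkSQL run of Q3 over the same Postgres tables might look like the following sketch; the JDBC options are placeholders and this is not the code actually used:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tpch-q3-sparksql").getOrCreate()

// Register the three TPC-H tables as views backed by Postgres via JDBC.
for (table <- Seq("customer", "orders", "lineitem")) {
  spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/tpch") // placeholder
    .option("dbtable", table)
    .option("user", "postgres")     // placeholder
    .option("password", "postgres") // placeholder
    .load()
    .createOrReplaceTempView(table)
}

// TPC-H Q3 with the standard substitution parameters.
val q3 = spark.sql("""
  SELECT l_orderkey,
         SUM(l_extendedprice * (1 - l_discount)) AS revenue,
         o_orderdate, o_shippriority
  FROM customer, orders, lineitem
  WHERE c_mktsegment = 'BUILDING'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < DATE '1995-03-15'
    AND l_shipdate  > DATE '1995-03-15'
  GROUP BY l_orderkey, o_orderdate, o_shippriority
  ORDER BY revenue DESC, o_orderdate
  LIMIT 10
""")
q3.show()
```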
Some investigation
The queries used to fetch data from Postgres are basically the same on both platforms (filter and projection pushdown are enabled).
I tried to print as much of the Spark execution logging as I could to see the difference between the two. One significant overhead I found is that Wayang produces much larger ShuffleMapTasks for the join than Spark does (~46,500,000 bytes vs. ~8,000 bytes); each task takes ~2 seconds to serialize (64 tasks in total, serialized one by one), which results in roughly 1 minute of overhead. On the other hand, the serialization time with SparkSQL is negligible.

I'm not very familiar with Spark execution, so I'm not sure why this is the case. Can anyone give me a pointer? Is there anything I'm missing, such as in the way I run the query or in the configuration? Thank you!