-
I have the following code, which sums boolean/short values.
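For reference, here is a minimal sketch of the kind of query under discussion: summing a short column and a boolean column in Spark. The session setup, column names, and data below are assumptions for illustration, not the original code.

```scala
// Minimal sketch (assumed names and data) of a query that sums short and boolean values.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder().master("local[*]").appName("sum-example").getOrCreate()
import spark.implicits._

// A DataFrame with a ShortType column and a BooleanType column.
val df = Seq((1.toShort, true), (2.toShort, false), (3.toShort, true))
  .toDF("short_col", "bool_col")

// Booleans have to be cast to a numeric type before summing; the short column can be
// summed directly. In both cases Spark produces a LongType result.
df.select(
  sum(col("short_col")).as("sum_shorts"),
  sum(col("bool_col").cast("int")).as("count_true")
).show()
```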
-
We are following exactly what Spark is doing in this case. I am not 100% sure why Spark does this, but I believe the cast to a long is there to avoid overflows. Note that in newer versions of Spark the cast is removed, but the SUM is still done as a long and the output is still a long. Again, I don't know exactly why this change was made in Spark, but it is a good thing for us: it helps reduce memory usage on the GPU in a case like yours. The number of rows in the output of an aggregation is generally smaller than the number of input rows, so if we don't have to cast the input byte to a long before doing the SUM, we use less memory overall in the query.
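A small sketch of the behavior described above, assuming nothing beyond stock Spark: summing a byte (or short) column yields a 64-bit LongType result regardless of the input width. The column name and data are made up for illustration.

```scala
// Sketch: the SUM of a small integral type is reported as a long.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.LongType

val spark = SparkSession.builder().master("local[*]").appName("sum-result-type").getOrCreate()
import spark.implicits._

val bytes = Seq(1.toByte, 2.toByte, 3.toByte).toDF("b")
val summed = bytes.agg(sum($"b").as("total"))

// The aggregation result is a long even though the input column is a byte.
assert(summed.schema("total").dataType == LongType)
summed.show() // prints 6
```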
-
@revans2 I guess reducing the amount of memory and ops, by handling shorts/bytes/ints rather than longs, helps the CPU as well.
-
Actually, the CPU UnsafeRow format stores everything smaller than 64 bits in a 64-bit memory slot, so it does not help with CPU memory utilization nearly as much as you would hope.
Apache Spark 3.2.0 and above stopped inserting the explicit cast before the sum.
We have seen a lot of performance improvements in Spark 3.2.x. I would recommend checking it out, mostly for things like DPP and AQE. The cast modification is really a minor change; I would not worry about it. I just wanted to give you as much info about your specific question as possible.
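If you want to see whether your particular Spark version inserts the explicit cast, one way (sketched below with an assumed column name) is to look at the plans that `explain` prints; older versions typically show something like `sum(cast(s as bigint))`, while 3.2.0+ drops the explicit cast even though the result is still a long. The exact plan text varies by version, so treat this as illustrative.

```scala
// Sketch: inspect the plans to see whether a cast to bigint is inserted before the sum.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("plan-check").getOrCreate()
import spark.implicits._

val df = Seq(1.toShort, 2.toShort, 3.toShort).toDF("s")
df.agg(sum($"s")).explain(true) // prints the parsed, analyzed, optimized, and physical plans
```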
-
Please re-open if you have any follow-up questions @eyalhir74
-
@revans2 I upgraded to Spark 3.2.1 and RAPIDS Accelerator 22.02.0 (using cudf 22.02.0).
-
@eyalhir74 I would not expect this query to beat the CPU. The data is tiny (100,000 shorts), which is under 200 KiB in cuDF and under about 1.5 MiB in the UnsafeRow format, so it is likely to all fit in the CPU cache. Also, the result is a single long, so at this point you are really just measuring the overhead of running a very small query, and for the GPU the time it takes to translate row-based data to columns. For me, with a 12-core CPU and an A6000 GPU, I saw 3151 ms cold and 1275 ms hot for the GPU, and 1187 ms cold and 1117 ms hot for the CPU.
But if we switch the input format to Parquet, I get 562 ms cold and 175 ms hot for the GPU, with 498 ms cold and 106 ms hot for the CPU. Now if we scale up the number of rows, you can see that at some point the GPU starts to win because of the large amount of data involved. But a simple, small reduction is not something the GPU is likely to beat the CPU at without a huge amount of data and really good I/O. The computation involved is not taxing for the CPU, so it is hard for the GPU to pay for the overhead of moving the data to the GPU. With Parquet the GPU is moving compressed data, which helps, but it is still not great until we get to much larger amounts of data. In fact, the time to do the SUM is so small that this is more of a Parquet decoder test than a test of aggregation speed. We are working to improve our Parquet decoder for cases like this, because we know it is not as good as it could be.
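For anyone who wants to reproduce this kind of scaling comparison, here is a rough sketch along the lines described above: generate N short values, write them as Parquet, and time the SUM at a few sizes. The paths, row counts, and timing helper are assumptions for illustration, not the exact benchmark used.

```scala
// Sketch of a scaling test: write short values as Parquet, then time SUM at several sizes.
// With the RAPIDS Accelerator enabled this exercises the GPU path; without it, the CPU path.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("scaling-test").getOrCreate()
import spark.implicits._

def timeIt[A](label: String)(f: => A): A = {
  val start = System.nanoTime()
  val result = f
  println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
  result
}

// The GPU generally only pulls ahead at much larger row counts.
for (rows <- Seq(100000L, 10000000L, 100000000L)) {
  val path = s"/tmp/shorts_$rows" // assumed scratch location
  spark.range(rows)
    .select(($"id" % 1000).cast("short").as("s"))
    .write.mode("overwrite").parquet(path)

  timeIt(s"sum over $rows rows") {
    spark.read.parquet(path).agg(sum($"s")).collect()
  }
}
```

Running each query twice (cold, then hot) roughly matches the cold/hot split in the numbers quoted above.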
-
@revans2 Thanks for a very detailed explanation! However, a small follow-up regarding the numbers you've posted.
-
My CPU is rather old and consumer grade; it is a 6-core, 12-thread Core i7-7800X. I was using 12 tasks in local mode for the test (a rough sketch of that kind of setup appears after this reply). The point of the test really was about scaling. The absolute numbers should not matter much, especially because this is a place where you should get very close to linear scaling with both the CPU and the GPU.
100% right for this case. That is kind of my point: some operations don't see a lot of speedup by going to the GPU, or need a massive amount of data to see any kind of speedup. This is one of those cases where the cost/benefit is just not there. Sadly, I don't see anywhere in our docs where we call out the operators we are good at and the ones we are less good at. The following is from a presentation we did a while ago; the first two things on the list are called out here. What we are not great at:
What we are great at:
There is also a talk at GTC where @viadea goes over some micro-benchmarks to give you a good idea of the types of things we are really good at. I don't really want to play games with making up benchmarks; I would rather work with you on your real queries. If they are not great, we can help debug what is going on and hopefully make them cost-effective. But this is the real world and not a marketing brochure: some queries will not be cost-effective in the short term, and possibly never will be, just because of how the hardware works. We want to be at a point where, overall, across all of your queries, we are good enough that you feel good just turning the plugin on. I think we are there for anything that does not use lists or maps heavily. We are working on lists and maps, but there is no ETA yet on when we will really be there.
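For completeness, here is a minimal sketch of a local-mode session roughly matching the setup described in this reply (12 tasks, RAPIDS Accelerator enabled). It assumes the rapids-4-spark jar is already on the classpath; the configs you actually need depend on your deployment, so treat it as a starting point rather than a recipe.

```scala
// Sketch: local-mode SparkSession with the RAPIDS Accelerator plugin enabled.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[12]")                                    // 12 concurrent tasks in local mode
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  // RAPIDS Accelerator plugin class
  .config("spark.rapids.sql.enabled", "true")             // allow SQL operations on the GPU
  .appName("gpu-vs-cpu-scaling")
  .getOrCreate()
```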
-
Thanks @revans2. As discussed with @viadea and @jlowe yesterday, we will prepare data and queries for our heaviest workloads and try together to figure out why they currently don't scale well for us on the GPU.