-
I have the following code, which sums boolean/short values.
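For reference, here is a minimal sketch of the kind of query under discussion: summing a short column and a boolean column in Spark. The session setup, column names, and data below are assumptions for illustration, not the original code.

```scala
// Minimal sketch (assumed names and data) of a query that sums short and boolean values.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder().master("local[*]").appName("sum-example").getOrCreate()
import spark.implicits._

// A DataFrame with a ShortType column and a BooleanType column.
val df = Seq((1.toShort, true), (2.toShort, false), (3.toShort, true))
  .toDF("short_col", "bool_col")

// Booleans have to be cast to a numeric type before summing; the short column can be
// summed directly. In both cases Spark produces a LongType result.
df.select(
  sum(col("short_col")).as("sum_shorts"),
  sum(col("bool_col").cast("int")).as("count_true")
).show()
```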
-
We are following exactly what Spark is doing in this case. I am not 100% sure why Spark does this, but I believe the cast to a long is there to avoid overflows. Note that in newer versions of Spark the cast is removed, but the SUM is still done as a long and the output is still a long. Again, I don't know exactly why this change was made in Spark, but it is a good thing for us: it helps reduce memory usage on the GPU in a case like yours. The number of rows in the output of an aggregation is generally smaller than the number of input rows, so if we don't have to cast the input byte to a long before doing the SUM, we use less memory overall in the query.
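A small sketch of the behavior described above, assuming nothing beyond stock Spark: summing a byte (or short) column yields a 64-bit LongType result regardless of the input width. The column name and data are made up for illustration.

```scala
// Sketch: the SUM of a small integral type is reported as a long.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.LongType

val spark = SparkSession.builder().master("local[*]").appName("sum-result-type").getOrCreate()
import spark.implicits._

val bytes = Seq(1.toByte, 2.toByte, 3.toByte).toDF("b")
val summed = bytes.agg(sum($"b").as("total"))

// The aggregation result is a long even though the input column is a byte.
assert(summed.schema("total").dataType == LongType)
summed.show() // prints 6
```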
-
@revans2 I guess reducing the amount of memory and ops, by handling shorts/bytes/ints rather than longs, helps the CPU as well.
-
Actually, the CPU UnsafeRow format stores everything smaller than 64 bits in a 64-bit memory slot, so it does not help with CPU memory utilization nearly as much as you would hope.
Apache Spark 3.2.0 and above stopped inserting the explicit cast before the sum.
We have seen a lot of performance improvements in Spark 3.2.x. I would recommend checking it out, mostly for things like DPP and AQE. The cast modification is really a minor change; I would not worry about it. I just wanted to give you as much info about your specific question as possible.
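If you want to see whether your particular Spark version inserts the explicit cast, one way (sketched below with an assumed column name) is to look at the plans that `explain` prints; older versions typically show something like `sum(cast(s as bigint))`, while 3.2.0+ drops the explicit cast even though the result is still a long. The exact plan text varies by version, so treat this as illustrative.

```scala
// Sketch: inspect the plans to see whether a cast to bigint is inserted before the sum.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("plan-check").getOrCreate()
import spark.implicits._

val df = Seq(1.toShort, 2.toShort, 3.toShort).toDF("s")
df.agg(sum($"s")).explain(true) // prints the parsed, analyzed, optimized, and physical plans
```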
-
Please re-open if you have any follow-up questions @eyalhir74
-
@revans2 I upgraded to Spark 3.2.1 and RAPIDS Accelerator 22.02.0 (using cudf 22.02.0).
-
@eyalhir74 I would not expect this query to beat the CPU. The data is tiny (100,000 shorts), which is under 200 KiB in cuDF and under about 1.5 MiB in the UnsafeRow format, so it is likely to all fit in the CPU cache. Also, the result is a single long, so at this point you are really just measuring the overhead of running a very small query, and for the GPU the time it takes to translate row-based data to columns. For me, with a 12-core CPU and an A6000 GPU, I saw 3151 ms cold and 1275 ms hot for the GPU, and 1187 ms cold and 1117 ms hot for the CPU.
But if we switch the input format to Parquet, I get 562 ms cold and 175 ms hot for the GPU, with 498 ms cold and 106 ms hot for the CPU. Now if we scale up the number of rows, you can see that at some point the GPU starts to win because of the large amount of data involved. But a simple, small reduction is not something the GPU is likely to beat the CPU at without a huge amount of data and really good I/O. The computation involved is not taxing for the CPU, so it is hard for the GPU to pay for the overhead of moving the data to the GPU. With Parquet the GPU is moving compressed data, which helps, but it is still not great until we get to much larger amounts of data. In fact, the time to do the SUM is so small that this is more of a Parquet decoder test than a test of aggregation speed. We are working to improve our Parquet decoder for cases like this, because we know it is not as good as it could be.
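For anyone who wants to reproduce this kind of scaling comparison, here is a rough sketch along the lines described above: generate N short values, write them as Parquet, and time the SUM at a few sizes. The paths, row counts, and timing helper are assumptions for illustration, not the exact benchmark used.

```scala
// Sketch of a scaling test: write short values as Parquet, then time SUM at several sizes.
// With the RAPIDS Accelerator enabled this exercises the GPU path; without it, the CPU path.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("scaling-test").getOrCreate()
import spark.implicits._

def timeIt[A](label: String)(f: => A): A = {
  val start = System.nanoTime()
  val result = f
  println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
  result
}

// The GPU generally only pulls ahead at much larger row counts.
for (rows <- Seq(100000L, 10000000L, 100000000L)) {
  val path = s"/tmp/shorts_$rows" // assumed scratch location
  spark.range(rows)
    .select(($"id" % 1000).cast("short").as("s"))
    .write.mode("overwrite").parquet(path)

  timeIt(s"sum over $rows rows") {
    spark.read.parquet(path).agg(sum($"s")).collect()
  }
}
```

Running each query twice (cold, then hot) roughly matches the cold/hot split in the numbers quoted above.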
-
@revans2 Thanks for a very detailed explanation! However, a small follow-up regarding the numbers you've posted.
-
My CPU is rather old and consumer grade; it is a 6-core, 12-thread Core i7-7800X. I was using 12 tasks in local mode for the test (a rough sketch of that kind of setup appears after this reply). The point of the test really was about scaling. The absolute numbers should not matter much, especially because this is a place where you should get very close to linear scaling with both the CPU and the GPU.
100% right for this case. That is kind of my point: some operations don't see a lot of speedup by going to the GPU, or need a massive amount of data to see any kind of speedup. This is one of those cases where the cost/benefit is just not there. Sadly, I don't see anywhere in our docs where we call out the operators we are good at and the ones we are less good at. The following is from a presentation we did a while ago; the first two things on the list are called out here. What we are not great at:
What we are great at:
There is also a talk at GTC where @viadea goes over some micro-benchmarks to give you a good idea of the types of things we are really good at. I don't really want to play games with making up benchmarks; I would rather work with you on your real queries. If they are not great, we can help debug what is going on and hopefully make them cost-effective. But this is the real world and not a marketing brochure: some queries will not be cost-effective in the short term, and possibly never will be, just because of how the hardware works. We want to be at a point where, overall, across all of your queries, we are good enough that you feel good just turning the plugin on. I think we are there for anything that does not use lists or maps heavily. We are working on lists and maps, but there is no ETA yet on when we will really be there.
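For completeness, here is a minimal sketch of a local-mode session roughly matching the setup described in this reply (12 tasks, RAPIDS Accelerator enabled). It assumes the rapids-4-spark jar is already on the classpath; the configs you actually need depend on your deployment, so treat it as a starting point rather than a recipe.

```scala
// Sketch: local-mode SparkSession with the RAPIDS Accelerator plugin enabled.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[12]")                                    // 12 concurrent tasks in local mode
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")  // RAPIDS Accelerator plugin class
  .config("spark.rapids.sql.enabled", "true")             // allow SQL operations on the GPU
  .appName("gpu-vs-cpu-scaling")
  .getOrCreate()
```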
-
Thanks @revans2. As discussed with @viadea and @jlowe yesterday, we will prepare data and queries for our heaviest workloads and try together to figure out why they currently don't scale well for us on the GPU.