Fix collection_ops_tests for Spark 4.0 [databricks] #11414

base: branch-24.12
Conversation
Fixes NVIDIA#11011.

This commit fixes the failures in `collection_ops_tests` on Spark 4.0.

On all versions of Spark, when a Sequence is collected with rows that exceed MAX_INT, an exception is thrown indicating that the collected Sequence/array is larger than permissible. The different versions of Spark vary in the contents of the exception message.

On Spark 4, the error message now contains more information than in all prior versions, including:

1. The name of the op causing the error
2. The errant sequence size

This commit introduces a shim to make this new information available in the exception.

Note that this shim does not fit cleanly in RapidsErrorUtils, because there are differences within major Spark versions. For instance, Spark 3.4.0-1 carry a different message compared to 3.4.2 and 3.4.3. Likewise for 3.5.0, 3.5.1, and 3.5.2.

Signed-off-by: MithunR <[email protected]>
Force-pushed from 5606503 to b8bd960.
```diff
@@ -17,6 +17,8 @@
 from asserts import assert_gpu_and_cpu_are_equal_collect, assert_gpu_and_cpu_error
 from data_gen import *
 from pyspark.sql.types import *
+
+from src.main.python.spark_session import is_before_spark_400
```
nit: To be consistent with other files, this should just be:

```diff
-from src.main.python.spark_session import is_before_spark_400
+from spark_session import is_before_spark_400
```
```diff
@@ -42,7 +42,12 @@ import com.nvidia.spark.rapids.Arm._
 import org.apache.spark.unsafe.array.ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH

 object GetSequenceSize {
-  val TOO_LONG_SEQUENCE = s"Too long sequence found. Should be <= $MAX_ROUNDED_ARRAY_LENGTH"
+  def TOO_LONG_SEQUENCE(sequenceLength: Int, functionName: String) = {
```
function name should be camelCase
```scala
object SequenceSizeError {
  def getTooLongSequenceErrorString(sequenceSize: Int, functionName: String): String = {
    QueryExecutionErrors.createArrayWithElementsExceedLimitError(functionName, sequenceSize)
```
We should move this to RapidsErrorUtils.

The way I would do it is to remove TOO_LONG_SEQUENCE from GetSequenceSize.scala altogether. Then introduce another trait, RapidsErrorUtilsForSequence, with a method tooLongSequenceError. The 320 versions will have their own implementation returning the hardcoded message "Too long...", 334-352 will have their own implementation returning "Unsuccessful try...", and 400 will have its own, returning QueryExecutionErrors.createArrayWithElementsExceedLimitError.
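A minimal sketch of that suggested shape, for concreteness (the trait and method names come from the comment above; the shim object names, the `.getMessage` extraction, and the exact message texts are illustrative assumptions, not the PR's actual code):

```scala
import org.apache.spark.sql.errors.QueryExecutionErrors
import org.apache.spark.unsafe.array.ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH

// The suggested trait, to be implemented differently per shim directory.
trait RapidsErrorUtilsForSequence {
  def tooLongSequenceError(sequenceSize: Int, functionName: String): String
}

// A 3.2.x-style shim: the hardcoded pre-3.3.4 message.
object RapidsErrorUtils320Sketch extends RapidsErrorUtilsForSequence {
  override def tooLongSequenceError(sequenceSize: Int, functionName: String): String =
    s"Too long sequence found. Should be <= $MAX_ROUNDED_ARRAY_LENGTH"
}

// A 4.0-style shim: delegate to Spark's own error builder, which now carries
// the function name and the errant sequence size. Assuming the builder
// returns an exception, its message is extracted here.
object RapidsErrorUtils400Sketch extends RapidsErrorUtilsForSequence {
  override def tooLongSequenceError(sequenceSize: Int, functionName: String): String =
    QueryExecutionErrors.createArrayWithElementsExceedLimitError(functionName, sequenceSize)
      .getMessage
}
```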
I think we're talking at cross purposes. Or maybe I'm simply unable to grok the suggestion yet.

> The 320 versions will have their own implementation returning the hardcoded message "Too long...", 334-352 will have their own implementation

This largely describes what my patch currently does, save for moving it into RapidsErrorUtils.

The reason (I think) I can't move this into RapidsErrorUtils is that the error messages differ within the same major Spark version.
The error messages are split as follows:

- "Too long sequence found": 3.2.*, 3.3.[x<4], 3.4.[x<2], 3.5.0
- "Unsuccessful try to create array": 3.3.4, 3.4.[2-3], 3.5.[1-2]
- "Can't create array with...": only 4.0
The RapidsErrorUtils shims are grouped by version, as follows:

- All 3.2.* together
- All 3.3.* except 3.3.*db
- All 3.3.[0,2]db
- 3.4.[0-3] + 3.5.[0-2] + 4.0.0
Now, if I'm trying to accommodate the correct error message for, say, Spark 3.4.2, I can't, because there's only one RapidsErrorUtils for all of Spark 3.4.x (and that class happens to affect all 3.5.x as well as 4.0).

Is the suggestion to further slice up 3.4.*'s RapidsErrorUtils? We would then have to also slice up the same for 3.3.x and 3.5.x, with code duplicated everywhere. This doesn't sound productive to me.

Maybe I've missed something. Perhaps we should discuss this offline, and update the result on this bug.
This refactor is turning into a rats' nest. When the next shim needs to be added, and things need to be split further, I think it's going to be unreadable.

RapidsErrorUtilsBase, used in 33xdb, is a very misleading name for the errors shim. It seems to apply only in 33x, while its name suggests that it's the base class for all the RapidsErrorUtils. This is painful.

I'm going to try to add this with as little collateral damage as I can.
This moves the construction of the long-sequence error strings into RapidsErrorUtils. The process involved introducing several new RapidsErrorUtils classes, and mixing in concrete implementations for the error-string construction.
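Roughly, the mix-in arrangement might look like the following sketch (trait and object names, and the abbreviated message wording, are illustrative assumptions rather than the merged code):

```scala
import org.apache.spark.unsafe.array.ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH

// One concrete trait per message flavor (wording approximate):
trait SequenceSizeTooLongErrorBuilder {
  // The 3.2.* / 3.3.[x<4] / 3.4.[x<2] / 3.5.0 flavor.
  def tooLongSequenceError(sequenceSize: Int, functionName: String): String =
    s"Too long sequence found. Should be <= $MAX_ROUNDED_ARRAY_LENGTH"
}

trait SequenceSizeExceededLimitErrorBuilder {
  // The 3.3.4 / 3.4.[2-3] / 3.5.[1-2] flavor.
  def tooLongSequenceError(sequenceSize: Int, functionName: String): String =
    s"Unsuccessful try to create array with $sequenceSize elements " +
      s"due to exceeding the array size limit $MAX_ROUNDED_ARRAY_LENGTH."
}

// Each per-version RapidsErrorUtils then mixes in the matching flavor, e.g.:
object RapidsErrorUtils342Sketch extends SequenceSizeExceededLimitErrorBuilder
```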
Apologies for the noise. I had to rebase this to target branch-24.12. @razajafri is already examining this change. The others can ignore this.