Fix collection_ops_tests for Spark 4.0 [databricks] #11414

Open
wants to merge 6 commits into base: branch-24.12

Conversation

mythrocks
Collaborator

Fixes #11011.

This commit fixes the failures in collection_ops_tests on Spark 4.0.

On all versions of Spark, when a Sequence with more than MAX_INT rows is collected,
an exception is thrown indicating that the collected Sequence/array is
larger than permissible. The versions of Spark differ in the contents of the
exception message.

On Spark 4, the error message now contains more information than in all prior
versions, including:

  1. The name of the op causing the error.
  2. The errant sequence size.

This commit introduces a shim to make this new information available in
the exception.

Note that this shim does not fit cleanly in RapidsErrorUtils, because
there are differences within major Spark versions. For instance, Spark
3.4.0 and 3.4.1 have a different message from 3.4.2 and 3.4.3.
Likewise, the message differs among 3.5.0, 3.5.1, and 3.5.2.
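
For illustration, the shim amounts to turning the previously fixed error string into a function of those two pieces of information. A minimal sketch, assuming a hypothetical object name (the real change is in the diffs reviewed below):

import org.apache.spark.unsafe.array.ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH

// Hypothetical shape of the shimmed helper. Pre-4.0 builds can ignore both
// arguments and keep returning the old fixed text; the Spark 4.0 build can
// fold the function name and the errant size into the message instead.
object GetSequenceSizeErrorSketch {
  def tooLongSequenceError(sequenceLength: Int, functionName: String): String =
    s"Too long sequence found. Should be <= $MAX_ROUNDED_ARRAY_LENGTH"
}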

@mythrocks mythrocks self-assigned this Aug 30, 2024
@mythrocks mythrocks added the Spark 4.0+ label Aug 30, 2024
@mythrocks
Collaborator Author

Build

@@ -17,6 +17,8 @@
 from asserts import assert_gpu_and_cpu_are_equal_collect, assert_gpu_and_cpu_error
 from data_gen import *
 from pyspark.sql.types import *
+
+from src.main.python.spark_session import is_before_spark_400
Collaborator

nit: To be consistent with other files, this should just be

Suggested change
from src.main.python.spark_session import is_before_spark_400
from spark_session import is_before_spark_400

@@ -42,7 +42,12 @@ import com.nvidia.spark.rapids.Arm._
 import org.apache.spark.unsafe.array.ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH

 object GetSequenceSize {
-  val TOO_LONG_SEQUENCE = s"Too long sequence found. Should be <= $MAX_ROUNDED_ARRAY_LENGTH"
+  def TOO_LONG_SEQUENCE(sequenceLength: Int, functionName: String) = {
Collaborator

function name should be camelCase


object SequenceSizeError {
  def getTooLongSequenceErrorString(sequenceSize: Int, functionName: String): String = {
    // Surface the message from Spark's error builder, which includes the function name and size.
    QueryExecutionErrors.createArrayWithElementsExceedLimitError(functionName, sequenceSize).getMessage
  }
}
Collaborator

We should move this to RapidsErrorUtils.

The way I would do it is to remove TOO_LONG_SEQUENCE from GetSequenceSize.scala altogether.

Then introduce another trait, RapidsErrorUtilsForSequence, with a method tooLongSequenceError. The 320 versions would have their own implementation returning the hardcoded "Too long..." message, 334-352 would have their own implementation returning "Unsuccessful try...", and 400 would have its own that returns QueryExecutionErrors.createArrayWithElementsExceedLimitError.
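
A rough sketch of that proposal might look like the following. Everything here is illustrative: the trait and shim names, the version groupings, and the use of .getMessage are assumptions drawn from this discussion, not the code in this PR.

import org.apache.spark.sql.errors.QueryExecutionErrors
import org.apache.spark.unsafe.array.ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH

// Hypothetical trait, per the suggestion above: each shim supplies its own message.
trait RapidsErrorUtilsForSequence {
  def tooLongSequenceError(sequenceSize: Int, functionName: String): String
}

// 320-style shims: fixed text that ignores both arguments.
trait RapidsErrorUtilsForSequence320 extends RapidsErrorUtilsForSequence {
  override def tooLongSequenceError(sequenceSize: Int, functionName: String): String =
    s"Too long sequence found. Should be <= $MAX_ROUNDED_ARRAY_LENGTH"
}

// 400-style shim: delegate to Spark's error builder, whose message includes
// the function name and the offending sequence size.
trait RapidsErrorUtilsForSequence400 extends RapidsErrorUtilsForSequence {
  override def tooLongSequenceError(sequenceSize: Int, functionName: String): String =
    QueryExecutionErrors.createArrayWithElementsExceedLimitError(functionName, sequenceSize)
      .getMessage
}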

Collaborator Author

I think we're talking at cross purposes. Or maybe I'm simply unable to grok the suggestion yet.

> The 320 versions would have their own implementation returning the hardcoded "Too long..." message, 334-352 would have their own implementation

This largely describes what my patch currently does, save for moving it into RapidsErrorUtils.

The reason (I think) I can't move this into RapidsErrorUtils is that the error messages differ within the same major Spark version.

The error messages are split as follows:

  1. Too long sequence found:
    1. 3.2.*
    2. 3.3.[x<4]
    3. 3.4.[x<2]
    4. 3.5.0
  2. Unsuccessful try to create array:
    1. 3.3.4
    2. 3.4.[2-3]
    3. 3.5.[1-2]
  3. Can't create array with...:
    1. Only 4.0.

The RapidsErrorUtils shims are grouped by version, as follows:

  1. All 3.2.* together.
  2. All 3.3.* except 3.3.*db
  3. All 3.3.[0,2]db
  4. 3.4.[0-3] + 3.5.[0-2] + 4.0.0.

Now if I'm trying to accommodate the correct error message for, say, Spark 3.4.2, I can't because there's only one RapidsErrorUtils for all of Spark 3.4.x (and that class happens to affect all 3.5.x as well as 4.0).

Is the suggestion to slice up 3.4.*'s RapidsErrorUtils further? We would then have to slice up 3.3.x and 3.5.x the same way as well, with code duplicated everywhere. This doesn't sound productive to me.

Maybe I've missed something. Perhaps we should discuss this offline and record the outcome on this bug.

Collaborator Author

This refactor is turning into a rat's nest. When the next shim needs to be added and things need to be split further, I think it's going to be unreadable.

RapidsErrorUtilsBase, used in 33xdb, is a very misleading name for the errors shim. It seems to apply only to 33x, while its name suggests that it's the base class for all the RapidsErrorUtils shims. This is painful.

I'm going to try to add this with as little collateral damage as I can.

This moves the construction of the long-sequence error strings into
RapidsErrorUtils.  The process involved introducing many new RapidsErrorUtils
classes, and using mix-ins of concrete implementations for the error-string
construction.
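
As a rough illustration of the mix-in arrangement described in that commit message (the trait name and the single message text shown here are assumptions, not the actual shim layout):

import org.apache.spark.unsafe.array.ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH

// Hypothetical mix-in: one concrete implementation of the error-string
// construction, shared by every Spark version that produces this message text.
trait SequenceSizeTooLongErrorBuilder {
  def getTooLongSequenceErrorString(sequenceSize: Int, functionName: String): String =
    s"Too long sequence found. Should be <= $MAX_ROUNDED_ARRAY_LENGTH"
}

// Each per-version RapidsErrorUtils object then mixes in whichever builder
// matches its Spark release, instead of duplicating the string construction.
object RapidsErrorUtils extends SequenceSizeTooLongErrorBuilder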
@mythrocks mythrocks changed the base branch from branch-24.10 to branch-24.12 September 28, 2024 02:02
@mythrocks
Collaborator Author

Apologies for the noise. I had to rebase this to target branch-24.12, which then caused a lot of new reviewers to be added.

@razajafri is already examining this change. The others can ignore this.

@mythrocks
Collaborator Author

Build

@mythrocks mythrocks changed the title from "Fix collection_ops_tests for Spark 4.0" to "Fix collection_ops_tests for Spark 4.0 [databricks]" Sep 28, 2024
@mythrocks
Collaborator Author

Build

Labels
Spark 4.0+ (Spark 4.0+ issues)
Development

Successfully merging this pull request may close these issues.

Fix tests failures in collection_ops_test.py
2 participants