Drop spark31x shims [databricks] (#11159)
* Remove spark31x JSON lines and shim files

1. Remove the spark31x JSON lines from the source code

2. Remove the files that exist only for the spark31x shims

3. Move the files shared by the spark31x and spark32x+ shims into the sql-plugin/src/main/spark320 folder

Signed-off-by: Tim Liu <[email protected]>

* Drop spark31x shims in the build scripts and pom files

Signed-off-by: Tim Liu <[email protected]>

* Restore the accidentally deleted file: OrcStatisticShim.scala

    tests/src/test/spark311/scala/com/nvidia/spark/rapids/shims/OrcStatisticShim.scala
     -->
    tests/src/test/spark321cdh/scala/com/nvidia/spark/rapids/shims/OrcStatisticShim.scala

Check if we can merge this file into:

    tests/src/test/spark320/scala/com/nvidia/spark/rapids/shims/OrcStatisticShim.scala
Signed-off-by: Tim Liu <[email protected]>

* Update Copyright to 2024

Signed-off-by: Tim Liu <[email protected]>

* Remove the 31x references in ShimLoader.scala according to the review comments

Signed-off-by: Tim Liu <[email protected]>

* Update the file scala2.13/pom.xml

Signed-off-by: Tim Liu <[email protected]>

* Drop 3.1.x shims in docs, source code and build scripts

    Change the default shim from spark311 to spark320
    in the docs, source code and build scripts

Signed-off-by: Tim Liu <[email protected]>

* Update the docs for dropping the 31x shims

Signed-off-by: Tim Liu <[email protected]>

* Clean up unused and duplicated 'org/roaringbitmap' folder

To fix: #11175

Clean up the unused and duplicated 'org/roaringbitmap' classes in the spark320 shim folder to work around the JaCoCo error 'different class with same name', after we drop the 31x shims and change the default shim to spark320

Signed-off-by: Tim Liu <[email protected]>
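A quick way to double-check the cleanup (a sketch only, assuming a locally built dist jar at the path used by `build/coverage-report`; the jar name and internal layout are assumptions):

```bash
# Illustrative check: report any class path that appears under more than one
# top-level directory (e.g. spark-shared/ and spark320/) inside the dist jar,
# which is the "different class with same name" situation JaCoCo rejects.
jar tf dist/target/rapids-4-spark_2.12-*cuda*.jar \
  | grep 'org/roaringbitmap/.*\.class' \
  | sed 's|^[^/]*/||' \
  | sort | uniq -d
```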

---------

Signed-off-by: Tim Liu <[email protected]>
NvTimLiu authored Jul 12, 2024
1 parent 451463f commit be34c6a
Showing 182 changed files with 3,364 additions and 6,271 deletions.
21 changes: 9 additions & 12 deletions CONTRIBUTING.md
@@ -50,11 +50,11 @@ mvn verify

After a successful build, the RAPIDS Accelerator jar will be in the `dist/target/` directory.
This will build the plugin for a single version of Spark. By default, this is Apache Spark
3.1.1. To build against other versions of Spark you use the `-Dbuildver=XXX` command line option
to Maven. For instance to build Spark 3.1.1 you would use:
3.2.0. To build against other versions of Spark you use the `-Dbuildver=XXX` command line option
to Maven. For instance to build Spark 3.2.0 you would use:

```shell script
mvn -Dbuildver=311 verify
mvn -Dbuildver=320 verify
```
You can find all available build versions in the top level pom.xml file. If you are building
for Databricks then you should use the `jenkins/databricks/build.sh` script and modify it for
@@ -110,17 +110,14 @@ If you want to create a jar with multiple versions we have the following options
3. Build for all Apache Spark versions, CDH and Databricks with no SNAPSHOT versions of Spark, only released. Use `-PnoSnaphsotsWithDatabricks`.
4. Build for all Apache Spark versions, CDH and Databricks including SNAPSHOT versions of Spark we have supported for. Use `-PsnapshotsWithDatabricks`
5. Build for an arbitrary combination of comma-separated build versions using `-Dincluded_buildvers=<CSV list of build versions>`.
E.g., `-Dincluded_buildvers=312,330`
E.g., `-Dincluded_buildvers=320,330`

You must first build each of the versions of Spark and then build one final time using the profile for the option you want.

You can also install some manually and build a combined jar. For instance to build non-snapshot versions:

```shell script
mvn clean
mvn -Dbuildver=311 install -Drat.skip=true -DskipTests
mvn -Dbuildver=312 install -Drat.skip=true -DskipTests
mvn -Dbuildver=313 install -Drat.skip=true -DskipTests
mvn -Dbuildver=320 install -Drat.skip=true -DskipTests
mvn -Dbuildver=321 install -Drat.skip=true -DskipTests
mvn -Dbuildver=321cdh install -Drat.skip=true -DskipTests
@@ -150,9 +147,9 @@ There is a build script `build/buildall` that automates the local build process.
By default, it builds everything that is needed to create a distribution jar for all released (noSnapshots) Spark versions except for Databricks. Other profiles that you can pass using `--profile=<distribution profile>` include
- `snapshots` that includes all released (noSnapshots) and snapshots Spark versions except for Databricks
- `minimumFeatureVersionMix` that currently includes 321cdh, 312, 320, 330 is recommended for catching incompatibilities already in the local development cycle
- `minimumFeatureVersionMix` that currently includes 321cdh, 320, 330 is recommended for catching incompatibilities already in the local development cycle
For initial quick iterations we can use `--profile=<buildver>` to build a single-shim version. e.g., `--profile=311` for Spark 3.1.1.
For initial quick iterations we can use `--profile=<buildver>` to build a single-shim version. e.g., `--profile=320` for Spark 3.2.0.
The option `--module=<module>` allows to limit the number of build steps. When iterating, we often don't have the need for the entire build. We may be interested in building everything necessary just to run integration tests (`--module=integration_tests`), or we may want to just rebuild the distribution jar (`--module=dist`)
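For example, a single hedged invocation combining the two flags above (the shim list and module are illustrative):

```bash
# Build only what is needed to run integration tests against the
# spark320 and spark330 shims (profile and module values are examples).
./build/buildall --profile=320,330 --module=integration_tests
```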
@@ -201,7 +198,7 @@ NOTE: Build process does not require an ARM machine, so if you want to build the
on X86 machine, please also add `-DskipTests` in commands.
```bash
mvn clean verify -Dbuildver=311 -Parm64
mvn clean verify -Dbuildver=320 -Parm64
```
### Iterative development during local testing
@@ -377,7 +374,7 @@ the symlink `.bloop` to point to the corresponding directory `.bloop-spark3XY`
Example usage:
```Bash
./build/buildall --generate-bloop --profile=311,330
./build/buildall --generate-bloop --profile=320,330
rm -vf .bloop
ln -s .bloop-spark330 .bloop
```
@@ -414,7 +411,7 @@ Install [Scala Metals extension](https://scalameta.org/metals/docs/editors/vscod
either locally or into a Remote-SSH extension destination depending on your target environment.
When your project folder is open in VS Code, it may prompt you to import Maven project.
IMPORTANT: always decline with "Don't ask again", otherwise it will overwrite the Bloop projects
generated with the default `311` profile. If you need to use a different profile, always rerun the
generated with the default `320` profile. If you need to use a different profile, always rerun the
command above manually. When regenerating projects it's recommended to proceed to Metals
"Build commands" View, and click:
1. "Restart build server"
70 changes: 1 addition & 69 deletions aggregator/pom.xml
@@ -252,79 +252,11 @@

<profiles>
<profile>
<id>release311</id>
<id>release320</id>
<activation>
<!-- #if scala-2.12 -->
<activeByDefault>true</activeByDefault>
<!-- #endif scala-2.12 -->
<property>
<name>buildver</name>
<value>311</value>
</property>
</activation>
<dependencies>
<dependency>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-delta-stub_${scala.binary.version}</artifactId>
<version>${project.version}</version>
<classifier>${spark.version.classifier}</classifier>
</dependency>
</dependencies>
</profile>
<profile>
<id>release312</id>
<activation>
<property>
<name>buildver</name>
<value>312</value>
</property>
</activation>
<dependencies>
<dependency>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-delta-stub_${scala.binary.version}</artifactId>
<version>${project.version}</version>
<classifier>${spark.version.classifier}</classifier>
</dependency>
</dependencies>
</profile>
<profile>
<id>release313</id>
<activation>
<property>
<name>buildver</name>
<value>313</value>
</property>
</activation>
<dependencies>
<dependency>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-delta-stub_${scala.binary.version}</artifactId>
<version>${project.version}</version>
<classifier>${spark.version.classifier}</classifier>
</dependency>
</dependencies>
</profile>
<profile>
<id>release314</id>
<activation>
<property>
<name>buildver</name>
<value>314</value>
</property>
</activation>
<dependencies>
<dependency>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-delta-stub_${scala.binary.version}</artifactId>
<version>${project.version}</version>
<classifier>${spark.version.classifier}</classifier>
</dependency>
</dependencies>
</profile>
<profile>
<id>release320</id>
<activation>
<property>
<name>buildver</name>
<value>320</value>
2 changes: 1 addition & 1 deletion api_validation/README.md
@@ -21,7 +21,7 @@ cd api_validation
sh auditAllVersions.sh
// To run script on particular version we can use profile
mvn scala:run -P spark311
mvn scala:run -P spark320
```

# Output
4 changes: 2 additions & 2 deletions api_validation/auditAllVersions.sh
@@ -1,5 +1,5 @@
#!/bin/bash
# Copyright (c) 2020-2022, NVIDIA CORPORATION.
# Copyright (c) 2020-2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -14,4 +14,4 @@
# limitations under the License.
set -ex

mvn scala:run -P spark311
mvn scala:run -P spark320
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020-2023, NVIDIA CORPORATION.
* Copyright (c) 2020-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -69,7 +69,7 @@ object ApiValidation extends Logging {
val gpuKeys = gpuExecs.keys
var printNewline = false

val sparkToShimMap = Map("3.1.1" -> "spark311")
val sparkToShimMap = Map("3.2.0" -> "spark320")
val sparkVersion = ShimLoader.getShimVersion.toString
val shimVersion = sparkToShimMap(sparkVersion)

6 changes: 3 additions & 3 deletions build/buildall
@@ -274,8 +274,8 @@ export -f build_single_shim
# Install all the versions for DIST_PROFILE

# First build the aggregator module for all SPARK_SHIM_VERSIONS in parallel skipping expensive plugins that
# - either deferred to 311 because the check is identical in all shim profiles such as scalastyle
# - or deferred to 311 because we currently don't require it per shim such as scaladoc generation
# - either deferred to 320 because the check is identical in all shim profiles such as scalastyle
# - or deferred to 320 because we currently don't require it per shim such as scaladoc generation
# - or there is a dedicated step to run against a particular shim jar such as unit tests, in
# the near future we will run unit tests against a combined multi-shim jar to catch classloading
# regressions even before pytest-based integration_tests
@@ -296,7 +296,7 @@ time (
fi
# This used to resume from dist. However, without including aggregator in the build
# the build does not properly initialize spark.version property via buildver profiles
# in the root pom, and we get a missing spark311 dependency even for --profile=312,321
# in the root pom, and we get a missing spark320 dependency even for --profile=320,321
# where the build does not require it. Moving it to aggregator resolves this issue with
# a negligible increase of the build time by ~2 seconds.
joinShimBuildFrom="aggregator"
2 changes: 1 addition & 1 deletion build/coverage-report
@@ -23,7 +23,7 @@ TMP_CLASS=${TEMP_CLASS_LOC:-"./target/jacoco_classes/"}
HTML_LOC=${HTML_LOCATION:="./target/jacoco-report/"}
XML_LOC=${XML_LOCATION:="${HTML_LOC}"}
DIST_JAR=${RAPIDS_DIST_JAR:-$(ls ./dist/target/rapids-4-spark_2.12-*cuda*.jar | grep -v test | head -1 | xargs readlink -f)}
SPK_VER=${JACOCO_SPARK_VER:-"311"}
SPK_VER=${JACOCO_SPARK_VER:-"320"}
UDF_JAR=${RAPIDS_UDF_JAR:-$(ls ./udf-compiler/target/spark${SPK_VER}/rapids-4-spark-udf_2.12-*-SNAPSHOT-spark${SPK_VER}.jar | grep -v test | head -1 | xargs readlink -f)}
SOURCE_DIRS=${SOURCE_DIRS:-"./sql-plugin/src/main/scala/:./sql-plugin/src/main/java/:./shuffle-plugin/src/main/scala/:./udf-compiler/src/main/scala/"}

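A minimal usage sketch for the script above (assuming the dist and UDF jars were already built with `-Dbuildver=320`; the explicit override is optional now that 320 is the default):

```bash
# Illustrative: generate the JaCoCo coverage report against the spark320 shim
JACOCO_SPARK_VER=320 ./build/coverage-report
```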
6 changes: 3 additions & 3 deletions build/make-scala-version-build-files.sh
@@ -1,6 +1,6 @@
#!/usr/bin/env bash
#
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2023-2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -20,7 +20,7 @@ set -e

VALID_VERSIONS=( 2.13 )
declare -A DEFAULT_SPARK
DEFAULT_SPARK[2.12]="spark311"
DEFAULT_SPARK[2.12]="spark320"
DEFAULT_SPARK[2.13]="spark330"

usage() {
@@ -94,4 +94,4 @@ sed_i '/<spark\-rapids\-jni\.version>/,/<scala\.binary\.version>[0-9]*\.[0-9]*</
# Match any scala version to ensure idempotency
SCALA_VERSION=$(mvn help:evaluate -Pscala-${TO_VERSION} -Dexpression=scala.version -q -DforceStdout)
sed_i '/<spark\-rapids\-jni\.version>/,/<scala.version>[0-9]*\.[0-9]*\.[0-9]*</s/<scala\.version>[0-9]*\.[0-9]*\.[0-9]*</<scala.version>'$SCALA_VERSION'</' \
"$TO_DIR/pom.xml"
"$TO_DIR/pom.xml"
8 changes: 4 additions & 4 deletions build/shimplify.py
@@ -1,4 +1,4 @@
# Copyright (c) 2023, NVIDIA CORPORATION.
# Copyright (c) 2023-2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -67,15 +67,15 @@
Each line is assumed to be a JSON to keep it extensible.
/*** spark-rapids-shim-json-lines
{"spark": "312"}
{"spark": "320"}
{"spark": "323"}
spark-rapids-shim-json-lines ***/
The canonical location of a source file shared by multiple shims is
src/main/<top_buildver_in_the_comment>
You can find all shim files for a particular shim, e.g. 312, easily by executing:
git grep '{"spark": "312"}' '*.java' '*.scala'
You can find all shim files for a particular shim, e.g. 320, easily by executing:
git grep '{"spark": "320"}' '*.java' '*.scala'
"""

import errno

This file was deleted.

10 changes: 5 additions & 5 deletions dist/README.md
@@ -17,21 +17,21 @@ Files are: `com.nvidia.spark.rapids.SparkShimServiceProvider.sparkNonSnapshot`,

The new uber jar is structured like:

1. Base common classes are user visible classes. For these we use Spark 3.1.1 versions because they are assumed to be
1. Base common classes are user visible classes. For these we use Spark 3.2.0 versions because they are assumed to be
bitwise-identical to the other shims, this assumption is subject to the future automatic validation.
2. META-INF/services. This is a file that has to list all the shim versions supported by this jar.
The files talked about above for each profile are put into place here for uber jars. Although we currently do not use
[ServiceLoader API](https://docs.oracle.com/javase/8/docs/api/java/util/ServiceLoader.html) we use the same service
provider discovery mechanism
3. META-INF base files are from 3.1.1 - maven, LICENSE, NOTICE, etc
3. META-INF base files are from 3.2.0 - maven, LICENSE, NOTICE, etc
4. Spark specific directory (aka Parallel World in the jargon of
[ParallelWorldClassloader](https://github.com/openjdk/jdk/blob/jdk8-b120/jaxws/src/share/jaxws_classes/com/sun/istack/internal/tools/ParallelWorldClassLoader.java))
for each version of Spark supported in the jar, i.e., spark311/, spark312/, spark320/, etc.
for each version of Spark supported in the jar, i.e., spark320/, spark330/, spark341/, etc.

If you have to change the contents of the uber jar the following files control what goes into the base jar as classes that are not shaded.

1. `unshimmed-common-from-spark311.txt` - This has classes and files that should go into the base jar with their normal
1. `unshimmed-common-from-spark320.txt` - This has classes and files that should go into the base jar with their normal
package name (not shaded). This includes user visible classes (i.e., com/nvidia/spark/SQLPlugin), python files,
and other files that aren't version specific. Uses Spark 3.1.1 built jar for these base classes as explained above.
and other files that aren't version specific. Uses Spark 3.2.0 built jar for these base classes as explained above.
2. `unshimmed-from-each-spark3xx.txt` - This is applied to all the individual Spark specific version jars to pull
any files that need to go into the base of the jar and not into the Spark specific directory.
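For illustration, one way to see that parallel-world layout in a built dist jar (a sketch; the jar path is an assumption from a local build):

```bash
# List the top-level per-Spark directories inside a built dist jar
jar tf dist/target/rapids-4-spark_2.12-*cuda*.jar \
  | grep -oE '^spark[0-9]+[a-z]*/' \
  | sort -u
```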
8 changes: 4 additions & 4 deletions dist/build/package-parallel-worlds.py
@@ -1,4 +1,4 @@
# Copyright (c) 2023, NVIDIA CORPORATION.
# Copyright (c) 2023-2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -73,8 +73,8 @@ def shell_exec(shell_cmd):
shell_exec(mvn_cmd)

dist_dir = os.sep.join([source_basedir, 'dist'])
with open(os.sep.join([dist_dir, 'unshimmed-common-from-spark311.txt']), 'r') as f:
from_spark311 = f.read().splitlines()
with open(os.sep.join([dist_dir, 'unshimmed-common-from-spark320.txt']), 'r') as f:
from_spark320 = f.read().splitlines()
with open(os.sep.join([dist_dir, 'unshimmed-from-each-spark3xx.txt']), 'r') as f:
from_each = f.read().splitlines()
with zipfile.ZipFile(os.sep.join([deps_dir, art_jar]), 'r') as zip_handle:
@@ -88,7 +88,7 @@ def shell_exec(shell_cmd):
# TODO deprecate
namelist = zip_handle.namelist()
matching_members = []
glob_list = from_spark311 + from_each if bv == buildver_list[0] else from_each
glob_list = from_spark320 + from_each if bv == buildver_list[0] else from_each
for pat in glob_list:
new_matches = fnmatch.filter(namelist, pat)
matching_members += new_matches
2 changes: 1 addition & 1 deletion dist/maven-antrun/build-parallel-worlds.xml
@@ -132,7 +132,7 @@
<!-- Remove the explicily unshimmed files from the common directory -->
<delete>
<fileset dir="${project.build.directory}/parallel-world/spark-shared"
includesfile="${spark.rapids.source.basedir}/${rapids.module}/unshimmed-common-from-spark311.txt"/>
includesfile="${spark.rapids.source.basedir}/${rapids.module}/unshimmed-common-from-spark320.txt"/>
</delete>
</target>
<target name="remove-dependencies-from-pom" depends="build-parallel-worlds">
3 changes: 1 addition & 2 deletions dist/pom.xml
@@ -145,7 +145,6 @@
<id>minimumFeatureVersionMix</id>
<properties>
<included_buildvers>
312,
320,
321cdh,
330,
@@ -389,7 +388,7 @@ self.log("... OK")
<target>
<taskdef resource="net/sf/antcontrib/antcontrib.properties"/>
<ac:if xmlns:ac="antlib:net.sf.antcontrib">
<equals arg1="spark311" arg2="${spark.version.classifier}"/>
<equals arg1="spark320" arg2="${spark.version.classifier}"/>
<ac:then>
<java classname="com.nvidia.spark.rapids.RapidsConf" failonerror="true">
<arg value="${project.basedir}/../docs/configs.md"/>