[DOC] updates gh-pages files for 24.02.0 release [skip ci] (#10447)
* gh pages update

Signed-off-by: Suraj Aralihalli <[email protected]>

* fix broken table in github.io; fix broken link

Signed-off-by: Suraj Aralihalli <[email protected]>

---------

Signed-off-by: Suraj Aralihalli <[email protected]>
SurajAralihalli authored Mar 12, 2024
1 parent 8c11690 commit e5438b6
Showing 12 changed files with 1,235 additions and 668 deletions.
19 changes: 14 additions & 5 deletions docs/additional-functionality/advanced_configs.md

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions docs/additional-functionality/rapids-udfs.md
@@ -11,8 +11,8 @@ implementation alongside the CPU implementation, enabling the
RAPIDS Accelerator to perform the user-defined operation on the GPU.

Note that there are other potential solutions to performing user-defined
operations on the GPU. See the
[Frequently Asked Questions entry](../FAQ.md#how-can-i-run-custom-expressionsudfs-on-the-gpu)
operations on the GPU. See the
[Frequently Asked Questions entry](https://docs.nvidia.com/spark-rapids/user-guide/latest/faq.html#how-can-i-run-custom-expressions-udfs-on-the-gpu)
on UDFs for more details.

## UDF Obstacles To Query Acceleration
@@ -52,7 +52,7 @@ Other forms of Spark UDFs are not supported, such as:

For supported UDFs, the RAPIDS Accelerator will detect a GPU implementation
if the UDF class implements the
[RapidsUDF](../../sql-plugin/src/main/java/com/nvidia/spark/RapidsUDF.java)
[RapidsUDF](../../sql-plugin-api/src/main/java/com/nvidia/spark/RapidsUDF.java)
interface. Unlike the CPU UDF which processes data one row at a time, the
GPU version processes a columnar batch of rows. This reduces invocation
overhead and enables parallel processing of the data by the GPU.
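
Below is a minimal, hypothetical sketch of what such a UDF can look like. The class name, the
column arithmetic, and the exact `evaluateColumnar` signature (shown here as the variant that
receives the row count) are assumptions for illustration; check them against the `RapidsUDF`
interface shipped with your plugin version.

```scala
import ai.rapids.cudf.{ColumnVector, Scalar}
import com.nvidia.spark.RapidsUDF

// Hypothetical sketch: a UDF with both a row-at-a-time CPU path and a
// columnar GPU path. The evaluateColumnar signature is assumed and may
// differ between plugin versions.
class PlusOne extends Function1[Int, Int] with RapidsUDF with Serializable {
  // CPU path: called once per row when the GPU path is not used.
  override def apply(x: Int): Int = x + 1

  // GPU path: called once per columnar batch of rows.
  override def evaluateColumnar(numRows: Int, args: ColumnVector*): ColumnVector = {
    require(args.length == 1, s"Expected one argument, got ${args.length}")
    val one = Scalar.fromInt(1)
    try {
      args(0).add(one) // element-wise add executed on the device
    } finally {
      one.close()
    }
  }
}
```

Registered with something like `spark.udf.register("plus_one", new PlusOne())`, the plugin can
then choose the columnar path when the expression runs on the GPU.
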
@@ -219,7 +219,7 @@ The following configuration settings are also relevant for GPU scheduling for Pa
--conf spark.rapids.python.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.memory.gpu.maxAllocFraction=0.2 \
```
Similar to the [RMM pooling for JVM](../tuning-guide.md#pooled-memory) settings like
Similar to the [RMM pooling for JVM](https://docs.nvidia.com/spark-rapids/user-guide/latest/tuning-guide.html#pinned-memory) settings like
`spark.rapids.memory.gpu.allocFraction` and `spark.rapids.memory.gpu.maxAllocFraction` except
these specify the GPU pool size for the _Python processes_. Half of the GPU _available_ memory
will be used by default if it is not specified.
91 changes: 91 additions & 0 deletions docs/archive.md
@@ -5,6 +5,96 @@ nav_order: 15
---
Below are archived releases for RAPIDS Accelerator for Apache Spark.

## Release v23.12.2
### Hardware Requirements:

The plugin is tested on the following architectures:

GPU Models: NVIDIA V100, T4, A10/A100, L4 and H100 GPUs

### Software Requirements:

OS: Ubuntu 20.04, Ubuntu 22.04, CentOS 7, or Rocky Linux 8

NVIDIA Driver*: R470+

Runtime:
Scala 2.12, 2.13
Python, Java Virtual Machine (JVM) compatible with your Spark version.

* Check the Spark documentation for Python and Java version compatibility with your specific
Spark version. For instance, visit `https://spark.apache.org/docs/3.4.1` for Spark 3.4.1.

Supported Spark versions:
Apache Spark 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4
Apache Spark 3.3.0, 3.3.1, 3.3.2, 3.3.3
Apache Spark 3.4.0, 3.4.1
Apache Spark 3.5.0

Supported Databricks runtime versions for Azure and AWS:
Databricks 10.4 ML LTS (GPU, Scala 2.12, Spark 3.2.1)
Databricks 11.3 ML LTS (GPU, Scala 2.12, Spark 3.3.0)
Databricks 12.2 ML LTS (GPU, Scala 2.12, Spark 3.3.2)

Supported Dataproc versions:
GCP Dataproc 2.0
GCP Dataproc 2.1

Supported Dataproc Serverless versions:
Spark runtime 1.1 LTS

*Some hardware may have a minimum driver version greater than R470. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](https://docs.nvidia.com/spark-rapids/user-guide/latest/faq.html#which-distributions-are-supported) section of the FAQ.

### RAPIDS Accelerator's Support Policy for Apache Spark
The RAPIDS Accelerator maintains support for Apache Spark versions available for download from [Apache Spark](https://spark.apache.org/downloads.html).

### Download RAPIDS Accelerator for Apache Spark v23.12.2
- **Scala 2.12:**
- [RAPIDS Accelerator for Apache Spark 23.12.2 - Scala 2.12 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.12.2/rapids-4-spark_2.12-23.12.2.jar)
- [RAPIDS Accelerator for Apache Spark 23.12.2 - Scala 2.12 jar.asc](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.12.2/rapids-4-spark_2.12-23.12.2.jar.asc)

- **Scala 2.13:**
- [RAPIDS Accelerator for Apache Spark 23.12.2 - Scala 2.13 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/23.12.2/rapids-4-spark_2.13-23.12.2.jar)
- [RAPIDS Accelerator for Apache Spark 23.12.2 - Scala 2.13 jar.asc](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/23.12.2/rapids-4-spark_2.13-23.12.2.jar.asc)

This package is built against CUDA 11.8. It is tested on V100, T4, A10, A100, L4 and H100 GPUs with
CUDA 11.8 through CUDA 12.0.

### Verify signature
* Download the [PUB_KEY](https://keys.openpgp.org/[email protected]).
* Import the public key: `gpg --import PUB_KEY`
* Verify the signature for Scala 2.12 jar:
`gpg --verify rapids-4-spark_2.12-23.12.2.jar.asc rapids-4-spark_2.12-23.12.2.jar`
* Verify the signature for Scala 2.13 jar:
`gpg --verify rapids-4-spark_2.13-23.12.2.jar.asc rapids-4-spark_2.13-23.12.2.jar`

The expected output of the signature verification:

gpg: Good signature from "NVIDIA Spark (For the signature of spark-rapids release jars) <[email protected]>"

### Release Notes
New functionality and performance improvements for this release include:
* Introduced support for chunked reading of ORC files.
* Enhanced support for additional time zones and added stack function support.
* Enhanced performance for join and aggregation operations.
* Kernel optimizations have been implemented to improve Parquet read performance.
* The RAPIDS Accelerator is also built and tested with Scala 2.13.
* This is the last release to support Pascal-based NVIDIA GPUs; Pascal support is discontinued in the next release.
* Introduced support for Parquet LEGACY rebase mode (spark.sql.parquet.datetimeRebaseModeInRead=LEGACY and spark.sql.parquet.int96RebaseModeInRead=LEGACY).
* Introduced support for the Percentile function.
* Delta Lake 2.3 support.
* Qualification and Profiling tool:
* Profiling Tool now processes Spark Driver log for GPU runs, enhancing feature analysis.
* Auto-tuner recommendations include AQE settings for optimized performance.
* New configurations in Profiler for enabling off-default features: udfCompiler, incompatibleDateFormats, hasExtendedYearValues.

For a detailed list of changes, please refer to the
[CHANGELOG](https://github.com/NVIDIA/spark-rapids/blob/main/CHANGELOG.md).

## Release v23.12.1
### Hardware Requirements:

@@ -1571,3 +1661,4 @@ Software Requirements:
Python 3.x, Scala 2.12, Java 8



91 changes: 73 additions & 18 deletions docs/compatibility.md
@@ -180,7 +180,7 @@ date. Typically, one that overflowed.

### CSV Floating Point

Parsing floating-point values has the same limitations as [casting from string to float](#String-to-Float).
Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).

Also parsing of some values will not produce bit for bit identical results to what the CPU does.
They are within round-off errors except when they are close enough to overflow to Inf or -Inf which
@@ -219,7 +219,7 @@ Hive text files are very similar to CSV, but not exactly the same.

### Hive Text File Floating Point

Parsing floating-point values has the same limitations as [casting from string to float](#String-to-Float).
Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).

Also parsing of some values will not produce bit for bit identical results to what the CPU does.
They are within round-off errors except when they are close enough to overflow to Inf or -Inf which
@@ -245,7 +245,9 @@ to work for dates after the epoch as described
[here](https://github.com/NVIDIA/spark-rapids/issues/140).

The plugin supports reading `uncompressed`, `snappy`, `zlib` and `zstd` ORC files and writing
`uncompressed` and `snappy` ORC files. At this point, the plugin does not have the ability to fall
`uncompressed`, `snappy` and `zstd` ORC files. At this point, the plugin does not have the ability to fall
back to the CPU when reading an unsupported compression format, and will error out in that case.
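
As a hedged illustration (the output path and data are placeholders), the write codec is selected
through the standard Spark setting shown below; the same idea applies to Parquet via
`spark.sql.parquet.compression.codec`.

```scala
// Illustrative only: write ORC with zstd compression, which this release can
// also produce on the GPU. Reading an unsupported codec errors out on the GPU
// rather than falling back to the CPU.
spark.conf.set("spark.sql.orc.compression.codec", "zstd")

spark.range(1000).toDF("value")
  .write.mode("overwrite").orc("/tmp/zstd-orc-example")

spark.read.orc("/tmp/zstd-orc-example").show(5)
```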

### Push Down Aggregates for ORC
@@ -307,7 +309,8 @@ When writing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is currently i
[here](https://github.com/NVIDIA/spark-rapids/issues/144).

The plugin supports reading `uncompressed`, `snappy`, `gzip` and `zstd` Parquet files and writing
`uncompressed` and `snappy` Parquet files. At this point, the plugin does not have the ability to
`uncompressed`, `snappy` and `zstd` Parquet files. At this point, the plugin does not have the ability to
fall back to the CPU when reading an unsupported compression format, and will error out in that
case.

@@ -349,7 +352,7 @@ with Spark, and can be enabled by setting `spark.rapids.sql.expression.JsonToStr

Dates are partially supported but there are some known issues:

- Only the default `dateFormat` of `yyyy-MM-dd` is supported. The query will fall back to CPU if any other format
- Only the default `dateFormat` of `yyyy-MM-dd` is supported in Spark 3.1.x. The query will fall back to CPU if any other format
is specified ([#9667](https://github.com/NVIDIA/spark-rapids/issues/9667))
- Strings containing integers with more than four digits will be
parsed as null ([#9664](https://github.com/NVIDIA/spark-rapids/issues/9664)) whereas Spark versions prior to 3.4
@@ -378,6 +381,14 @@ In particular, the output map is not resulted from a regular JSON parsing but in
* If the input JSON is given as multiple rows, any row containing invalid JSON format will be parsed as an empty
struct instead of a null value ([#9592](https://github.com/NVIDIA/spark-rapids/issues/9592)).

When a JSON attribute contains mixed types (different types in different rows), such as a mix of dictionaries
and lists, Spark will return a string representation of the JSON, but when running on GPU, the default
behavior is to throw an exception. There is an experimental setting
`spark.rapids.sql.json.read.mixedTypesAsString.enabled` that can be set to true to support reading
mixed types as string, but there are known issues where it could also read structs as string in some cases. There
can also be minor formatting differences. Spark will return a parsed and formatted representation, but the
GPU implementation returns the unparsed JSON string.
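
A minimal sketch of opting in to this experimental behavior is shown below; the sample rows and
schema are illustrative, and the setting name is the one described above.

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

// Experimental: return mixed-type attributes as raw JSON strings on the GPU
// instead of throwing. The sample rows mix an object and an array under "a".
spark.conf.set("spark.rapids.sql.json.read.mixedTypesAsString.enabled", "true")

val rows = Seq("""{"a": {"x": 1}}""", """{"a": [1, 2, 3]}""").toDF("json")
val schema = StructType(Seq(StructField("a", StringType)))
rows.select(from_json($"json", schema).alias("parsed")).show(truncate = false)
```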

### `to_json` function

The `to_json` function is disabled by default because it is experimental and has some known incompatibilities
@@ -391,7 +402,7 @@ Known issues are:

### JSON Floating Point

Parsing floating-point values has the same limitations as [casting from string to float](#String-to-Float).
Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).

Prior to Spark 3.3.0, reading JSON strings such as `"+Infinity"` when specifying that the data type is `FloatType`
or `DoubleType` caused these values to be parsed even when `allowNonNumericNumbers` is set to false. Also, Spark
@@ -441,6 +452,44 @@ parse some variants of `NaN` and `Infinity` even when this option is disabled
([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The RAPIDS Accelerator behavior is consistent with
Spark version 3.3.0 and later.

### get_json_object

The `GetJsonObject` operator takes a JSON formatted string and a JSON path string as input. The
code base for this is currently separate from GPU parsing of JSON for files and `FromJsonObject`.
Because of this, the results can differ between the two. Due to several incompatibilities and
bugs in the GPU version of `GetJsonObject`, it runs on the CPU by default. If you are aware of
the current limitations of the GPU version, you might see a significant performance speedup if
you enable it by setting `spark.rapids.sql.expression.GetJsonObject` to `true`.
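
A brief example of opting in (the JSON literal and path are placeholders):

```scala
// Opt in to the GPU implementation of get_json_object after reviewing the
// limitations listed below.
spark.conf.set("spark.rapids.sql.expression.GetJsonObject", "true")

spark.sql(
  """SELECT get_json_object('{"store": {"fruit": "apple"}}', '$.store.fruit') AS fruit"""
).show()
```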

The following is a list of known differences.
* [No input validation](https://github.com/NVIDIA/spark-rapids/issues/10218). If the input string
  is not valid JSON, Apache Spark returns a null result, but the GPU version will still try to find a match.
* [Escapes are not properly processed for Strings](https://github.com/NVIDIA/spark-rapids/issues/10196).
  When returning a result for a quoted string, Apache Spark removes the quotes and replaces
  any escape sequences with the proper characters. This escape sequence processing does not happen
  on the GPU.
* [Invalid JSON paths could throw exceptions](https://github.com/NVIDIA/spark-rapids/issues/10212).
  If a JSON path is not valid, Apache Spark returns a null result, but the GPU version may throw an
  exception and fail the query.
* [Non-string output is not normalized](https://github.com/NVIDIA/spark-rapids/issues/10218).
  When returning a result for anything other than strings, a number of things are normalized by
  Apache Spark but not by the GPU, such as removing unnecessary white space,
  parsing and then serializing floating-point numbers, turning single quotes into double quotes,
  and removing unneeded escapes for single quotes.

The following is a list of bugs in either the GPU version or arguably in Apache Spark itself.
* https://github.com/NVIDIA/spark-rapids/issues/10219 non-matching quotes in quoted strings
* https://github.com/NVIDIA/spark-rapids/issues/10213 array index notation works without root
* https://github.com/NVIDIA/spark-rapids/issues/10214 unquoted array index notation is not
supported
* https://github.com/NVIDIA/spark-rapids/issues/10215 leading spaces can be stripped from named
keys.
* https://github.com/NVIDIA/spark-rapids/issues/10216 It appears that Spark is flattening some
output, which is different from other implementations including the GPU version.
* https://github.com/NVIDIA/spark-rapids/issues/10217 a JSON path execution bug
* https://issues.apache.org/jira/browse/SPARK-46761 Apache Spark does not allow the `?` character in
a quoted JSON path string.

## Avro

The Avro format read is a very experimental feature which is expected to have some issues, so we disable
@@ -491,14 +540,18 @@ The following regular expression patterns are not yet supported on the GPU and w
or more results
- Line anchor `$` and string anchors `\Z` are not supported in patterns containing `\W` or `\D`
- Line and string anchors are not supported by `string_split` and `str_to_map`
- Lazy quantifiers, such as `a*?`
- Lazy quantifiers within a choice block such as `(2|\u2029??)+`
- Possessive quantifiers, such as `a*+`
- Character classes that use union, intersection, or subtraction semantics, such as `[a-d[m-p]]`, `[a-z&&[def]]`,
or `[a-z&&[^bc]]`
- Empty groups: `()`

Work is ongoing to increase the range of regular expressions that can run on the GPU.
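
As an illustrative sketch (patterns and data are placeholders), a query that uses only supported
constructs can stay on the GPU, while one using a construct from the list above is expected to
fall back to the CPU:

```scala
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._

val df = Seq("aaab").toDF("s")

// Greedy quantifier: eligible to run on the GPU.
df.select(regexp_extract($"s", "a+", 0)).show()

// Possessive quantifier: expected to fall back to the CPU per the list above.
df.select(regexp_extract($"s", "a*+b", 0)).show()
```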

## URL Parsing

`parse_url` QUERY with a column key can produce different results on the CPU and GPU. In Spark, the `key` in `parse_url` can act like a regex, but the GPU matches the key exactly. If the key is a literal, the GPU checks whether the key contains regex special characters and falls back to the CPU if it does, but if the key is a column, it cannot fall back. For example, `parse_url("http://foo/bar?abc=BAD&a.c=GOOD", QUERY, "a.c")` returns "BAD" on the CPU but "GOOD" on the GPU. See the Spark issue: https://issues.apache.org/jira/browse/SPARK-44500
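
A hedged illustration of the column-key case described above (the data is a placeholder); with a
literal key the GPU can fall back, but with a column key the CPU and GPU results can diverge:

```scala
import spark.implicits._

val df = Seq(("http://foo/bar?abc=BAD&a.c=GOOD", "a.c")).toDF("url", "key")

// CPU: "BAD" (the key behaves like a regex, so "a.c" also matches "abc").
// GPU: "GOOD" (the key is matched exactly).
df.selectExpr("parse_url(url, 'QUERY', key) AS value").show(truncate = false)
```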

## Timestamps

Spark stores timestamps internally relative to the JVM time zone. Converting an arbitrary timestamp
@@ -711,27 +764,29 @@ to `false`.

### Float to String

The GPU will use different precision than Java's toString method when converting floating-point data
types to strings. The GPU uses a lowercase `e` prefix for an exponent while Spark uses uppercase
`E`. As a result the computed string can differ from the default behavior in Spark.

The `format_number` function will retain 10 digits of precision for the GPU when the input is a floating
point number, but Spark will retain up to 17 digits of precision, i.e. `format_number(1234567890.1234567890, 5)`
will return `1,234,567,890.00000` on the GPU and `1,234,567,890.12346` on the CPU. To enable this on the GPU, set [`spark.rapids.sql.formatNumberFloat.enabled`](additional-functionality/advanced_configs.md#sql.formatNumberFloat.enabled) to `true`.
The RAPIDS Accelerator for Apache Spark uses a method based on [ryu](https://github.com/ulfjack/ryu) when converting floating-point data types to strings. As a result the computed string can differ from the output of Spark in some cases: sometimes the output is shorter (which is arguably more accurate) and sometimes the output may differ in the precise digits that are output.

This configuration is enabled by default. To disable this operation on the GPU set
[`spark.rapids.sql.castFloatToString.enabled`](additional-functionality/advanced_configs.md#sql.castFloatToString.enabled) to `false`.

The `format_number` function also uses [ryu](https://github.com/ulfjack/ryu) as the solution when formatting floating-point data types to
strings, so results may differ from Spark in the same way. To disable this on the GPU, set
[`spark.rapids.sql.formatNumberFloat.enabled`](additional-functionality/advanced_configs.md#sql.formatNumberFloat.enabled) to `false`.
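
A minimal sketch of disabling both behaviors when bit-for-bit parity with the CPU output is
required (the literals are illustrative):

```scala
// Force float-to-string casts and format_number back to the CPU.
spark.conf.set("spark.rapids.sql.castFloatToString.enabled", "false")
spark.conf.set("spark.rapids.sql.formatNumberFloat.enabled", "false")

spark.sql("SELECT CAST(1.0E-4 AS STRING) AS s, format_number(12345.6789, 2) AS f")
  .show(truncate = false)
```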

### String to Float

Casting from string to floating-point types on the GPU returns incorrect results when the string
represents any number in the following ranges. In both cases the GPU returns `Double.MaxValue`. The
default behavior in Apache Spark is to return `+Infinity` and `-Infinity`, respectively.
Casting from string to double on the GPU returns incorrect results when the string represents any
number in the following ranges. In both cases the GPU returns `Double.MaxValue`. The default behavior
in Apache Spark is to return `+Infinity` and `-Infinity`, respectively.

- `1.7976931348623158E308 <= x < 1.7976931348623159E308`
- `-1.7976931348623159E308 < x <= -1.7976931348623158E308`

Also, the GPU does not support casting from strings containing hex values.
Casting from string to double on the GPU could also sometimes return incorrect results if the string
contains high precision values. Apache Spark rounds the values to the nearest double, while the GPU
truncates the values directly.

Also, the GPU does not support casting from strings containing hex values to floating-point types.

This configuration is enabled by default. To disable this operation on the GPU set
[`spark.rapids.sql.castStringToFloat.enabled`](additional-functionality/advanced_configs.md#sql.castStringToFloat.enabled) to `false`.
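
A minimal sketch of the boundary case and the opt-out (the literal is one of the boundary values
listed above):

```scala
// Near the upper edge of the double range, the GPU returns Double.MaxValue
// where Spark returns +Infinity.
spark.sql("SELECT CAST('1.7976931348623158E308' AS DOUBLE) AS d").show(truncate = false)

// Force string-to-float casts back to the CPU.
spark.conf.set("spark.rapids.sql.castStringToFloat.enabled", "false")
```
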
2 changes: 1 addition & 1 deletion docs/configs.md
@@ -10,7 +10,7 @@ The following is the list of options that `rapids-plugin-4-spark` supports.
On startup use: `--conf [conf key]=[conf value]`. For example:

```
${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-23.12.0-cuda11.jar \
${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-24.02.0-cuda11.jar \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.concurrentGpuTasks=2
```