[DOC] updates gh-pages files for 24.02.0 release [skip ci] (#10447)
* gh pages update

Signed-off-by: Suraj Aralihalli <[email protected]>

* fix broken table in github.io; fix broken link

Signed-off-by: Suraj Aralihalli <[email protected]>

---------

Signed-off-by: Suraj Aralihalli <[email protected]>
SurajAralihalli authored Mar 12, 2024
1 parent 8c11690 commit e5438b6
Showing 12 changed files with 1,235 additions and 668 deletions.
19 changes: 14 additions & 5 deletions docs/additional-functionality/advanced_configs.md

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions docs/additional-functionality/rapids-udfs.md
@@ -11,8 +11,8 @@ implementation alongside the CPU implementation, enabling the
RAPIDS Accelerator to perform the user-defined operation on the GPU.

Note that there are other potential solutions to performing user-defined
operations on the GPU. See the
[Frequently Asked Questions entry](../FAQ.md#how-can-i-run-custom-expressionsudfs-on-the-gpu)
operations on the GPU. See the
[Frequently Asked Questions entry](https://docs.nvidia.com/spark-rapids/user-guide/latest/faq.html#how-can-i-run-custom-expressions-udfs-on-the-gpu)
on UDFs for more details.

## UDF Obstacles To Query Acceleration
@@ -52,7 +52,7 @@ Other forms of Spark UDFs are not supported, such as:

For supported UDFs, the RAPIDS Accelerator will detect a GPU implementation
if the UDF class implements the
[RapidsUDF](../../sql-plugin/src/main/java/com/nvidia/spark/RapidsUDF.java)
[RapidsUDF](../../sql-plugin-api/src/main/java/com/nvidia/spark/RapidsUDF.java)
interface. Unlike the CPU UDF which processes data one row at a time, the
GPU version processes a columnar batch of rows. This reduces invocation
overhead and enables parallel processing of the data by the GPU.
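
Below is a minimal, hypothetical sketch of what such a UDF can look like. The class name, the
column arithmetic, and the exact `evaluateColumnar` signature (shown here as the variant that
receives the row count) are assumptions for illustration; check them against the `RapidsUDF`
interface shipped with your plugin version.

```scala
import ai.rapids.cudf.{ColumnVector, Scalar}
import com.nvidia.spark.RapidsUDF

// Hypothetical sketch: a UDF with both a row-at-a-time CPU path and a
// columnar GPU path. The evaluateColumnar signature is assumed and may
// differ between plugin versions.
class PlusOne extends Function1[Int, Int] with RapidsUDF with Serializable {
  // CPU path: called once per row when the GPU path is not used.
  override def apply(x: Int): Int = x + 1

  // GPU path: called once per columnar batch of rows.
  override def evaluateColumnar(numRows: Int, args: ColumnVector*): ColumnVector = {
    require(args.length == 1, s"Expected one argument, got ${args.length}")
    val one = Scalar.fromInt(1)
    try {
      args(0).add(one) // element-wise add executed on the device
    } finally {
      one.close()
    }
  }
}
```

Registered with something like `spark.udf.register("plus_one", new PlusOne())`, the plugin can
then choose the columnar path when the expression runs on the GPU.
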
@@ -219,7 +219,7 @@ The following configuration settings are also relevant for GPU scheduling for Pa
--conf spark.rapids.python.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.memory.gpu.maxAllocFraction=0.2 \
```
Similar to the [RMM pooling for JVM](../tuning-guide.md#pooled-memory) settings like
Similar to the [RMM pooling for JVM](https://docs.nvidia.com/spark-rapids/user-guide/latest/tuning-guide.html#pinned-memory) settings like
`spark.rapids.memory.gpu.allocFraction` and `spark.rapids.memory.gpu.maxAllocFraction` except
these specify the GPU pool size for the _Python processes_. Half of the GPU _available_ memory
will be used by default if it is not specified.
91 changes: 91 additions & 0 deletions docs/archive.md
@@ -5,6 +5,96 @@ nav_order: 15
---
Below are archived releases for RAPIDS Accelerator for Apache Spark.

## Release v23.12.2
### Hardware Requirements:

The plugin is tested on the following architectures:

GPU Models: NVIDIA V100, T4, A10/A100, L4 and H100 GPUs

### Software Requirements:

OS: Ubuntu 20.04, Ubuntu 22.04, CentOS 7, or Rocky Linux 8

NVIDIA Driver*: R470+

Runtime:
Scala 2.12, 2.13
Python, Java Virtual Machine (JVM) compatible with your Spark version.

* Check the Spark documentation for Python and Java version compatibility with your specific
Spark version. For instance, visit `https://spark.apache.org/docs/3.4.1` for Spark 3.4.1.

Supported Spark versions:
Apache Spark 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.4
Apache Spark 3.3.0, 3.3.1, 3.3.2, 3.3.3
Apache Spark 3.4.0, 3.4.1
Apache Spark 3.5.0

Supported Databricks runtime versions for Azure and AWS:
Databricks 10.4 ML LTS (GPU, Scala 2.12, Spark 3.2.1)
Databricks 11.3 ML LTS (GPU, Scala 2.12, Spark 3.3.0)
Databricks 12.2 ML LTS (GPU, Scala 2.12, Spark 3.3.2)

Supported Dataproc versions:
GCP Dataproc 2.0
GCP Dataproc 2.1

Supported Dataproc Serverless versions:
Spark runtime 1.1 LTS

*Some hardware may have a minimum driver version greater than R470. Check the GPU spec sheet
for your hardware's minimum driver version.

*For Cloudera and EMR support, please refer to the
[Distributions](https://docs.nvidia.com/spark-rapids/user-guide/latest/faq.html#which-distributions-are-supported) section of the FAQ.

### RAPIDS Accelerator's Support Policy for Apache Spark
The RAPIDS Accelerator maintains support for Apache Spark versions available for download from [Apache Spark](https://spark.apache.org/downloads.html).

### Download RAPIDS Accelerator for Apache Spark v23.12.2
- **Scala 2.12:**
- [RAPIDS Accelerator for Apache Spark 23.12.2 - Scala 2.12 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.12.2/rapids-4-spark_2.12-23.12.2.jar)
- [RAPIDS Accelerator for Apache Spark 23.12.2 - Scala 2.12 jar.asc](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/23.12.2/rapids-4-spark_2.12-23.12.2.jar.asc)

- **Scala 2.13:**
- [RAPIDS Accelerator for Apache Spark 23.12.2 - Scala 2.13 jar](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/23.12.2/rapids-4-spark_2.13-23.12.2.jar)
- [RAPIDS Accelerator for Apache Spark 23.12.2 - Scala 2.13 jar.asc](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.13/23.12.2/rapids-4-spark_2.13-23.12.2.jar.asc)

This package is built against CUDA 11.8. It is tested on V100, T4, A10, A100, L4 and H100 GPUs with
CUDA 11.8 through CUDA 12.0.

### Verify signature
* Download the [PUB_KEY](https://keys.openpgp.org/[email protected]).
* Import the public key: `gpg --import PUB_KEY`
* Verify the signature for Scala 2.12 jar:
`gpg --verify rapids-4-spark_2.12-23.12.2.jar.asc rapids-4-spark_2.12-23.12.2.jar`
* Verify the signature for Scala 2.13 jar:
`gpg --verify rapids-4-spark_2.13-23.12.2.jar.asc rapids-4-spark_2.13-23.12.2.jar`

The expected output of the signature verification:

gpg: Good signature from "NVIDIA Spark (For the signature of spark-rapids release jars) <[email protected]>"

### Release Notes
New functionality and performance improvements for this release include:
* Introduced support for chunked reading of ORC files.
* Enhanced support for additional time zones and added stack function support.
* Enhanced performance for join and aggregation operations.
* Kernel optimizations have been implemented to improve Parquet read performance.
* The RAPIDS Accelerator is also built and tested with Scala 2.13.
* This is the last release to support Pascal-based NVIDIA GPUs; Pascal support is discontinued in the next release.
* Introduced support for Parquet LEGACY rebase mode (spark.sql.parquet.datetimeRebaseModeInRead=LEGACY and spark.sql.parquet.int96RebaseModeInRead=LEGACY).
* Introduced support for the Percentile function.
* Delta Lake 2.3 support.
* Qualification and Profiling tool:
* Profiling Tool now processes Spark Driver log for GPU runs, enhancing feature analysis.
* Auto-tuner recommendations include AQE settings for optimized performance.
* New configurations in Profiler for enabling off-default features: udfCompiler, incompatibleDateFormats, hasExtendedYearValues.

For a detailed list of changes, please refer to the
[CHANGELOG](https://github.com/NVIDIA/spark-rapids/blob/main/CHANGELOG.md).

## Release v23.12.1
### Hardware Requirements:

@@ -1571,3 +1661,4 @@ Software Requirements:
Python 3.x, Scala 2.12, Java 8



91 changes: 73 additions & 18 deletions docs/compatibility.md
@@ -180,7 +180,7 @@ date. Typically, one that overflowed.

### CSV Floating Point

Parsing floating-point values has the same limitations as [casting from string to float](#String-to-Float).
Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).

Also parsing of some values will not produce bit for bit identical results to what the CPU does.
They are within round-off errors except when they are close enough to overflow to Inf or -Inf which
@@ -219,7 +219,7 @@ Hive text files are very similar to CSV, but not exactly the same.

### Hive Text File Floating Point

Parsing floating-point values has the same limitations as [casting from string to float](#String-to-Float).
Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).

Also parsing of some values will not produce bit for bit identical results to what the CPU does.
They are within round-off errors except when they are close enough to overflow to Inf or -Inf which
@@ -245,7 +245,9 @@ to work for dates after the epoch as described
[here](https://github.com/NVIDIA/spark-rapids/issues/140).

The plugin supports reading `uncompressed`, `snappy`, `zlib` and `zstd` ORC files and writing
`uncompressed` and `snappy` ORC files. At this point, the plugin does not have the ability to fall
`uncompressed`, `snappy` and `zstd` ORC files. At this point, the plugin does not have the ability to fall
back to the CPU when reading an unsupported compression format, and will error out in that case.
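
As a hedged illustration (the output path and data are placeholders), the write codec is selected
through the standard Spark setting shown below; the same idea applies to Parquet via
`spark.sql.parquet.compression.codec`.

```scala
// Illustrative only: write ORC with zstd compression, which this release can
// also produce on the GPU. Reading an unsupported codec errors out on the GPU
// rather than falling back to the CPU.
spark.conf.set("spark.sql.orc.compression.codec", "zstd")

spark.range(1000).toDF("value")
  .write.mode("overwrite").orc("/tmp/zstd-orc-example")

spark.read.orc("/tmp/zstd-orc-example").show(5)
```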

### Push Down Aggregates for ORC
@@ -307,7 +309,8 @@ When writing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is currently i
[here](https://github.com/NVIDIA/spark-rapids/issues/144).

The plugin supports reading `uncompressed`, `snappy`, `gzip` and `zstd` Parquet files and writing
`uncompressed` and `snappy` Parquet files. At this point, the plugin does not have the ability to
`uncompressed`, `snappy` and `zstd` Parquet files. At this point, the plugin does not have the ability to
fall back to the CPU when reading an unsupported compression format, and will error out in that
case.

@@ -349,7 +352,7 @@ with Spark, and can be enabled by setting `spark.rapids.sql.expression.JsonToStr

Dates are partially supported but there are some known issues:

- Only the default `dateFormat` of `yyyy-MM-dd` is supported. The query will fall back to CPU if any other format
- Only the default `dateFormat` of `yyyy-MM-dd` is supported in Spark 3.1.x. The query will fall back to CPU if any other format
is specified ([#9667](https://github.com/NVIDIA/spark-rapids/issues/9667))
- Strings containing integers with more than four digits will be
parsed as null ([#9664](https://github.com/NVIDIA/spark-rapids/issues/9664)) whereas Spark versions prior to 3.4
@@ -378,6 +381,14 @@ In particular, the output map is not resulted from a regular JSON parsing but in
* If the input JSON is given as multiple rows, any row containing invalid JSON format will be parsed as an empty
struct instead of a null value ([#9592](https://github.com/NVIDIA/spark-rapids/issues/9592)).

When a JSON attribute contains mixed types (different types in different rows), such as a mix of dictionaries
and lists, Spark will return a string representation of the JSON, but when running on GPU, the default
behavior is to throw an exception. There is an experimental setting
`spark.rapids.sql.json.read.mixedTypesAsString.enabled` that can be set to true to support reading
mixed types as string, but there are known issues where it could also read structs as string in some cases. There
can also be minor formatting differences. Spark will return a parsed and formatted representation, but the
GPU implementation returns the unparsed JSON string.
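
A minimal sketch of opting in to this experimental behavior is shown below; the sample rows and
schema are illustrative, and the setting name is the one described above.

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._

// Experimental: return mixed-type attributes as raw JSON strings on the GPU
// instead of throwing. The sample rows mix an object and an array under "a".
spark.conf.set("spark.rapids.sql.json.read.mixedTypesAsString.enabled", "true")

val rows = Seq("""{"a": {"x": 1}}""", """{"a": [1, 2, 3]}""").toDF("json")
val schema = StructType(Seq(StructField("a", StringType)))
rows.select(from_json($"json", schema).alias("parsed")).show(truncate = false)
```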

### `to_json` function

The `to_json` function is disabled by default because it is experimental and has some known incompatibilities
@@ -391,7 +402,7 @@ Known issues are:

### JSON Floating Point

Parsing floating-point values has the same limitations as [casting from string to float](#String-to-Float).
Parsing floating-point values has the same limitations as [casting from string to float](#string-to-float).

Prior to Spark 3.3.0, reading JSON strings such as `"+Infinity"` when specifying that the data type is `FloatType`
or `DoubleType` caused these values to be parsed even when `allowNonNumericNumbers` is set to false. Also, Spark
@@ -441,6 +452,44 @@ parse some variants of `NaN` and `Infinity` even when this option is disabled
([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The RAPIDS Accelerator behavior is consistent with
Spark version 3.3.0 and later.

### get_json_object

The `GetJsonObject` operator takes a JSON formatted string and a JSON path string as input. The
code base for this is currently separate from GPU parsing of JSON for files and `FromJsonObject`.
Because of this, the results can differ between the two. Due to several incompatibilities and
bugs in the GPU version of `GetJsonObject`, it runs on the CPU by default. If you are aware of
the current limitations of the GPU version, you might see a significant performance speedup if
you enable it by setting `spark.rapids.sql.expression.GetJsonObject` to `true`.
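
A brief example of opting in (the JSON literal and path are placeholders):

```scala
// Opt in to the GPU implementation of get_json_object after reviewing the
// limitations listed below.
spark.conf.set("spark.rapids.sql.expression.GetJsonObject", "true")

spark.sql(
  """SELECT get_json_object('{"store": {"fruit": "apple"}}', '$.store.fruit') AS fruit"""
).show()
```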

The following is a list of known differences.
* [No input validation](https://github.com/NVIDIA/spark-rapids/issues/10218). If the input string
  is not valid JSON, Apache Spark returns a null result, but the GPU version will still try to find a match.
* [Escapes are not properly processed for Strings](https://github.com/NVIDIA/spark-rapids/issues/10196).
  When returning a result for a quoted string, Apache Spark removes the quotes and replaces
  any escape sequences with the proper characters. This escape sequence processing does not happen
  on the GPU.
* [Invalid JSON paths could throw exceptions](https://github.com/NVIDIA/spark-rapids/issues/10212).
  If a JSON path is not valid, Apache Spark returns a null result, but the GPU version may throw an
  exception and fail the query.
* [Non-string output is not normalized](https://github.com/NVIDIA/spark-rapids/issues/10218).
  When returning a result for anything other than strings, a number of things are normalized by
  Apache Spark but not by the GPU, such as removing unnecessary white space,
  parsing and then serializing floating-point numbers, turning single quotes into double quotes,
  and removing unneeded escapes for single quotes.

The following is a list of bugs in either the GPU version or arguably in Apache Spark itself.
* https://github.com/NVIDIA/spark-rapids/issues/10219 non-matching quotes in quoted strings
* https://github.com/NVIDIA/spark-rapids/issues/10213 array index notation works without root
* https://github.com/NVIDIA/spark-rapids/issues/10214 unquoted array index notation is not
supported
* https://github.com/NVIDIA/spark-rapids/issues/10215 leading spaces can be stripped from named
keys.
* https://github.com/NVIDIA/spark-rapids/issues/10216 It appears that Spark is flattening some
output, which is different from other implementations including the GPU version.
* https://github.com/NVIDIA/spark-rapids/issues/10217 a JSON path execution bug
* https://issues.apache.org/jira/browse/SPARK-46761 Apache Spark does not allow the `?` character in
a quoted JSON path string.

## Avro

The Avro format read is a very experimental feature which is expected to have some issues, so we disable
@@ -491,14 +540,18 @@ The following regular expression patterns are not yet supported on the GPU and w
or more results
- Line anchor `$` and string anchors `\Z` are not supported in patterns containing `\W` or `\D`
- Line and string anchors are not supported by `string_split` and `str_to_map`
- Lazy quantifiers, such as `a*?`
- Lazy quantifiers within a choice block such as `(2|\u2029??)+`
- Possessive quantifiers, such as `a*+`
- Character classes that use union, intersection, or subtraction semantics, such as `[a-d[m-p]]`, `[a-z&&[def]]`,
or `[a-z&&[^bc]]`
- Empty groups: `()`

Work is ongoing to increase the range of regular expressions that can run on the GPU.
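
As an illustrative sketch (patterns and data are placeholders), a query that uses only supported
constructs can stay on the GPU, while one using a construct from the list above is expected to
fall back to the CPU:

```scala
import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._

val df = Seq("aaab").toDF("s")

// Greedy quantifier: eligible to run on the GPU.
df.select(regexp_extract($"s", "a+", 0)).show()

// Possessive quantifier: expected to fall back to the CPU per the list above.
df.select(regexp_extract($"s", "a*+b", 0)).show()
```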

## URL Parsing

`parse_url` QUERY with a column key can produce different results on the CPU and GPU. In Spark, the `key` in `parse_url` can act like a regex, but the GPU matches the key exactly. If the key is a literal, the GPU checks whether the key contains regex special characters and falls back to the CPU if it does, but if the key is a column, it cannot fall back. For example, `parse_url("http://foo/bar?abc=BAD&a.c=GOOD", QUERY, "a.c")` returns "BAD" on the CPU but "GOOD" on the GPU. See the Spark issue: https://issues.apache.org/jira/browse/SPARK-44500
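
A hedged illustration of the column-key case described above (the data is a placeholder); with a
literal key the GPU can fall back, but with a column key the CPU and GPU results can diverge:

```scala
import spark.implicits._

val df = Seq(("http://foo/bar?abc=BAD&a.c=GOOD", "a.c")).toDF("url", "key")

// CPU: "BAD" (the key behaves like a regex, so "a.c" also matches "abc").
// GPU: "GOOD" (the key is matched exactly).
df.selectExpr("parse_url(url, 'QUERY', key) AS value").show(truncate = false)
```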

## Timestamps

Spark stores timestamps internally relative to the JVM time zone. Converting an arbitrary timestamp
@@ -711,27 +764,29 @@ to `false`.

### Float to String

The GPU will use different precision than Java's toString method when converting floating-point data
types to strings. The GPU uses a lowercase `e` prefix for an exponent while Spark uses uppercase
`E`. As a result the computed string can differ from the default behavior in Spark.

The `format_number` function will retain 10 digits of precision for the GPU when the input is a floating
point number, but Spark will retain up to 17 digits of precision, i.e. `format_number(1234567890.1234567890, 5)`
will return `1,234,567,890.00000` on the GPU and `1,234,567,890.12346` on the CPU. To enable this on the GPU, set [`spark.rapids.sql.formatNumberFloat.enabled`](additional-functionality/advanced_configs.md#sql.formatNumberFloat.enabled) to `true`.
The RAPIDS Accelerator for Apache Spark uses a method based on [ryu](https://github.com/ulfjack/ryu) when converting floating-point data types to strings. As a result the computed string can differ from the output of Spark in some cases: sometimes the output is shorter (which is arguably more accurate) and sometimes the output may differ in the precise digits that are output.

This configuration is enabled by default. To disable this operation on the GPU set
[`spark.rapids.sql.castFloatToString.enabled`](additional-functionality/advanced_configs.md#sql.castFloatToString.enabled) to `false`.

The `format_number` function also uses [ryu](https://github.com/ulfjack/ryu) as the solution when formatting floating-point data types to
strings, so results may differ from Spark in the same way. To disable this on the GPU, set
[`spark.rapids.sql.formatNumberFloat.enabled`](additional-functionality/advanced_configs.md#sql.formatNumberFloat.enabled) to `false`.
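
A minimal sketch of disabling both behaviors when bit-for-bit parity with the CPU output is
required (the literals are illustrative):

```scala
// Force float-to-string casts and format_number back to the CPU.
spark.conf.set("spark.rapids.sql.castFloatToString.enabled", "false")
spark.conf.set("spark.rapids.sql.formatNumberFloat.enabled", "false")

spark.sql("SELECT CAST(1.0E-4 AS STRING) AS s, format_number(12345.6789, 2) AS f")
  .show(truncate = false)
```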

### String to Float

Casting from string to floating-point types on the GPU returns incorrect results when the string
represents any number in the following ranges. In both cases the GPU returns `Double.MaxValue`. The
default behavior in Apache Spark is to return `+Infinity` and `-Infinity`, respectively.
Casting from string to double on the GPU returns incorrect results when the string represents any
number in the following ranges. In both cases the GPU returns `Double.MaxValue`. The default behavior
in Apache Spark is to return `+Infinity` and `-Infinity`, respectively.

- `1.7976931348623158E308 <= x < 1.7976931348623159E308`
- `-1.7976931348623159E308 < x <= -1.7976931348623158E308`

Also, the GPU does not support casting from strings containing hex values.
Casting from string to double on the GPU could also sometimes return incorrect results if the string
contains high precision values. Apache Spark rounds the values to the nearest double, while the GPU
truncates the values directly.

Also, the GPU does not support casting from strings containing hex values to floating-point types.

This configuration is enabled by default. To disable this operation on the GPU set
[`spark.rapids.sql.castStringToFloat.enabled`](additional-functionality/advanced_configs.md#sql.castStringToFloat.enabled) to `false`.
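
A minimal sketch of the boundary case and the opt-out (the literal is one of the boundary values
listed above):

```scala
// Near the upper edge of the double range, the GPU returns Double.MaxValue
// where Spark returns +Infinity.
spark.sql("SELECT CAST('1.7976931348623158E308' AS DOUBLE) AS d").show(truncate = false)

// Force string-to-float casts back to the CPU.
spark.conf.set("spark.rapids.sql.castStringToFloat.enabled", "false")
```
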
2 changes: 1 addition & 1 deletion docs/configs.md
@@ -10,7 +10,7 @@ The following is the list of options that `rapids-plugin-4-spark` supports.
On startup use: `--conf [conf key]=[conf value]`. For example:

```
${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-23.12.0-cuda11.jar \
${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-24.02.0-cuda11.jar \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.concurrentGpuTasks=2
```