Max min by (#43)
* Fix match error in RapidsShuffleIterator.scala [scala2.13] (NVIDIA#11115)

Signed-off-by: xieshuaihu <[email protected]>

* Spark 4: Handle ANSI mode in sort_test.py (NVIDIA#11099)

* Spark 4: Handle ANSI mode in sort_test.py

Fixes NVIDIA#11027.

With ANSI mode enabled (like the default in Spark 4), one sees that some
tests in `sort_test.py` fail, because they expect ANSI mode to be off.

This commit disables running those tests with ANSI enabled, and adds a
separate test for ANSI on/off.

Signed-off-by: MithunR <[email protected]>

* Refactored not to use disable_ansi_mode.

These tests need not be revisited.  They test all combinations of ANSI mode,
including overflow failures.

Signed-off-by: MithunR <[email protected]>

---------

Signed-off-by: MithunR <[email protected]>

* Introduce LORE framework. (NVIDIA#11084)

* Introduce lore id

* Introduce lore id

* Fix type

* Fix type

* Conf

* style

* part

* Dump

* Introduce lore framework

* Add tests.

* Rename test case

Signed-off-by: liurenjie1024 <[email protected]>

* Fix AQE test

* Fix style

* Use args to display lore info.

* Fix build break

* Fix path in loreinfo

* Remove path

* Fix comments

* Update configs

* Fix comments

* Fix config

---------

Signed-off-by: liurenjie1024 <[email protected]>

* Support minBy on GPU

Signed-off-by: Firestarman <[email protected]>

---------

Signed-off-by: xieshuaihu <[email protected]>
Signed-off-by: MithunR <[email protected]>
Signed-off-by: liurenjie1024 <[email protected]>
Signed-off-by: Firestarman <[email protected]>
Co-authored-by: xieshuaihu <[email protected]>
Co-authored-by: MithunR <[email protected]>
Co-authored-by: Renjie Liu <[email protected]>
Co-authored-by: Firestarman <[email protected]>
5 people authored Jul 5, 2024
1 parent c703d84 commit 66cf815
Showing 53 changed files with 583 additions and 68 deletions.
1 change: 1 addition & 0 deletions docs/additional-functionality/advanced_configs.md
@@ -407,6 +407,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.Last"></a>spark.rapids.sql.expression.Last|`last_value`, `last`|last aggregate operator|true|None|
<a name="sql.expression.Max"></a>spark.rapids.sql.expression.Max|`max`|Max aggregate operator|true|None|
<a name="sql.expression.Min"></a>spark.rapids.sql.expression.Min|`min`|Min aggregate operator|true|None|
<a name="sql.expression.MinBy"></a>spark.rapids.sql.expression.MinBy|`min_by`|MinBy aggregate operator. It may produce different results than CPU when multiple rows in a group have same minimum value in the ordering column and different associated values in the value column.|true|None|
<a name="sql.expression.Percentile"></a>spark.rapids.sql.expression.Percentile|`percentile`|Aggregation computing exact percentile|true|None|
<a name="sql.expression.PivotFirst"></a>spark.rapids.sql.expression.PivotFirst| |PivotFirst operator|true|None|
<a name="sql.expression.StddevPop"></a>spark.rapids.sql.expression.StddevPop|`stddev_pop`|Aggregation computing population standard deviation|true|None|
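
The tie caveat in the `MinBy` description above can be illustrated without Spark. The following is a minimal pure-Python sketch, not the plugin's implementation: `min_by` here is a hypothetical helper that keeps the first row seen with the smallest ordering value, so which value wins on a tie depends on the order in which rows arrive.

```python
from typing import Any, Iterable, Optional, Tuple


def min_by(rows: Iterable[Tuple[Any, Any]]) -> Optional[Any]:
    """Return the value paired with the smallest ordering key.

    Hypothetical helper for illustration only: rows are (value, ordering)
    pairs, and on a tie the first row encountered wins.
    """
    best_value, best_key, found = None, None, False
    for value, key in rows:
        # Strict comparison: a later row with an equal key does not replace
        # the current winner.
        if not found or key < best_key:
            best_value, best_key, found = value, key, True
    return best_value


# Two rows share the minimum ordering value 1 but carry different values.
rows = [("x", 1), ("y", 1), ("z", 5)]
first_order = min_by(rows)
second_order = min_by(list(reversed(rows)))
print(first_order, second_order)
```

Both answers are valid results for `min_by`; two engines that process rows in different orders can legitimately disagree on such input, which is the situation the description warns about.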
262 changes: 210 additions & 52 deletions docs/supported_ops.md
@@ -18090,6 +18090,138 @@ are limited.
<th>UDT</th>
</tr>
<tr>
<td rowSpan="6">MinBy</td>
<td rowSpan="6">`min_by`</td>
<td rowSpan="6">MinBy aggregate operator. It may produce different results than CPU when multiple rows in a group have same minimum value in the ordering column and different associated values in the value column.</td>
<td rowSpan="6">None</td>
<td rowSpan="3">aggregation</td>
<td>value</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><b>NS</b></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT</em></td>
<td><b>NS</b></td>
</tr>
<tr>
<td>ordering</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT</em></td>
<td> </td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT</em></td>
<td><b>NS</b></td>
</tr>
<tr>
<td>result</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><b>NS</b></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT</em></td>
<td><b>NS</b></td>
</tr>
<tr>
<td rowSpan="3">reduction</td>
<td>value</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><b>NS</b></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT</em></td>
<td><b>NS</b></td>
</tr>
<tr>
<td>ordering</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT</em></td>
<td> </td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, CALENDAR, UDT</em></td>
<td><b>NS</b></td>
</tr>
<tr>
<td>result</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><em>PS<br/>UTC is only supported TZ for TIMESTAMP</em></td>
<td>S</td>
<td>S</td>
<td>S</td>
<td>S</td>
<td><b>NS</b></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT</em></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types CALENDAR, UDT</em></td>
<td><b>NS</b></td>
</tr>
<tr>
<td rowSpan="8">Percentile</td>
<td rowSpan="8">`percentile`</td>
<td rowSpan="8">Aggregation computing exact percentile</td>
@@ -18396,6 +18528,32 @@ are limited.
<td><b>NS</b></td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="6">StddevPop</td>
<td rowSpan="6">`stddev_pop`</td>
<td rowSpan="6">Aggregation computing population standard deviation</td>
@@ -18529,32 +18687,6 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="6">StddevSamp</td>
<td rowSpan="6">`std`, `stddev_samp`, `stddev`</td>
<td rowSpan="6">Aggregation computing sample standard deviation</td>
@@ -18821,6 +18953,32 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="6">VariancePop</td>
<td rowSpan="6">`var_pop`</td>
<td rowSpan="6">Aggregation computing population variance</td>
@@ -18954,32 +19112,6 @@ are limited.
<td> </td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="6">VarianceSamp</td>
<td rowSpan="6">`var_samp`, `variance`</td>
<td rowSpan="6">Aggregation computing sample variance</td>
@@ -19186,6 +19318,32 @@ are limited.
<td><b>NS</b></td>
</tr>
<tr>
<th>Expression</th>
<th>SQL Functions(s)</th>
<th>Description</th>
<th>Notes</th>
<th>Context</th>
<th>Param/Output</th>
<th>BOOLEAN</th>
<th>BYTE</th>
<th>SHORT</th>
<th>INT</th>
<th>LONG</th>
<th>FLOAT</th>
<th>DOUBLE</th>
<th>DATE</th>
<th>TIMESTAMP</th>
<th>STRING</th>
<th>DECIMAL</th>
<th>NULL</th>
<th>BINARY</th>
<th>CALENDAR</th>
<th>ARRAY</th>
<th>MAP</th>
<th>STRUCT</th>
<th>UDT</th>
</tr>
<tr>
<td rowSpan="2">HiveGenericUDF</td>
<td rowSpan="2"> </td>
<td rowSpan="2">Hive Generic UDF, the UDF can choose to implement a RAPIDS accelerated interface to get better performance</td>
14 changes: 14 additions & 0 deletions integration_tests/src/main/python/hash_aggregate_test.py
@@ -1281,6 +1281,20 @@ def test_generic_reductions(data_gen):
'count(1)'),
conf=local_conf)

@ignore_order(local=True)
def test_hash_groupby_with_minby():
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: three_col_df(spark, int_gen, int_gen, UniqueLongGen())
            .groupby('a').agg(f.min_by('b', 'c'))
    )

@ignore_order(local=True)
def test_reduction_with_minby():
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: two_col_df(spark, int_gen, UniqueLongGen()).selectExpr(
            "min_by(a, b)")
    )
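
Both tests draw the ordering column from `UniqueLongGen()`, presumably so that the tie case called out in the docs cannot occur and the CPU and GPU results can be compared exactly. A minimal pure-Python sketch of that reasoning follows; `min_by_rows` is a hypothetical stand-in and does not use the integration-test helpers or Spark.

```python
import random


def min_by_rows(rows):
    """Hypothetical stand-in for min_by over (value, ordering) pairs."""
    # min() returns the first minimal element, so ties depend on row order.
    return min(rows, key=lambda row: row[1])[0]


rng = random.Random(0)
values = [rng.randint(0, 9) for _ in range(100)]
unique_keys = rng.sample(range(10_000), 100)  # no duplicates by construction
rows = list(zip(values, unique_keys))

# With unique ordering keys, every row order yields the same answer.
expected = min_by_rows(rows)
results = set()
for _ in range(20):
    shuffled = rows[:]
    rng.shuffle(shuffled)
    results.add(min_by_rows(shuffled))

# With a tied ordering key, the answer follows the row order instead.
tied = [("a", 1), ("b", 1)]
tied_forward = min_by_rows(tied)
tied_reversed = min_by_rows(tied[::-1])
```

Here `results` collapses to a single value regardless of shuffling, while `tied_forward` and `tied_reversed` differ, which is why a test comparing two engines needs tie-free ordering data.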

@pytest.mark.parametrize('data_gen', all_gen + _nested_gens, ids=idfn)
@allow_non_gpu(*non_utc_allow)
def test_count(data_gen):