forked from apache/spark
V3.4.1 lyft #57
Merged
Conversation
### What changes were proposed in this pull request? This PR aims to use `sbt-eclipse` instead of `sbteclipse-plugin`. ### Why are the changes needed? Thanks to SPARK-34959, Apache Spark 3.2+ uses SBT 1.5.0 and we can use `sbt-eclipse` instead of the old `sbteclipse-plugin`. - https://github.com/sbt/sbt-eclipse/releases/tag/6.0.0 ### Does this PR introduce _any_ user-facing change? No, this is a dev-only plugin. ### How was this patch tested? Pass the CIs and manual tests. ``` $ build/sbt eclipse Using /Users/dongjoon/.jenv/versions/1.8 as default JAVA_HOME. Note, this will be overridden by -java-home if it is set. Using SPARK_LOCAL_IP=localhost Attempting to fetch sbt Launching sbt from build/sbt-launch-1.8.2.jar [info] welcome to sbt 1.8.2 (AppleJDK-8.0.302.8.1 Java 1.8.0_302) [info] loading settings for project spark-merge-build from plugins.sbt ... [info] loading project definition from /Users/dongjoon/APACHE/spark-merge/project [info] Updating https://repo1.maven.org/maven2/com/github/sbt/sbt-eclipse_2.12_1.0/6.0.0/sbt-eclipse-6.0.0.pom 100.0% [##########] 2.5 KiB (4.5 KiB / s) ... ``` Closes apache#40708 from dongjoon-hyun/SPARK-43069. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 9cba552) Signed-off-by: Dongjoon Hyun <[email protected]>
…n Kafka connector ### What changes were proposed in this pull request? This PR moves the error class resource file in Kafka connector from test to src, so that the error class works without test artifacts. ### Why are the changes needed? Refer to the `How was this patch tested?`. ### Does this PR introduce _any_ user-facing change? Yes, but the possibility of encountering this is small enough. ### How was this patch tested? Ran spark-shell with Kafka connector artifacts (without test artifacts) and triggered KafkaExceptions to confirm that the exception is properly raised. ``` scala> import org.apache.spark.sql.kafka010.KafkaExceptions import org.apache.spark.sql.kafka010.KafkaExceptions scala> import org.apache.kafka.common.TopicPartition import org.apache.kafka.common.TopicPartition scala> KafkaExceptions.mismatchedTopicPartitionsBetweenEndOffsetAndPrefetched(Set[TopicPartition](), Set[TopicPartition]()) res1: org.apache.spark.SparkException = org.apache.spark.SparkException: Kafka data source in Trigger.AvailableNow should provide the same topic partitions in pre-fetched offset to end offset for each microbatch. The error could be transient - restart your query, and report if you still see the same issue. topic-partitions for pre-fetched offset: Set(), topic-partitions for end offset: Set(). ``` Without the fix, triggering KafkaExceptions failed to load the error class resource file and led to an unexpected exception. ``` scala> KafkaExceptions.mismatchedTopicPartitionsBetweenEndOffsetAndPrefetched(Set[TopicPartition](), Set[TopicPartition]()) java.lang.IllegalArgumentException: argument "src" is null at com.fasterxml.jackson.databind.ObjectMapper._assertNotNull(ObjectMapper.java:4885) at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3618) at org.apache.spark.ErrorClassesJsonReader$.org$apache$spark$ErrorClassesJsonReader$$readAsMap(ErrorClassesJSONReader.scala:95) at org.apache.spark.ErrorClassesJsonReader.$anonfun$errorInfoMap$1(ErrorClassesJSONReader.scala:44) at scala.collection.immutable.List.map(List.scala:293) at org.apache.spark.ErrorClassesJsonReader.<init>(ErrorClassesJSONReader.scala:44) at org.apache.spark.sql.kafka010.KafkaExceptions$.<init>(KafkaExceptions.scala:27) at org.apache.spark.sql.kafka010.KafkaExceptions$.<clinit>(KafkaExceptions.scala) ... 47 elided ``` Closes apache#40705 from HeartSaVioR/SPARK-43067. Authored-by: Jungtaek Lim <[email protected]> Signed-off-by: Jungtaek Lim <[email protected]> (cherry picked from commit 7434702) Signed-off-by: Jungtaek Lim <[email protected]>
…lled. ### What changes were proposed in this pull request? Change `gRPC` to `grpcio`. This is ONLY in the printed message, for users that haven't installed `gRPC`. ### Why are the changes needed? Users that don't have `gRPC` installed will get this error when starting Connect. ModuleNotFoundError Traceback (most recent call last) File /opt/spark/python/pyspark/sql/connect/utils.py:45, in require_minimum_grpc_version() 44 try: ---> 45 import grpc 46 except ImportError as error: ModuleNotFoundError: No module named 'grpc' The above exception was the direct cause of the following exception: ImportError Traceback (most recent call last) Cell In[1], line 11 9 import pyarrow 10 from pyspark import SparkConf, SparkContext ---> 11 from pyspark import pandas as ps 12 from pyspark.sql import SparkSession 13 from pyspark.sql.functions import col, concat, concat_ws, expr, lit, trim File /opt/spark/python/pyspark/pandas/__init__.py:59 50 warnings.warn( 51 "'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to " 52 "set this environment variable to '1' in both driver and executor sides if you use " (...) 55 "already launched." 56 ) 57 os.environ["PYARROW_IGNORE_TIMEZONE"] = "1" ---> 59 from pyspark.pandas.frame import DataFrame 60 from pyspark.pandas.indexes.base import Index 61 from pyspark.pandas.indexes.category import CategoricalIndex File /opt/spark/python/pyspark/pandas/frame.py:88 85 from pyspark.sql.window import Window 87 from pyspark import pandas as ps # For running doctests and reference resolution in PyCharm. ---> 88 from pyspark.pandas._typing import ( 89 Axis, 90 DataFrameOrSeries, 91 Dtype, 92 Label, 93 Name, 94 Scalar, 95 T, 96 GenericColumn, 97 ) 98 from pyspark.pandas.accessors import PandasOnSparkFrameMethods 99 from pyspark.pandas.config import option_context, get_option File /opt/spark/python/pyspark/pandas/_typing.py:25 22 from pandas.api.extensions import ExtensionDtype 24 from pyspark.sql.column import Column as PySparkColumn ---> 25 from pyspark.sql.connect.column import Column as ConnectColumn 26 from pyspark.sql.dataframe import DataFrame as PySparkDataFrame 27 from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame File /opt/spark/python/pyspark/sql/connect/column.py:19 1 # 2 # Licensed to the Apache Software Foundation (ASF) under one or more 3 # contributor license agreements. See the NOTICE file distributed with (...) 15 # limitations under the License. 16 # 17 from pyspark.sql.connect.utils import check_dependencies ---> 19 check_dependencies(__name__) 21 import datetime 22 import decimal File /opt/spark/python/pyspark/sql/connect/utils.py:35, in check_dependencies(mod_name) 33 require_minimum_pandas_version() 34 require_minimum_pyarrow_version() ---> 35 require_minimum_grpc_version() File /opt/spark/python/pyspark/sql/connect/utils.py:47, in require_minimum_grpc_version() 45 import grpc 46 except ImportError as error: ---> 47 raise ImportError( 48 "grpc >= %s must be installed; however, " "it was not found." % minimum_grpc_version 49 ) from error 50 if LooseVersion(grpc.__version__) < LooseVersion(minimum_grpc_version): 51 raise ImportError( 52 "gRPC >= %s must be installed; however, " 53 "your version was %s." % (minimum_grpc_version, grpc.__version__) 54 ) ImportError: grpc >= 1.48.1 must be installed; however, it was not found. The last line suggests that a module named `grpc` is missing. `pip install grpc` Collecting grpc Downloading grpc-1.0.0.tar.gz (5.2 kB) Preparing metadata (setup.py) ... 
error error: subprocess-exited-with-error × python setup.py egg_info did not run successfully. │ exit code: 1 ╰─> [6 lines of output] Traceback (most recent call last): File "<string>", line 2, in <module> File "<pip-setuptools-caller>", line 34, in <module> File "/tmp/pip-install-vp4d8s4c/grpc_c0f1992ad8f7456b8ac09ecbaeb81750/setup.py", line 33, in <module> raise RuntimeError(HINT) RuntimeError: Please install the official package with: pip install grpcio [end of output] note: This error originates from a subprocess, and is likely not a problem with pip. error: metadata-generation-failed × Encountered error while generating package metadata. ╰─> See above for output. note: This is an issue with the package mentioned above, not pip. hint: See above for details. Note: you may need to restart the kernel to use updated packages. [The right way to install this is](https://grpc.io/docs/languages/python/quickstart/) `pip install grpcio` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes apache#40716 from bjornjorgensen/grpc-->-grpcio. Authored-by: bjornjorgensen <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit 07918fe) Signed-off-by: Hyukjin Kwon <[email protected]>
…ated in `beforeAll` ### What changes were proposed in this pull request? Change tests to avoid replacing and dropping a temporary view that is created in `SubquerySuite#beforeAll`. ### Why are the changes needed? When I added a test for SPARK-42937, it tried to use the view `t`, which is created in `beforeAll`. But because other tests would replace and drop this view, the new test would fail. As a result, that new test had to re-create `t` from scratch. This change will allow `t` to be used by new tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes apache#40717 from bersprockets/bad_drop_view. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit b55bf3c) Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request? There are important syntax rules about Cast/Store assignment/Type precedence list in the [ANSI Compliance doc](https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html). As we are going to release the [timestamp_ntz](https://issues.apache.org/jira/browse/SPARK-35662) type in Spark 3.4.0, we should update the doc page as well. ### Why are the changes needed? Better documentation ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual build and verify <img width="1183" alt="image" src="https://user-images.githubusercontent.com/1097932/230692965-02ec5a6e-8b8a-48dc-8049-9a87d26b2ce5.png"> <img width="1068" alt="image" src="https://user-images.githubusercontent.com/1097932/230692988-bd35508c-0577-44c5-8448-f8d3b0aef2ea.png"> <img width="764" alt="image" src="https://user-images.githubusercontent.com/1097932/230693005-cb61a760-ea11-4e6d-bdcb-2738c7c507c6.png"> Closes apache#40711 from gengliangwang/ntz_in_ansi_doc. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit c5db391) Signed-off-by: Hyukjin Kwon <[email protected]>
…T for INSERT source relation ### What changes were proposed in this pull request? This PR extends column default support to allow the ORDER BY, LIMIT, and OFFSET clauses at the end of a SELECT query in the INSERT source relation. For example: ``` create table t1(i boolean, s bigint default 42) using parquet; insert into t1 values (true, 41), (false, default); create table t2(i boolean default true, s bigint default 42, t string default 'abc') using parquet; insert into t2 (i, s) select default, s from t1 order by s limit 1; select * from t2; > true, 41L, "abc" ``` ### Why are the changes needed? This improves usability and helps prevent confusing error messages. ### Does this PR introduce _any_ user-facing change? Yes, SQL queries that previously failed will now succeed. ### How was this patch tested? This PR adds new unit test coverage. Closes apache#40710 from dtenedor/column-default-more-patterns. Authored-by: Daniel Tenedorio <[email protected]> Signed-off-by: Gengliang Wang <[email protected]> (cherry picked from commit 54e84fe) Signed-off-by: Gengliang Wang <[email protected]>
### What changes were proposed in this pull request? This PR aims to mark `StateStoreSuite` and `RocksDBStateStoreSuite` as `ExtendedSQLTest`. ### Why are the changes needed? To balance GitHub Action jobs by offloading heavy tests. - `sql - other tests` took [2 hour 55 minutes](https://github.com/apache/spark/actions/runs/4641961434/jobs/8215437737) - `sql - slow tests` took [1 hour 46 minutes](https://github.com/apache/spark/actions/runs/4641961434/jobs/8215437616) ``` - maintenance (2 seconds, 4 milliseconds) - SPARK-40492: maintenance before unload (2 minutes) - snapshotting (1 second, 96 milliseconds) - SPARK-21145: Restarted queries create new provider instances (1 second, 261 milliseconds) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes apache#40727 from dongjoon-hyun/SPARK-43083. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit 086e25b) Signed-off-by: Dongjoon Hyun <[email protected]>
…able names ### What changes were proposed in this pull request? This PR adds support for column DEFAULT assignment for multi-part table names. ### Why are the changes needed? Spark SQL workloads that refer to tables with multi-part names may want to use column DEFAULT functionality. This PR enables this. ### Does this PR introduce _any_ user-facing change? Yes, column DEFAULT assignment now works with multi-part table names. ### How was this patch tested? This PR adds unit tests. Closes apache#40732 from dtenedor/fix-three-part-name-bug. Authored-by: Daniel Tenedorio <[email protected]> Signed-off-by: Gengliang Wang <[email protected]> (cherry picked from commit fa4978a) Signed-off-by: Gengliang Wang <[email protected]>
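A minimal SQL sketch of what this enables; the catalog and schema names (`testcat.ns`) are hypothetical and assume a configured catalog that supports column DEFAULT values:

```sql
-- Hypothetical three-part table name; any configured catalog/schema should work.
CREATE TABLE testcat.ns.t1 (i BOOLEAN, s BIGINT DEFAULT 42) USING parquet;

-- The DEFAULT keyword now resolves even though the target is a multi-part name.
INSERT INTO testcat.ns.t1 VALUES (TRUE, DEFAULT);

SELECT * FROM testcat.ns.t1;
-- Expected: (true, 42)
```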
### What changes were proposed in this pull request? Due to the recent refactor, I realized that the two Hive UDF expressions are stateful as they both keep an array to store the input arguments. This PR fixes it. ### Why are the changes needed? To avoid issues in a multi-threaded environment. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It is too hard to write unit tests, and the fix itself is very obvious. Closes apache#40781 from cloud-fan/hive. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Chao Sun <[email protected]> (cherry picked from commit e117f62) Signed-off-by: Chao Sun <[email protected]>
… Null Message ### What changes were proposed in this pull request? Fix the bug where the Connect server throws an exception without a message. ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unnecessary Closes apache#40780 from Hisoka-X/SPARK-43125_Exception_NPE. Authored-by: Hisoka <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit ea49637) Signed-off-by: Hyukjin Kwon <[email protected]>
…rouping functions ### What changes were proposed in this pull request? This PR fixes constructing aggregate expressions by replacing grouping functions if an expression is part of an aggregation. In the following example, the second `b` should also be replaced: <img width="545" alt="image" src="https://user-images.githubusercontent.com/5399861/230415618-84cd6334-690e-4b0b-867b-ccc4056226a8.png"> ### Why are the changes needed? Fix bug: ``` spark-sql (default)> SELECT CASE WHEN a IS NULL THEN count(b) WHEN b IS NULL THEN count(c) END > FROM grouping > GROUP BY GROUPING SETS (a, b, c); [MISSING_AGGREGATION] The non-aggregating expression "b" is based on columns which are not participating in the GROUP BY clause. ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes apache#40685 from wangyum/SPARK-43050. Authored-by: Yuming Wang <[email protected]> Signed-off-by: Yuming Wang <[email protected]> (cherry picked from commit 45b84cd) Signed-off-by: Yuming Wang <[email protected]> # Conflicts: # sql/core/src/test/resources/sql-tests/analyzer-results/grouping_set.sql.out
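The same repro, formatted for readability, with the behavior described in this change; the `grouping` table is the one assumed by the example above:

```sql
-- Repro from the description above.
SELECT CASE WHEN a IS NULL THEN count(b) WHEN b IS NULL THEN count(c) END
FROM grouping
GROUP BY GROUPING SETS (a, b, c);
-- Before the fix: fails with [MISSING_AGGREGATION] because the second reference
-- to `b` was not replaced by a grouping function; after the fix, the query resolves.
```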
…-dml-insert-table.md ### What changes were proposed in this pull request? This PR fixes incorrect column names in [sql-ref-syntax-dml-insert-table.md](https://spark.apache.org/docs/3.4.0/sql-ref-syntax-dml-insert-table.html). ### Why are the changes needed? Bug fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A. Closes apache#40807 from wangyum/SPARK-43139. Authored-by: Yuming Wang <[email protected]> Signed-off-by: Yuming Wang <[email protected]> (cherry picked from commit 4db8099) Signed-off-by: Yuming Wang <[email protected]>
…link ### What changes were proposed in this pull request? This PR makes the PySpark Connect quickstart work (https://mybinder.org/v2/gh/apache/spark/87a5442f7e?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb) ### Why are the changes needed? So end users can try it out easily; currently it fails. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the quickstart. ### How was this patch tested? Manually tested at https://mybinder.org/v2/gh/HyukjinKwon/spark/quickstart-connect-working?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb Closes apache#40813 from HyukjinKwon/quickstart-connect-working. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit 7012732) Signed-off-by: Hyukjin Kwon <[email protected]>
…ration ### What changes were proposed in this pull request? This PR proposes to set the upper bound for pandas in Binder integration. We don't currently support pandas 2.0.0 properly, see also https://issues.apache.org/jira/browse/SPARK-42618 ### Why are the changes needed? To make the quickstarts work. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the quickstart. ### How was this patch tested? Tested in: - https://mybinder.org/v2/gh/HyukjinKwon/spark/set-lower-bound-pandas?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb - https://mybinder.org/v2/gh/HyukjinKwon/spark/set-lower-bound-pandas?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb - https://mybinder.org/v2/gh/HyukjinKwon/spark/set-lower-bound-pandas?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb Closes apache#40814 from HyukjinKwon/set-lower-bound-pandas. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit a2592e6) Signed-off-by: Hyukjin Kwon <[email protected]>
…rk in Binder integration ### What changes were proposed in this pull request? This PR fixes Binder integration version strings in case `dev0` is specified. It should work in master branch too (when users manually build the docs and test) ### Why are the changes needed? For end users to run quickstarts. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the user-facing quick start. ### How was this patch tested? Manually tested at https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-42475-followup-2?labpath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_connect.ipynb Closes apache#40815 from HyukjinKwon/SPARK-42475-followup-2. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit c408360) Signed-off-by: Hyukjin Kwon <[email protected]>
…v0 to work in Binder integration" This reverts commit 70e86ba.
This PR excludes Java files in `core/target` when running checkstyle with SBT. Files such as .../spark/core/target/scala-2.12/src_managed/main/org/apache/spark/status/protobuf/StoreTypes.java are checked by checkstyle. We shouldn't check them in the linter. This introduces no user-facing change (dev-only). Manually ran: ```bash ./dev/sbt-checkstyle ``` Closes apache#40792 from HyukjinKwon/SPARK-43141. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit ca2ddf3) Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request? This is a follow-up for apache#39591 based on the suggestions from comment apache@1de8350#r109023188. ### Why are the changes needed? For backward compatibility. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The existing CI should pass. Closes apache#40816 from itholic/42078-followup. Authored-by: itholic <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit 0457219) Signed-off-by: Hyukjin Kwon <[email protected]>
…e for a bound condition ### What changes were proposed in this pull request? In `JoinCodegenSupport#getJoinCondition`, evaluate any referenced stream-side variables before using them in the generated code. This patch doesn't evaluate the passed stream-side variables directly, but instead evaluates a copy (`streamVars2`). This is because `SortMergeJoin#codegenFullOuter` will want to evaluate the stream-side vars within a different scope than the condition check, so we mustn't delete the initialization code from the original `ExprCode` instances. ### Why are the changes needed? When a bound condition of a full outer join references the same stream-side column more than once, wholestage codegen generates bad code. For example, the following query fails with a compilation error: ``` create or replace temp view v1 as select * from values (1, 1), (2, 2), (3, 1) as v1(key, value); create or replace temp view v2 as select * from values (1, 22, 22), (3, -1, -1), (7, null, null) as v2(a, b, c); select * from v1 full outer join v2 on key = a and value > b and value > c; ``` The error is: ``` org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 277, Column 9: Redefinition of local variable "smj_isNull_7" ``` The same error occurs with code generated from ShuffleHashJoinExec: ``` select /*+ SHUFFLE_HASH(v2) */ * from v1 full outer join v2 on key = a and value > b and value > c; ``` In this case, the error is: ``` org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 174, Column 5: Redefinition of local variable "shj_value_1" ``` Neither `SortMergeJoin#codegenFullOuter` nor `ShuffledHashJoinExec#doProduce` evaluates the stream-side variables before calling `consumeFullOuterJoinRow#getJoinCondition`. As a result, `getJoinCondition` generates definition/initialization code for each referenced stream-side variable at the point of use. If a stream-side variable is used more than once in the bound condition, the definition/initialization code is generated more than once, resulting in the "Redefinition of local variable" error. In the end, the query succeeds, since Spark disables wholestage codegen and tries again. (In the case of other join-type/strategy pairs, either the implementations don't call `JoinCodegenSupport#getJoinCondition`, or the stream-side variables are pre-evaluated before the call is made, so no error happens in those cases). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit tests. Closes apache#40766 from bersprockets/full_join_codegen_issue. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit 119ec5b) Signed-off-by: Hyukjin Kwon <[email protected]>
… group by clause ### What changes were proposed in this pull request? Fix a correctness bug for scalar subqueries with COUNT and a GROUP BY clause, for example: ``` create view t1(c1, c2) as values (0, 1), (1, 2); create view t2(c1, c2) as values (0, 2), (0, 3); select c1, c2, (select count(*) from t2 where t1.c1 = t2.c1 group by c1) from t1; -- Correct answer: [(0, 1, 2), (1, 2, null)] +---+---+------------------+ |c1 |c2 |scalarsubquery(c1)| +---+---+------------------+ |0 |1 |2 | |1 |2 |0 | +---+---+------------------+ ``` This is due to a bug in our "COUNT bug" handling for scalar subqueries. For a subquery with COUNT aggregate but no GROUP BY clause, 0 is the correct output on empty inputs, and we use the COUNT bug handling to construct the plan that yields 0 when there were no matched rows. But when there is a GROUP BY clause then NULL is the correct output (i.e. there is no COUNT bug), but we still incorrectly use the COUNT bug handling and therefore incorrectly output 0. Instead, we need to only apply the COUNT bug handling when the scalar subquery had no GROUP BY clause. To fix this, we need to track whether the scalar subquery has a GROUP BY, i.e. a non-empty groupingExpressions for the Aggregate node. This needs to be checked before subquery decorrelation, because that adds the correlated outer refs to the group-by list, so after that the group-by is always non-empty. We save it in a boolean in the ScalarSubquery node until later when we rewrite the subquery into a join in constructLeftJoins. This is a long-standing bug. This bug affected both the current DecorrelateInnerQuery framework and the old code (with spark.sql.optimizer.decorrelateInnerQuery.enabled = false), and this PR fixes both. ### Why are the changes needed? Fix a correctness bug. ### Does this PR introduce _any_ user-facing change? Yes, it fixes incorrect query results. ### How was this patch tested? Add SQL tests and unit tests. (Note that there were 2 existing unit tests for queries of this shape, which had the incorrect results as golden results.) Closes apache#40811 from jchen5/count-bug. Authored-by: Jack Chen <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit c1a02e7) Signed-off-by: Wenchen Fan <[email protected]>
… value for unmatched row ### What changes were proposed in this pull request? When doing an outer join with joinWith on DataFrames, unmatched rows return Row objects with null fields instead of a single null value. This is not the expected behavior, and it's a regression introduced in [this commit](apache@cd92f25). This pull request aims to fix the regression; note this is not a full rollback of the commit, as it does not add back the "schema" variable. ``` case class ClassData(a: String, b: Int) val left = Seq(ClassData("a", 1), ClassData("b", 2)).toDF val right = Seq(ClassData("x", 2), ClassData("y", 3)).toDF left.joinWith(right, left("b") === right("b"), "left_outer").collect ``` ``` Wrong results (current behavior): Array(([a,1],[null,null]), ([b,2],[x,2])) Correct results: Array(([a,1],null), ([b,2],[x,2])) ``` ### Why are the changes needed? We need to address the regression mentioned above. It results in unexpected behavior changes in the Dataframe joinWith API between versions 2.4.8 and 3.0.0+. This could potentially cause data correctness issues for users who expect the old behavior when using Spark 3.0.0+. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a unit test (using the same test as in the previous [closed pull request](apache#35140), credit to Clément de Groc). Ran the sql-core and sql-catalyst submodules locally with ./build/mvn clean package -pl sql/core,sql/catalyst Closes apache#40755 from kings129/encoder_bug_fix. Authored-by: --global <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 74ce620) Signed-off-by: Wenchen Fan <[email protected]>
…d on database URI ### What changes were proposed in this pull request? This reverts apache#36625 and its follow-up apache#38321. ### Why are the changes needed? An external table location can be arbitrary and has no connection with the database location. It can be wrong to qualify the external table location based on the database location. If a table written by old Spark versions does not have a qualified location, there is no way to restore it as the information is already lost. People can manually fix the table locations themselves, assuming they are under the same HDFS cluster as the database location. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes apache#40871 from cloud-fan/minor. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit afd9e2c) Signed-off-by: Hyukjin Kwon <[email protected]>
…aFrame.offset` ### What changes were proposed in this pull request? Fix the incorrect doc. ### Why are the changes needed? The description of the parameter `num` is incorrect; it actually describes the `num` in `DataFrame.{limit, tail}`. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Existing UT. Closes apache#40872 from zhengruifeng/fix_doc_offset. Authored-by: Ruifeng Zheng <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit a28d2d7) Signed-off-by: Hyukjin Kwon <[email protected]>
…iables ### What changes were proposed in this pull request? Add a comment explaining a tricky situation involving the evaluation of stream-side variables. This is a follow-up to apache#40766. ### Why are the changes needed? Make the code more clear. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A Closes apache#40881 from bersprockets/SPARK-43113_followup. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 6d4ed13) Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? This patch fixes a minor issue in the code where for SQL Commands the plan metrics are not sent to the client. In addition, it renames a method to make clear that the method does not actually send anything but only creates the response object. ### Why are the changes needed? Clarity ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes apache#40899 from grundprinzip/fix_sql_stats. Authored-by: Martin Grund <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]> (cherry picked from commit 9d05053) Signed-off-by: Ruifeng Zheng <[email protected]>
…mal columns This is a follow-up of apache#39596 to fix more corner cases. It ignores the special column flag that requires qualified access for normal output attributes, as the flag should be effective only for metadata columns. It's very hard to make sure that we don't leak the special column flag. Since the bug has been in the Spark release for a while, there may be tables created with CTAS whose table schema contains the special flag. No user-facing change. Tested with a new analysis test. Closes apache#40961 from cloud-fan/col. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 021f02e) Signed-off-by: Wenchen Fan <[email protected]>
…ar subquery ### What changes were proposed in this pull request? Cherry-pick the fix for the COUNT(*)-is-null bug in correlated scalar subqueries from apache#40865 and apache#40946. ### Why are the changes needed? Fix the COUNT(*)-is-null bug in correlated scalar subqueries in branch-3.4. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a new test. Closes apache#40977 from Hisoka-X/count_bug. Lead-authored-by: Hisoka <[email protected]> Co-authored-by: Jack Chen <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
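A minimal illustration of the class of bug being fixed here; the view names and values are hypothetical:

```sql
-- Hypothetical data: c1 = 1 in t1 has no matching rows in t2.
CREATE OR REPLACE TEMP VIEW t1(c1) AS VALUES (0), (1);
CREATE OR REPLACE TEMP VIEW t2(c1) AS VALUES (0), (0);

SELECT c1, (SELECT count(*) FROM t2 WHERE t2.c1 = t1.c1) AS cnt FROM t1;
-- With no GROUP BY in the subquery, cnt should be 0 for the unmatched row (c1 = 1);
-- the bug caused it to come back as NULL instead.
```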
…s timezone ### What changes were proposed in this pull request? Casting between Timestamp and TimestampNTZ requires a timezone since the timezone id is used in the evaluation. This PR updates the method `Cast.needsTimeZone` to include the conversion between Timestamp and TimestampNTZ. As a result: * Casting between Timestamp and TimestampNTZ is considered unresolved unless the timezone is defined * Canonicalizing cast will include the time zone ### Why are the changes needed? The timezone id is used in the evaluation between Timestamp and TimestampNTZ, thus we should mark such a conversion as "needsTimeZone" ### Does this PR introduce _any_ user-facing change? No. It is more of a fix for a potential bug where the cast between Timestamp and TimestampNTZ is marked as resolved before the timezone is resolved. ### How was this patch tested? New UT Closes apache#41010 from gengliangwang/fixCast. Authored-by: Gengliang Wang <[email protected]> Signed-off-by: Gengliang Wang <[email protected]> (cherry picked from commit a1c7035) Signed-off-by: Gengliang Wang <[email protected]>
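A short sketch of why the timezone matters for this cast; the session timezone values are only examples:

```sql
-- The same TIMESTAMP_NTZ value maps to different instants depending on the
-- session time zone, so the cast cannot be evaluated without a timezone id.
SET spark.sql.session.timeZone=America/Los_Angeles;
SELECT CAST(TIMESTAMP_NTZ'2023-05-01 10:00:00' AS TIMESTAMP);

SET spark.sql.session.timeZone=UTC;
SELECT CAST(TIMESTAMP_NTZ'2023-05-01 10:00:00' AS TIMESTAMP);
-- The two results denote different points in time for the same NTZ input.
```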
…ERT actions ### What changes were proposed in this pull request? This PR updates column DEFAULT assignment to add missing values for MERGE INSERT actions. This brings the behavior to parity with non-MERGE INSERT commands. * It also adds a small convenience feature: if the provided default value is a literal of a wider type than the target column, but the literal value fits within the narrower type, we just coerce it. For example, `CREATE TABLE t (col INT DEFAULT 42L)` returns an error before this PR because `42L` has a long integer type which is wider than `col`, but after this PR we just coerce it to `42` since the value fits within the narrower integer range. * We also add the `SupportsCustomSchemaWrite` interface which tables may implement to exclude certain pseudocolumns from consideration when resolving column DEFAULT values. ### Why are the changes needed? These changes make column DEFAULT values more usable in more types of situations. ### Does this PR introduce _any_ user-facing change? Yes, see above. ### How was this patch tested? This PR adds new unit test coverage. Closes apache#40996 from dtenedor/merge-actions. Authored-by: Daniel Tenedorio <[email protected]> Signed-off-by: Gengliang Wang <[email protected]> (cherry picked from commit 3a0e6bd) Signed-off-by: Gengliang Wang <[email protected]>
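A hedged sketch of the two behaviors described above; `target` and `source` are hypothetical tables, and the MERGE assumes a table implementation that supports row-level operations:

```sql
-- Convenience coercion: 42L is a BIGINT literal, but it fits in INT, so the
-- DEFAULT is accepted and coerced to 42 instead of being rejected.
CREATE TABLE target (pk INT, col INT DEFAULT 42L) USING parquet;
CREATE TABLE source (pk INT) USING parquet;
INSERT INTO source VALUES (1);

-- MERGE INSERT action that omits `col`: after this change the missing column is
-- filled with its DEFAULT value, matching non-MERGE INSERT behavior.
-- (Requires a table source that supports MERGE INTO.)
MERGE INTO target t
USING source s
ON t.pk = s.pk
WHEN NOT MATCHED THEN INSERT (pk) VALUES (s.pk);
```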
…eryStageExec ### What changes were proposed in this pull request? This PR fixes computing stats when `BaseAggregateExec` nodes are above `QueryStageExec`. For aggregation, when the number of shuffle output rows is 0, the final result may be 1. For example: ```sql SELECT count(*) FROM tbl WHERE false; ``` The number of shuffle output rows is 0, and the final result is 1. Please see the [UI](https://github.com/apache/spark/assets/5399861/9d9ad999-b3a9-433e-9caf-c0b931423891). ### Why are the changes needed? Fixes a data correctness issue. `OptimizeOneRowPlan` will use stats to remove `Aggregate`: ``` === Applying Rule org.apache.spark.sql.catalyst.optimizer.OptimizeOneRowPlan === !Aggregate [id#5L], [id#5L] Project [id#5L] +- Union false, false +- Union false, false :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)]) :- LogicalQueryStage Aggregate [sum(id#0L) AS id#5L], HashAggregate(keys=[], functions=[sum(id#0L)]) +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)]) +- LogicalQueryStage Aggregate [sum(id#18L) AS id#12L], HashAggregate(keys=[], functions=[sum(id#18L)]) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes apache#41576 from wangyum/SPARK-44040. Authored-by: Yuming Wang <[email protected]> Signed-off-by: Yuming Wang <[email protected]> (cherry picked from commit 55ba63c) Signed-off-by: Yuming Wang <[email protected]>
### What changes were proposed in this pull request? Bump snappy-java from 1.1.10.0 to 1.1.10.1. ### Why are the changes needed? This is mostly a security release; the notable changes are CVE fixes. - CVE-2023-34453 Integer overflow in shuffle - CVE-2023-34454 Integer overflow in compress - CVE-2023-34455 Unchecked chunk length Full changelog: https://github.com/xerial/snappy-java/releases/tag/v1.1.10.1 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. Closes apache#41616 from pan3793/SPARK-44070. Authored-by: Cheng Pan <[email protected]> Signed-off-by: Yuming Wang <[email protected]> (cherry picked from commit 0502a42) Signed-off-by: Yuming Wang <[email protected]>
### What changes were proposed in this pull request? This PR fixes all dead links in the K8s docs. ### Why are the changes needed? <img width="797" alt="image" src="https://github.com/apache/spark/assets/5399861/3ba3f048-776c-42e6-b455-86e90b6ef22f"> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes apache#41635 from wangyum/kubernetes. Authored-by: Yuming Wang <[email protected]> Signed-off-by: Yuming Wang <[email protected]> (cherry picked from commit 1ff6704) Signed-off-by: Yuming Wang <[email protected]>
…xpression ### What changes were proposed in this pull request? The `hashCode()` of `UserDefinedScalarFunc` and `GeneralScalarExpression` is not good enough. For example, `GeneralScalarExpression` uses `Objects.hash(name, children)`: it takes the hash code of `name` and of the `children` array reference and combines them as the `GeneralScalarExpression`'s hash code. In fact, we should use the hash code of each element in `children`. Because `UserDefinedAggregateFunc` and `GeneralAggregateFunc` are missing `hashCode()`, this PR also adds them. This PR also improves the toString for `UserDefinedAggregateFunc` and `GeneralAggregateFunc` by using primitive boolean comparison instead of `Objects.equals`, because primitive boolean comparison performs better than `Objects.equals`. ### Why are the changes needed? Improve the hash code for some DS V2 Expressions. ### Does this PR introduce _any_ user-facing change? 'Yes'. ### How was this patch tested? N/A Closes apache#41543 from beliefer/SPARK-44018. Authored-by: Jiaan Geng <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 8c84d2c) Signed-off-by: Wenchen Fan <[email protected]>
catalinii approved these changes on Oct 10, 2023
This is the PR for Spark 3.4.1.