
State source value #4

Closed
wants to merge 169 commits into from

Commits on Jul 5, 2024

  1. 660448c
  2. Misc updates

    anishshri-db committed Jul 5, 2024
    0fc24fd
  3. Misc updates

    anishshri-db committed Jul 5, 2024
    416ebfa
  4. Fix test

    anishshri-db committed Jul 5, 2024
    a80c9b2
  5. Misc fix

    anishshri-db committed Jul 5, 2024
    2158e0f

Commits on Jul 22, 2024

  1. [SPARK-48177][BUILD][FOLLOWUP] Update parquet version in `sql-data-so…

    …urces-parquet.md` doc
    
    ### What changes were proposed in this pull request?
    
    This PR aims to update the parquet version in the `sql-data-sources-parquet.md` doc.
    
    ### Why are the changes needed?
    
    In order to keep it consistent with the version of parquet in the dependencies.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA and manually confirmed that the new links can be opened.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47242 from wayneguow/SPARK-48177.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    9a97ba1
  2. [SPARK-48720][SQL] Align the command `ALTER TABLE ... UNSET TBLPROPER…

    …TIES ...` in v1 and v2
    
    ### What changes were proposed in this pull request?
    The PR aims to:
    - align the command `ALTER TABLE ... UNSET TBLPROPERTIES ...` in v1 and v2
    (this means that in v1, regardless of whether `IF EXISTS` is specified, unsetting a `non-existent` property is now `ignored` and no longer `fails`).
    - update the description of `ALTER TABLE ... UNSET TBLPROPERTIES ...` in the doc `docs/sql-ref-syntax-ddl-alter-table.md`.
    - unify the v1 and v2 `ALTER TABLE ... UNSET TBLPROPERTIES ...` tests.
    - add the following `scenarios` for `ALTER TABLE ... SET TBLPROPERTIES ...` testing:
      A. `table to alter does not exist`
      B. `alter table set reserved properties`
    
    ### Why are the changes needed?
    - align the command `ALTER TABLE ... UNSET TBLPROPERTIES ...` in v1 and v2 to avoid confusing end-users.
    - improve test coverage.
    - align with other similar tests, e.g. `AlterTableSetTblProperties*`.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. In `v1`, regardless of whether `IF EXISTS` is specified, unsetting a `non-existent` property is now `ignored` and no longer `fails`.
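
    A minimal spark-shell sketch of the aligned behavior (table and property names here are made up, and a running `spark` session is assumed):

    ```scala
    // Hypothetical illustration of the aligned v1 behavior after this change.
    spark.sql("CREATE TABLE tbl (c1 INT) USING parquet TBLPROPERTIES ('k1' = 'v1')")
    spark.sql("ALTER TABLE tbl UNSET TBLPROPERTIES ('k1')")                      // removes k1
    spark.sql("ALTER TABLE tbl UNSET TBLPROPERTIES ('no_such_key')")             // now ignored, no longer fails
    spark.sql("ALTER TABLE tbl UNSET TBLPROPERTIES IF EXISTS ('no_such_key')")   // same result as above
    ```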
    
    ### How was this patch tested?
    Update some UT & Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47097 from panbingkun/alter_unset_table.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    bdaecdf
  3. [SPARK-48800][CONNECT][SS] Deflake ClientStreamingQuerySuite

    ### What changes were proposed in this pull request?
    
    The listener test in `ClientStreamingQuerySuite` is flaky.
    
    For client-side listeners, the terminated events might take a while before arriving at the client. This test is currently flaky; example: https://github.com/anishshri-db/spark/actions/runs/9785389228/job/27018350836
    
    This PR tries to deflake it by waiting for a longer time.
    
    ### Why are the changes needed?
    
    Deflake test
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Test only change
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47205 from WweiL/deflake-listener-client-scala.
    
    Authored-by: Wei Liu <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    WweiL authored and jingz-db committed Jul 22, 2024
    17a292d
  4. [SPARK-48802][SS][FOLLOWUP] FileStreamSource maxCachedFiles set to 0 …

    …causes batch with no files to be processed
    
    ### What changes were proposed in this pull request?
    
    This is a follow-up to a bug identified in apache#45362. When setting `maxCachedFiles` to 0 (to force a full relisting of files for each batch, see https://issues.apache.org/jira/browse/SPARK-44924), subsequent batches of files would be skipped due to a logic error that carried forward an empty array of `unreadFiles`, which was only being null-checked. This update adds checks to verify that `unreadFiles` is also non-empty, as a guard condition to prevent batches executing with no files, and ensures that `unreadFiles` is only set if a) there are files remaining in the listing and b) `maxCachedFiles` is greater than 0.
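
    A hedged usage sketch of the scenario in question (the path and format are made up, and it is assumed here that `maxCachedFiles` is passed as a file-source option alongside `maxFilesPerTrigger`, per SPARK-44924):

    ```scala
    // Hypothetical usage: force a full file listing on every micro-batch.
    val stream = spark.readStream
      .format("text")
      .option("maxFilesPerTrigger", 10)
      .option("maxCachedFiles", 0)   // with this fix, 0 relists files each batch instead of producing empty batches
      .load("/tmp/streaming-input")
    ```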
    
    ### Why are the changes needed?
    
    Setting the `maxCachedFiles` configuration to 0 would inadvertently cause every other batch to contain 0 files, which is an unexpected behavior for users.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Fixes the case where users may want to always perform a full listing of files each batch by setting `maxCachedFiles` to 0
    
    ### How was this patch tested?
    
    New test added to verify `maxCachedFiles` set to 0 would perform a file listing each batch
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47195 from ragnarok56/filestreamsource-maxcachedfiles-edgecase.
    
    Lead-authored-by: ragnarok56 <[email protected]>
    Co-authored-by: Kevin Nacios <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    ragnarok56 authored and jingz-db committed Jul 22, 2024
    ee8bf0c
  5. [SPARK-48810][CONNECT] Session stop() API should be idempotent and no…

    …t fail if the session is already closed by the server
    
    ### What changes were proposed in this pull request?
    
    Improve the error handling of the `stop()` API in the `SparkSession`
    class so that it does not throw if there is any error related to releasing a session or
    closing the underlying gRPC channel. Both are best effort.
    
    In the case of Pyspark, do not fail if the local Spark Connect service
    cannot be stopped.
    
    ### Why are the changes needed?
    
    In some cases, the Spark Connect Service will terminate the session, usually
    because the underlying cluster or driver has restarted.
    In these cases, calling stop() throws an error that is unactionable. However,
    stop() still needs to be called in order to reset the active session.
    
    Further, the stop() API should be idempotent.
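
    A rough Spark Connect (Scala client) sketch of the intended behavior; the connection string is made up and the failure scenario is simulated only in the comments:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
    // ... the server closes the session, e.g. after a cluster or driver restart ...
    spark.stop()   // best effort: release-session and channel-close errors are swallowed
    spark.stop()   // idempotent: calling it again does not fail
    ```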
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Attached unit tests.
    
    Confirmed that removing the code changes results in the
    tests failing (as expected).
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47215 from nija-at/session-stop.
    
    Authored-by: Niranjan Jayakar <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    nija-at authored and jingz-db committed Jul 22, 2024
    3c1c316
  6. [SPARK-48818][PYTHON] Simplify percentile functions

    ### What changes were proposed in this pull request?
     Simplify `percentile` functions
    
    ### Why are the changes needed?
    existing implementations are unnecessarily complicated
    
    ### Does this PR introduce _any_ user-facing change?
    No, minor refactoring
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47225 from zhengruifeng/func_refactor_1.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    bec1c16
  7. [SPARK-48825][DOCS] Unify the 'See Also' section formatting across Py…

    …Spark docstrings
    
    ### What changes were proposed in this pull request?
    
    This PR unifies the 'See Also' section formatting across PySpark docstrings and fixes some invalid references.
    
    ### Why are the changes needed?
    
    To improve PySpark documentation
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    doctest
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47240 from allisonwang-db/spark-48825-also-see-docs.
    
    Authored-by: allisonwang-db <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    allisonwang-db authored and jingz-db committed Jul 22, 2024
    9703899
  8. [SPARK-48343][SQL] Introduction of SQL Scripting interpreter

    ### What changes were proposed in this pull request?
    A previous [PR](apache#46665) introduced parser changes for SQL Scripting. This PR is a follow-up that introduces the interpreter for the SQL Scripting language and proposes the following changes:
    - `SqlScriptingExecutionNode` - introduces execution nodes for SQL scripting, used during interpretation phase:
      - `SingleStatementExec` - executable node for `SingleStatement` logical node; wraps logical plan of the single statement.
      - `CompoundNestedStatementIteratorExec` - implements base recursive iterator logic for all nesting statements.
      - `CompoundBodyExec` - concrete implementation of `CompoundNestedStatementIteratorExec` for `CompoundBody` logical node.
    - `SqlScriptingInterpreter` - introduces the interpreter for SQL scripts. The product of interpretation is an iterator over the statements that should be executed.
    
    Follow-up PRs will introduce further statements, support for exceptions thrown from parser/interpreter, exception handling in SQL, etc.
    More details can be found in [Jira item](https://issues.apache.org/jira/browse/SPARK-48343) for this task and its parent (where the design doc is uploaded as well).
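
    As a hedged illustration only (the sql() API is unchanged by this PR), this is the kind of compound script the interpreter walks, statement by statement; each SELECT maps to a `SingleStatementExec` and each BEGIN ... END block to a `CompoundBodyExec`:

    ```scala
    // Illustrative script text; nesting exercises the recursive iterator logic described above.
    val script =
      """BEGIN
        |  SELECT 1;
        |  BEGIN
        |    SELECT 2;
        |    SELECT 3;
        |  END;
        |END""".stripMargin
    ```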
    
    ### Why are the changes needed?
    The intent is to add support for SQL scripting (and stored procedures down the line). It gives users the ability to develop complex logic and ETL entirely in SQL.
    
    Until now, users had to write verbose SQL statements or combine SQL + Python to efficiently write the logic. This is an effort to bridge that gap and enable complex logic to be written entirely in SQL.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    This PR is the second in a series of PRs that will introduce changes to the sql() API to add support for SQL scripting, but for now, the API remains unchanged.
    In the future, the API will remain the same as well, but it will gain the new ability to execute SQL scripts.
    
    ### How was this patch tested?
    There are tests for the newly introduced changes:
    - `SqlScriptingExecutionNodeSuite` - unit tests for execution nodes.
    - `SqlScriptingInterpreterSuite` - tests for interpreter (with parser integration).
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47026 from davidm-db/sql_scripting_interpreter.
    
    Authored-by: David Milicevic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    davidm-db authored and jingz-db committed Jul 22, 2024
    357b483
  9. [SPARK-48776] Fix timestamp formatting for json, xml and csv

    ### What changes were proposed in this pull request?
    
    In this pull request I propose to change the default ISO pattern we use for formatting timestamps when writing to JSON, XML and/or CSV, as well as when to_(xml|json|csv) is used.

    Older timestamps sometimes have offsets that contain a seconds part as well. The current default formatting omits the seconds, hence producing wrong results.
    
    e.g.
    ```
    sql("SET spark.sql.session.timeZone=America/Los_Angeles")
    sql("SELECT to_json(struct(CAST('1800-01-01T00:00:00+00:00' AS TIMESTAMP) AS ts))").show(false)
    {"ts":"1799-12-31T16:07:02.000-07:52"}
    ```
    
    ### Why are the changes needed?
    
    This is a correctness issue.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, users will now see different results for older timestamps (correct ones).
    
    ### How was this patch tested?
    
    Tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47177 from milastdbx/dev/milast/fixJsonTimestampHandling.
    
    Authored-by: milastdbx <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    milastdbx authored and jingz-db committed Jul 22, 2024
    c733573
  10. [SPARK-48798][PYTHON] Introduce spark.profile.render for SparkSessi…

    …on-based profiling
    
    ### What changes were proposed in this pull request?
    
    Introduces `spark.profile.render` for SparkSession-based profiling.
    
    It uses [`flameprof`](https://github.com/baverman/flameprof/) for the default renderer.
    
    ```
    $ pip install flameprof
    ```
    
    run `pyspark` on Jupyter notebook:
    
    ```py
    from pyspark.sql.functions import pandas_udf
    
    spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
    
    df = spark.range(10)
    
    pandas_udf("long")
    def add1(x):
        return x + 1
    
    added = df.select(add1("id"))
    added.show()
    
    spark.profile.render(id=2)
    ```
    
    <img width="1103" alt="pyspark-udf-profile" src="https://github.com/apache/spark/assets/506656/795972e8-f7eb-4b89-89fc-3d8d18b86541">
    
    On CLI, it will return `svg` source string.
    
    ```py
    '<?xml version="1.0" standalone="no"?>\n<!DOCTYPE svg  ...
    ```
    
    Currently only `renderer="flameprof"` for `type="perf"` is supported as a builtin renderer.
    
    You can also pass an arbitrary renderer.
    
    ```py
    def render_perf(stats):
        ...
    spark.profile.render(id=2, type="perf", renderer=render_perf)
    
    def render_memory(codemap):
        ...
    spark.profile.render(id=2, type="memory", renderer=render_memory)
    ```
    
    ### Why are the changes needed?
    
    Better debuggability.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, `spark.profile.render` will be available.
    
    ### How was this patch tested?
    
    Added/updated the related tests, and manually.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47202 from ueshin/issues/SPARK-48798/render.
    
    Authored-by: Takuya Ueshin <[email protected]>
    Signed-off-by: Takuya Ueshin <[email protected]>
    ueshin authored and jingz-db committed Jul 22, 2024
    632825d
  11. [MINOR][PYTHON] Eliminating warnings for panda

    ### What changes were proposed in this pull request?
    The PR aims to eliminate warnings for pandas: `<string>:5: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.`
    
    ### Why are the changes needed?
    Only eliminates warnings for pandas.
    https://github.com/panbingkun/spark/actions/runs/9795675050/job/27048513673
    <img width="856" alt="image" src="https://github.com/apache/spark/assets/15246973/ea70e922-897e-450f-b150-3d38d7f20930">
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    - Pass GA.
    - Manually test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47222 from panbingkun/remove_pandas_warning.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    561bfc2
  12. [SPARK-48307][SQL][FOLLOWUP] Eliminate the use of mutable.ArrayBuffer

    ### What changes were proposed in this pull request?
    
    We can eliminate the use of mutable.ArrayBuffer by using `flatMap`.
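
    A generic sketch of the refactoring pattern (not the actual Spark code):

    ```scala
    import scala.collection.mutable

    // Accumulating into a mutable buffer ...
    val input = Seq(Some(1), None, Some(3))
    val buf = mutable.ArrayBuffer[Int]()
    input.foreach(opt => opt.foreach(v => buf += v))

    // ... can be expressed as a single flatMap with no mutable state.
    val result = input.flatMap(_.toSeq)   // Seq(1, 3)
    ```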
    
    ### Why are the changes needed?
    
    Code simplification and optimization.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing UT
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47185 from amaliujia/followup_cte.
    
    Lead-authored-by: Rui Wang <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    86bff6d
  13. [SPARK-48742][SS] Virtual Column Family for RocksDB

    ### What changes were proposed in this pull request?
    
    Introducing virtual column families to RocksDB. We attach a 2-byte ID prefix as a column family identifier to each key row that is put into RocksDB. The encoding and decoding of the virtual column family prefix happen at the `RocksDBKeyEncoder` layer, where we can pre-allocate the extra 2 bytes and avoid an additional memcpy.
    
    - Remove physical column family related code, as it becomes potentially dead code until some caller starts using it.
    - Remove `useColumnFamilies` from `StateStoreChangelogV2` API.
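
    A hedged sketch of the prefixing idea described above (names and byte layout are illustrative, not the actual `RocksDBKeyEncoder` code):

    ```scala
    // Prepend a 2-byte virtual column family id to each key, with a single array copy.
    def encodeWithColFamily(cfId: Short, key: Array[Byte]): Array[Byte] = {
      val out = new Array[Byte](2 + key.length)       // pre-allocate the extra 2 bytes
      out(0) = ((cfId >> 8) & 0xFF).toByte
      out(1) = (cfId & 0xFF).toByte
      System.arraycopy(key, 0, out, 2, key.length)
      out
    }

    // Recover the virtual column family id from an encoded key.
    def decodeColFamily(encoded: Array[Byte]): Short =
      (((encoded(0) & 0xFF) << 8) | (encoded(1) & 0xFF)).toShort
    ```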
    
    ### Why are the changes needed?
    
    Currently, within the scope of the arbitrary stateful API v2 (transformWithState) project, each state variable is stored inside one [physical column family](https://github.com/facebook/rocksdb/wiki/Column-Families) within the RocksDB state store instance. Column families are also used to implement secondary indexes for various features. Each physical column family has its own memtables, creates its own SST files, and handles compaction independently on those independent SST files.

    When the number of operations to RocksDB is relatively small and the number of column families is relatively large, the overhead of handling small SST files becomes high, especially since all of these have to be uploaded to the snapshot dir and referenced in the metadata file for the uploaded RocksDB snapshot. Using a prefix to manage different key spaces (virtual column families) reduces such overhead.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. If `useColumnFamilies` is set to true in `StateStore.init()`, virtual column families will be used.
    
    ### How was this patch tested?
    
    Unit tests in `RocksDBStateStoreSuite`, and integration tests in `TransformWithStateSuite`.
    Moved test suites in `RocksDBSuite` into `RocksDBStateStoreSuite` because some previous verification functions are now moved into `RocksDBStateProvider`
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47107 from jingz-db/virtual-col-family.
    
    Lead-authored-by: jingz-db <[email protected]>
    Co-authored-by: Jing Zhan <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    jingz-db and jingz-db committed Jul 22, 2024
    9ed19c0
  14. [SPARK-48801][BUILD][K8S] Upgrade kubernetes-client to 6.13.1

    ### What changes were proposed in this pull request?
    The PR aims to upgrade `kubernetes-client` from `6.13.0` to `6.13.1`.
    
    ### Why are the changes needed?
    - The full release notes: https://github.com/fabric8io/kubernetes-client/releases/tag/v6.13.1
    - The newest version fixes some bugs, e.g.:
      Fix fabric8io/kubernetes-client#6059: Swallow rejected execution from internal usage of the informer executor
      Fix fabric8io/kubernetes-client#6068: KubernetesMockServer provides incomplete Configuration while creating test Config for KubernetesClient
      Fix fabric8io/kubernetes-client#6085: model getters have same annotations as fields (breaks native)
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47206 from panbingkun/SPARK-48801.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    ec6fadb
  15. [SPARK-48837][ML] In CountVectorizer, only read binary parameter once…

    … per transform, not once per row
    
    ### What changes were proposed in this pull request?
    
    apache#11536 added a `binary` toggle parameter to `CountVectorizer`, but the parameter evaluation occurs inside of the vectorizer UDF itself: this causes expensive parameter reading to occur once-per-row instead of once-per-transform.
    
    This PR addresses this issue by updating the code to only read the parameter once, similar to what was already being done for the `minTf` parameter.
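
    A simplified, hedged illustration of the pattern (not the actual CountVectorizer code): read the parameter once per transform and let the UDF close over the plain value.

    ```scala
    import org.apache.spark.sql.functions.udf

    def readBinaryParam(): Boolean = true   // stand-in for the expensive Params lookup, e.g. $(binary)

    val isBinary = readBinaryParam()        // evaluated once, outside the UDF
    val binarize = udf { (counts: Seq[Double]) =>
      if (isBinary) counts.map(c => if (c > 0) 1.0 else 0.0) else counts
    }
    ```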
    
    ### Why are the changes needed?
    
    Address a performance issue.
    
    I spotted this issue when I saw the stack
    
    ```scala
    [...]
    at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:204)
    at scala.collection.IndexedSeqOptimized.exists(IndexedSeqOptimized.scala:49)
    at scala.collection.IndexedSeqOptimized.exists$(IndexedSeqOptimized.scala:49)
    at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:198)
    at org.apache.spark.ml.param.Params.hasParam(params.scala:701)
    at org.apache.spark.ml.param.Params.hasParam$(params.scala:700)
    at org.apache.spark.ml.PipelineStage.hasParam(Pipeline.scala:42)
    at org.apache.spark.ml.param.Params.shouldOwn(params.scala:856)
    at org.apache.spark.ml.param.Params.get(params.scala:739)
    at org.apache.spark.ml.param.Params.get$(params.scala:738)
    at org.apache.spark.ml.PipelineStage.get(Pipeline.scala:42)
    at org.apache.spark.ml.param.Params.getOrDefault(params.scala:759)
    at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:757)
    at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
    at org.apache.spark.ml.param.Params.$(params.scala:766)
    at org.apache.spark.ml.param.Params.$$(params.scala:766)
    at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
    at org.apache.spark.ml.feature.CountVectorizerModel.$anonfun$transform$1(CountVectorizer.scala:326)
    at org.apache.spark.ml.feature.CountVectorizerModel$$Lambda$12153/1200761496.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Source)
    [...]
    ```
    
    while investigating an unrelated issue.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing unit tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47258 from JoshRosen/CountVectorizer-conf.
    
    Authored-by: Josh Rosen <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    JoshRosen authored and jingz-db committed Jul 22, 2024
    2f586c3
  16. [SPARK-48803][SQL] Throw internal error in Orc(De)serializer to align…

    … with ParquetWriteSupport
    
    ### What changes were proposed in this pull request?
    
    As a kind of follow-up to apache#44275, this PR aligns two similar code paths with different error messages into one.
    
    ```java
    24/07/03 16:29:01 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    org.apache.spark.SparkException: [INTERNAL_ERROR] Unsupported data type VarcharType(64). SQLSTATE: XX000
    	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
    	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
    ```
    
    ```java
    org.apache.spark.SparkUnsupportedOperationException: VarcharType(64) is not supported yet.
    	at org.apache.spark.sql.errors.QueryExecutionErrors$.dataTypeUnsupportedYetError(QueryExecutionErrors.scala:993)
    	at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.newConverter(OrcSerializer.scala:209)
    	at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.$anonfun$converters$2(OrcSerializer.scala:35)
    	at scala.collection.immutable.List.map(List.scala:247)
    ```
    
    ### Why are the changes needed?
    
    improvement
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, users shouldn't face such errors in regular cases.
    ### How was this patch tested?
    
    passing existing tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47208 from yaooqinn/SPARK-48803.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    07329e5
  17. [SPARK-48623][CORE] Structured logging migrations [Part 2]

    ### What changes were proposed in this pull request?
    This PR makes additional Scala logging migrations to comply with the scala style changes in apache#46947
    
    ### Why are the changes needed?
    This makes development and PR review of the structured logging migration easier.
    
    ### Does this PR introduce any user-facing change?
    No
    
    ### How was this patch tested?
    Tested by ensuring dev/scalastyle checks pass
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47256 from asl3/morestructuredloggingmigrations.
    
    Authored-by: Amanda Liu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    asl3 authored and jingz-db committed Jul 22, 2024
    7035755
  18. [SPARK-48804][SQL] Add classIsLoadable & OutputCommitter.isAssignable…

    …From check for output committer class configurations
    
    ### What changes were proposed in this pull request?
    
    This pull request proposes adding a check for the class values provided by users in `spark.sql.sources.outputCommitterClass` and `spark.sql.parquet.output.committer.class`, to make sure the given class is visible on the classpath and is a subclass of `org.apache.hadoop.mapreduce.OutputCommitter`.
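
    A hedged sketch of the validation idea (not the actual Spark code): fail fast when the configured committer class is missing from the classpath or has the wrong supertype.

    ```scala
    import org.apache.hadoop.mapreduce.OutputCommitter

    def validateCommitterClass(className: String): Unit = {
      val cls = Class.forName(className)   // throws ClassNotFoundException if not loadable
      require(classOf[OutputCommitter].isAssignableFrom(cls),
        s"$className is not a subclass of org.apache.hadoop.mapreduce.OutputCommitter")
    }
    ```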
    
    ### Why are the changes needed?
    
    Ensure that an invalid configuration results in immediate application or query failure rather than failing late during setupJob.
    ### Does this PR introduce _any_ user-facing change?
    
    no
    ### How was this patch tested?
    
    new tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47209 from yaooqinn/SPARK-48804.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    dc237b2
  19. [SPARK-48771][SQL] Speed up `LogicalPlanIntegrity.validateExprIdUniqu…

    …eness` for large query plans
    
    ### What changes were proposed in this pull request?
    
    This PR rewrites `LogicalPlanIntegrity.hasUniqueExprIdsForOutput` to only traverse the query plan once and avoids expensive Scala collections operations like `.flatten`, `.groupBy`, and `.distinct`.
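
    A generic illustration of the optimization pattern (not the Spark code itself): grouping to find duplicates builds several intermediate collections, while a single pass with a mutable set stops at the first duplicate.

    ```scala
    import scala.collection.mutable

    val ids = Seq(1L, 2L, 2L, 3L)

    // Multiple-pass approach with intermediate collections.
    val hasDuplicatesSlow = ids.groupBy(identity).exists(_._2.size > 1)

    // Single-pass approach with a mutable set.
    val seen = mutable.HashSet.empty[Long]
    val hasDuplicatesFast = ids.exists(id => !seen.add(id))
    ```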
    
    ### Why are the changes needed?
    
    Speeds up query compilation when plan validation is enabled.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Made sure existing UTs pass.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47170 from kelvinjian-db/SPARK-48771-speed-up.
    
    Authored-by: Kelvin Jiang <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    kelvinjian-db authored and jingz-db committed Jul 22, 2024
    154e350
  20. [SPARK-48816][SQL] Shorthand for interval converters in UnivocityParser

    ### What changes were proposed in this pull request?
    
    Directly call `IntervalUtils.castStringToDTInterval/castStringToYMInterval` instead of creating Cast expressions to evaluate.
    
    - Benchmarks indicated a 10% time-saving.
    - Bad record recording might not work if the cast handles the exceptions early
    
    ### Why are the changes needed?
    
    - perf improvement
    - Bugfix for bad record recording
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    passing existing tests and benchmark tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47227 from yaooqinn/SPARK-48816.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    f287c78
  21. [SPARK-48826][BUILD] Upgrade fasterxml.jackson to 2.17.2

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade `fasterxml.jackson` from 2.17.1 to 2.17.2.
    
    ### Why are the changes needed?
    
    There are some bug fixes about [Databind](https://github.com/FasterXML/jackson-databind):
    [apache#4561](FasterXML/jackson-databind#4561): Issues using jackson-databind 2.17.1 with Reactor (wrt DeserializerCache and ReentrantLock)
    [apache#4575](FasterXML/jackson-databind#4575): StdDelegatingSerializer does not consider a Converter that may return null for a non-null input
    [apache#4577](FasterXML/jackson-databind#4577): Cannot deserialize value of type java.math.BigDecimal from String "3." (not a valid representation)
    [apache#4595](FasterXML/jackson-databind#4595): No way to explicitly disable wrapping in custom annotation processor
    [apache#4607](FasterXML/jackson-databind#4607): MismatchedInput: No Object Id found for an instance of X to assign to property 'id'
    [apache#4610](FasterXML/jackson-databind#4610): DeserializationFeature.FAIL_ON_UNRESOLVED_OBJECT_IDS does not work when used with Polymorphic type handling
    
    The full release note of 2.17.2:
    https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.17.2
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47241 from wayneguow/upgrade_jackson.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    ca24fc0
  22. [SPARK-48840][INFRA] Remove unnecessary existence check for `./dev/fr…

    …ee_disk_space_container`
    
    ### What changes were proposed in this pull request?
    This PR removes the check for the existence of `./dev/free_disk_space_container` before execution, because `./dev/free_disk_space_container` has already been backported to branch-3.4 and branch-3.5 through apache#45624 and apache#43381, so there is no need to check its existence before execution.
    
    ### Why are the changes needed?
    Remove unnecessary existence check for `./dev/free_disk_space_container`.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Pass GitHub Actions
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47263 from LuciferYang/SPARK-48840.
    
    Authored-by: yangjie01 <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    LuciferYang authored and jingz-db committed Jul 22, 2024
    925673e
  23. [SPARK-48743][SQL][SS] MergingSessionIterator should better handle wh…

    …en getStruct returns null
    
    ### What changes were proposed in this pull request?
    
    The getStruct() method used in `MergingSessionsIterator.initialize` could return a null value. When that happens, the copy() called on it throws a NullPointerException.
    
    We see an exception thrown there:
    ```
    java.lang.NullPointerException: <Redacted Exception Message>
    	at org.apache.spark.sql.execution.aggregate.MergingSessionsIterator.initialize(MergingSessionsIterator.scala:121)
    	at org.apache.spark.sql.execution.aggregate.MergingSessionsIterator.<init>(MergingSessionsIterator.scala:130)
    	at org.apache.spark.sql.execution.aggregate.MergingSessionsExec.$anonfun$doExecute$1(MergingSessionsExec.scala:93)
    	at org.apache.spark.sql.execution.aggregate.MergingSessionsExec.$anonfun$doExecute$1$adapted(MergingSessionsExec.scala:72)
    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:920)
    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:920)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
    	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
    	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:189)
    	at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:154)
    	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:129)
    	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:148)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.scheduler.Task.run(Task.scala:101)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:984)
    	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:105)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:987)
    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:879)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:750)
    
    ```
    
    It is still not clear why that field could be null, but in general Spark should not throw NPEs. So this PR proposes to wrap it with SparkException.internalError with more details.
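
    A hedged sketch of the defensive pattern (not the actual MergingSessionsIterator code): surface a descriptive internal error instead of letting copy() throw a bare NullPointerException.

    ```scala
    import org.apache.spark.SparkException
    import org.apache.spark.sql.catalyst.InternalRow

    def copyGroupingKey(row: InternalRow, ordinal: Int, numFields: Int): InternalRow = {
      val struct = row.getStruct(ordinal, numFields)
      if (struct == null) {
        throw SparkException.internalError(
          s"Unexpected null struct at ordinal $ordinal while initializing the iterator")
      }
      struct.copy()
    }
    ```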
    
    ### Why are the changes needed?
    
    Improvements.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    This is a hard-to-repro issue. The change should not cause any harm.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47134 from WweiL/SPARK-48743-mergingSessionIterator-null-init.
    
    Authored-by: Wei Liu <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    WweiL authored and jingz-db committed Jul 22, 2024
    dfdda1c
  24. [MINOR][DOCS] Fix some typos in docs

    ### What changes were proposed in this pull request?
    The PR aims to fix some typos in the docs, including `docs/sql-ref-syntax-qry-star.md`, `docs/running-on-kubernetes.md` and `connector/profiler/README.md`.
    
    ### Why are the changes needed?
    https://spark.apache.org/docs/4.0.0-preview1/sql-ref-syntax-qry-star.html
    In some SQL examples in the doc `docs/sql-ref-syntax-qry-star.md`, the Unicode character 'SINGLE QUOTATION MARK' was used, which resulted in end-users being unable to execute them successfully after copy-paste, e.g.:
    <img width="660" alt="image" src="https://github.com/apache/spark/assets/15246973/055aa0a8-602e-4ea7-a065-c8e0353c6fb3">
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, end-users will get more user-friendly docs.
    
    ### How was this patch tested?
    Manually test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47261 from panbingkun/fix_typo_docs.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    28e4971
  25. [SPARK-48831][CONNECT] Make default column name of cast compatible …

    …with Spark Classic
    
    ### What changes were proposed in this pull request?
    
    I think there are two issues regarding the default column name of `cast`:
    1. It seems unclear when the name should be the input column name versus `CAST(...)`, e.g. in Spark Classic:
    ```
    scala> spark.range(1).select(col("id").cast("string"), lit(1).cast("string"), col("id").cast("long"), lit(1).cast("long")).printSchema
    warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
    root
     |-- id: string (nullable = false)
     |-- CAST(1 AS STRING): string (nullable = false)
     |-- id: long (nullable = false)
     |-- CAST(1 AS BIGINT): long (nullable = false)
    ```
    
    2. The column name is not consistent between Spark Connect and Spark Classic.
    
    This PR aims to resolve the second issue, that is, making the default column name of `cast` compatible with Spark Classic, by comparing with the classic implementation:
    https://github.com/apache/spark/blob/9cf6dc873ff34412df6256cdc7613eed40716570/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L1208-L1212
    
    ### Why are the changes needed?
    The default column name is not consistent with Spark Classic.
    
    ### Does this PR introduce _any_ user-facing change?
    yes,
    
    spark classic:
    ```
    In [2]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
    +-------------------------+-------------------+-------------------+-------------------+
    |CAST(X'313233' AS STRING)|CAST(123 AS STRING)|CAST(123 AS BIGINT)|CAST(123 AS DOUBLE)|
    +-------------------------+-------------------+-------------------+-------------------+
    |                      123|                123|                123|              123.0|
    +-------------------------+-------------------+-------------------+-------------------+
    ```
    
    spark connect (before):
    ```
    In [3]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
    +---------+---+---+-----+
    |X'313233'|123|123|  123|
    +---------+---+---+-----+
    |      123|123|123|123.0|
    +---------+---+---+-----+
    ```
    
    spark connect (after):
    ```
    In [2]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
    +-------------------------+-------------------+-------------------+-------------------+
    |CAST(X'313233' AS STRING)|CAST(123 AS STRING)|CAST(123 AS BIGINT)|CAST(123 AS DOUBLE)|
    +-------------------------+-------------------+-------------------+-------------------+
    |                      123|                123|                123|              123.0|
    +-------------------------+-------------------+-------------------+-------------------+
    ```
    
    ### How was this patch tested?
    added test
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47249 from zhengruifeng/py_fix_cast.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    5f51081
  26. [SPARK-48760][SQL][DOCS][FOLLOWUP] Add CLUSTER BY to doc `sql-ref-s…

    …yntax-ddl-alter-table.md`
    
    ### What changes were proposed in this pull request?
    This PR follows up apache#47156 and aims to:
    - add `CLUSTER BY` to doc `sql-ref-syntax-ddl-alter-table.md`
    - move parser tests from `o.a.s.s.c.p.DDLParserSuite` to `AlterTableClusterByParserSuite`
    - use `checkError` to check exception in `o.a.s.s.e.c.AlterTableClusterBySuiteBase`
    
    ### Why are the changes needed?
    - Enable the doc `sql-ref-syntax-ddl-alter-table.md` to cover new syntax `ALTER TABLE ... CLUSTER BY ...`.
    - Align with other similar tests, eg: AlterTableRename*
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. End-users can now look up the explanation of `CLUSTER BY` in the doc `sql-ref-syntax-ddl-alter-table.md`.
    
    ### How was this patch tested?
    Updated UT.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47254 from panbingkun/SPARK-48760_FOLLOWUP.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    2195af7
  27. [SPARK-46625] CTE with Identifier clause as reference

    ### What changes were proposed in this pull request?
    DECLARE agg = 'max';
    DECLARE col = 'c1';
    DECLARE tab = 'T';
    
    WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
          T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
    SELECT IDENTIFIER(agg)(IDENTIFIER(col)) FROM IDENTIFIER(tab);
    
    -- OR
    
    WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
          T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
    SELECT IDENTIFIER('max')(IDENTIFIER('c1')) FROM IDENTIFIER('T');
    
    Currently we don't support the IDENTIFIER clause as part of a CTE reference.
    
    ### Why are the changes needed?
    Adds support for the IDENTIFIER clause as part of a CTE reference, for both constant string expressions and session variables.
    
    ### Does this PR introduce _any_ user-facing change?
    It contains user-facing changes in the sense that an IDENTIFIER clause as a CTE reference will now be supported.
    
    ### How was this patch tested?
    Added tests as part of this PR.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47180 from nebojsa-db/SPARK-46625.
    
    Authored-by: Nebojsa Savic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    nebojsa-db authored and jingz-db committed Jul 22, 2024
    9bcff35
  28. [SPARK-48716] Add jobGroupId to SparkListenerSQLExecutionStart

    ### What changes were proposed in this pull request?
    Add jobGroupId to SparkListenerSQLExecutionStart
    
    ### Why are the changes needed?
    JobGroupId can be used to combine jobs within the same group. This is useful in listeners, as it makes job grouping easy to do.
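
    A hedged sketch of how a listener could consume the new field; the exact accessor name and type of the job group id on the event are assumptions based on this PR's title.

    ```scala
    import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
    import org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart

    class JobGroupingListener extends SparkListener {
      override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
        case e: SparkListenerSQLExecutionStart =>
          println(s"SQL execution ${e.executionId} started in job group ${e.jobGroupId}")
        case _ => // ignore other events
      }
    }
    ```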
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Unit Test
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47092 from gjxdxh/gjxdxh/SPARK-48716.
    
    Authored-by: Lingkai Kong <[email protected]>
    Signed-off-by: Josh Rosen <[email protected]>
    gjxdxh authored and jingz-db committed Jul 22, 2024
    1cb76c3
  29. [SPARK-44728][PYTHON][DOCS] Fix the incorrect naming and missing para…

    …ms in func docs in `builtin.py`
    
    ### What changes were proposed in this pull request?
    
    Fix the incorrect naming and missing params in func docs in `builtin.py`.
    
    ### Why are the changes needed?
    
    Some params' names in the PySpark docs are wrong, for example:
    ![image](https://github.com/apache/spark/assets/16032294/af0ca3c9-b085-4364-8cfc-814371f21b4b)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Passed GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47269 from wayneguow/py_docs.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    520e519
  30. [SPARK-48817][SQL] Eagerly execute union multi commands together

    ### What changes were proposed in this pull request?
    
    Eagerly execute union multi commands together.
    
    ### Why are the changes needed?
    Multi-insert is split into multiple SQL executions, resulting in no exchange reuse.
    
    Reproduce sql:
    
    ```
    create table wangzhen_t1(c1 int);
    create table wangzhen_t2(c1 int);
    create table wangzhen_t3(c1 int);
    insert into wangzhen_t1 values (1), (2), (3);
    
    from (select /*+ REPARTITION(3) */ c1 from wangzhen_t1)
    insert overwrite table wangzhen_t2 select c1
    insert overwrite table wangzhen_t3 select c1;
    ```
    
    In Spark 3.1, there is only one SQL execution and there is a reuse exchange.
    
    ![image](https://github.com/apache/spark/assets/17894939/5ff68392-aaa8-4e6b-8cac-1687880796b9)
    
    However, in Spark 3.5, it was split into multiple executions and there was no ReuseExchange.
    
    ![image](https://github.com/apache/spark/assets/17894939/afdb14b6-5007-4923-802d-535149974ecf)
    ![image](https://github.com/apache/spark/assets/17894939/0d60e8db-9da7-4906-8d07-2b622b55e6ab)
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, multi-inserts will be executed in one execution.
    
    ### How was this patch tested?
    
    added unit test
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47224 from wForget/SPARK-48817.
    
    Authored-by: wforget <[email protected]>
    Signed-off-by: youxiduo <[email protected]>
    wForget authored and jingz-db committed Jul 22, 2024
    bb75014
  31. [SPARK-48822][DOCS] Add examples section header to format_number do…

    …cstring
    
    ### What changes were proposed in this pull request?
    This PR adds an "Examples" section header to the `format_number` docstring.
    
    ### Why are the changes needed?
    To improve the documentation.
    
    ### Does this PR introduce any user-facing change?
    No changes in behavior are introduced.
    
    ### How was this patch tested?
    Existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47237 from thomhart31/docs-format_number.
    
    Lead-authored-by: thomas.hart <[email protected]>
    Co-authored-by: Thomas Hart <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    d458c46
  32. [SPARK-36680][SQL] Supports Dynamic Table Options for Spark SQL

    ### What changes were proposed in this pull request?
    In Spark SQL, add the 'WITH OPTIONS' syntax to support dynamic relation options.
    
    This is a continuation of apache#41683 based on cloud-fan's nice suggestion.
    That was itself a continuation of apache#34072.
    
    ### Why are the changes needed?
    
    This will allow Spark SQL to have equivalence to the DataFrameReader API. For example, it is possible today to specify options to data sources via the API as follows:
    
    ```
     spark.read.format("jdbc").option("fetchSize", 0).load()
    ```
    
    This PR allows an equivalent Spark SQL syntax to specify options:
    ```
    SELECT * FROM jdbcTable WITH OPTIONS(`fetchSize` = 0)
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Unit test in DataSourceV2SQLSuite
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#46707 from szehon-ho/spark-36680.
    
    Authored-by: Szehon Ho <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    szehon-ho authored and jingz-db committed Jul 22, 2024
    5330b88
  33. [SPARK-48772][SS][SQL] State Data Source Change Feed Reader Mode

    ### What changes were proposed in this pull request?

    This PR adds to the state data source the ability to show the evolution of state in Change Data Capture (CDC) format.

    An example usage:
    ```
    .format("statestore")
    .option("readChangeFeed", true)
    .option("changeStartBatchId", 5) #required
    .option("changeEndBatchId", 10)  #not required, default: latest batch Id available
    ```
    _Note that this mode does not support the option "joinSide"._

    ### Why are the changes needed?

    The current state reader can only return the entire state at a specific version. If an error occurs related to state, knowing how state changed across versions, to find out at which version it starts to go wrong, is important for debugging purposes.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Adds a new test suite `StateDataSourceChangeDataReadSuite` that includes 1) testing input error 2) testing new API added 3) integration test.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.
    
    Closes apache#47188 from eason-yuchen-liu/readStateChange.
    
    Lead-authored-by: Yuchen Liu <[email protected]>
    Co-authored-by: Yuchen Liu <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    00a5972
  34. [SPARK-48807][SQL] Binary Support for CSV datasource

    ### What changes were proposed in this pull request?
    
    SPARK-42237 disabled binary output for CSV because binary values used `java.lang.Object.toString` for output. Now that we have meaningful binary string representation support in UnivocityGenerator, we can support it.
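
    A hedged illustration (output path and data are made up): with this change, a binary column can be written to CSV using its string representation instead of failing.

    ```scala
    val df = spark.sql("SELECT CAST('spark' AS BINARY) AS payload")
    df.write.mode("overwrite").csv("/tmp/binary-csv-output")
    ```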
    
    ### Why are the changes needed?
    
    Improve CSV support for Spark SQL types.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, but only in that writing binary CSV tables changes from failing to succeeding.
    
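    A minimal sketch of the newly allowed behavior (the data and output path are made up, and an active SparkSession `spark` is assumed):

    ```python
    # Writing a DataFrame with a BinaryType column to CSV; before this change the
    # CSV writer rejected binary columns.
    df = spark.createDataFrame([(bytearray(b"spark"),)], "b binary")
    df.write.mode("overwrite").csv("/tmp/binary_csv_demo")  # path is illustrative
    ```
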
    ### How was this patch tested?
    
    new tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47212 from yaooqinn/SPARK-48807.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    091b99d View commit details
    Browse the repository at this point in the history
  35. [SPARK-48848][PYTHON][DOCS] Set the upper bound version of `sphinxcon…

    …trib-*` in `dev/requirements.txt` with `sphinx==4.5.0`
    
    ### What changes were proposed in this pull request?
    
    This PR aims to set the upper bound versions of `sphinxcontrib-*` in `dev/requirements.txt` to be compatible with `sphinx==4.5.0`.
    
    ### Why are the changes needed?
    
    Currently, if Spark developers use the command `pip install --upgrade -r dev/requirements.txt` directly to install Python-related dependencies, the automatically installed `sphinxcontrib-*` versions don't match `sphinx==4.5.0`. Refer to the issue: sphinx-doc/sphinx#11890.
    
    Then, when they execute the `make html` command to build the PySpark docs, the following error appears:
    <img width="1211" alt="image" src="https://github.com/apache/spark/assets/16032294/719c4b1d-9b7d-4ba9-89c5-ec3c0dc4572f">
    
    This problem has been avoided through pinning `sphinxcontrib-*` in workflows of Spark GA:
    ![image](https://github.com/apache/spark/assets/16032294/bf4906f1-a76d-47bd-af42-f263537f371c)
    
    So we can do the same by setting the upper bound versions in `dev/requirements.txt`, which will be helpful for Spark developers when building the PySpark docs.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47270 from wayneguow/py_require.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    ceb9dc5 View commit details
    Browse the repository at this point in the history
  36. [SPARK-48854][DOCS] Add missing options in CSV documentation

    ### What changes were proposed in this pull request?
    
    This PR added documents for missing CSV options, including `delimiter` as an alternative to `sep`, `charset` as an alternative to `encoding`, `codec` as an alternative to `compression`, and `timeZone`, excluding `columnPruning` which falls back to an internal SQL config.
    
    ### Why are the changes needed?
    
    improvement for user guide
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    doc build
    
    ![image](https://github.com/apache/spark/assets/8326978/d8ff888b-cafa-44e6-ab74-7bf69702a267)
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47278 from yaooqinn/SPARK-48854.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f3804c0 View commit details
    Browse the repository at this point in the history
  37. [SPARK-48843] Prevent infinite loop with BindParameters

    ### What changes were proposed in this pull request?
    
    In order to resolve the named parameters on the subtree, BindParameters recurses into the subtrees and tries to match the pattern with the named parameters. If there is no named parameter at the current level, the rule tries to return the unchanged plan. However, instead of returning the current plan object, the rule always returns the captured root plan node, leading to infinite recursion.
    
    ### Why are the changes needed?
    
    This fixes an infinite recursion when named parameters are combined with a global limit.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added unit tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47271 from nemanja-boric-databricks/fix-bind.
    
    Lead-authored-by: Nemanja Boric <[email protected]>
    Co-authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    06be01f View commit details
    Browse the repository at this point in the history
  38. [SPARK-48791][CORE] Fix perf regression caused by the accumulators re…

    …gistration overhead using CopyOnWriteArrayList
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to use the `ArrayBuffer` together with the read/write lock rather than `CopyOnWriteArrayList` for `TaskMetrics._externalAccums`.
    
    ### Why are the changes needed?
    
    Fix the perf regression caused by the accumulator registration overhead of `CopyOnWriteArrayList`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Manually tested.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47197 from Ngone51/SPARK-48791.
    
    Authored-by: Yi Wu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    Ngone51 authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    1691bbd View commit details
    Browse the repository at this point in the history
  39. [SPARK-48857][SQL] Restrict charsets in CSVOptions

    ### What changes were proposed in this pull request?
    
    SPARK-46115 and SPARK-46220 started the work of building a consistent charset list for Spark; this PR brings it to CSV options.
    
    ### Why are the changes needed?
    
    To make the charset list consistent across different platforms/JDKs
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, a `legacyCharsets` fallback is provided.
    
    ### How was this patch tested?
    
    new tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47280 from yaooqinn/SPARK-48857.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    7092147 View commit details
    Browse the repository at this point in the history
  40. [SPARK-48855][K8S][TESTS] Make ExecutorPodsAllocatorSuite independe…

    …nt from default allocation batch size
    
    ### What changes were proposed in this pull request?
    
    This PR aims to make `ExecutorPodsAllocatorSuite` independent from default allocation batch size.
    
    ### Why are the changes needed?
    
    To make the test assumptions explicit.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47279 from dongjoon-hyun/SPARK-48855.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    11fee3b View commit details
    Browse the repository at this point in the history
  41. [MINOR][DOCS] Add example to countDistinct

    ### What changes were proposed in this pull request?
    This PR adds an example to the `countDistinct` docstring demonstrating that `count_distinct` and `countDistinct` provide the same functionality.
    
    ### Why are the changes needed?
    To improve the documentation.
    
    ### Does this PR introduce any user-facing change?
    No changes in behavior are introduced.
    
    ### How was this patch tested?
    Existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47235 from thomhart31/docs-countDistinct.
    
    Lead-authored-by: thomas.hart <[email protected]>
    Co-authored-by: Thomas Hart <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    5b84f9a View commit details
    Browse the repository at this point in the history
  42. [SPARK-48823][DOCS] Improve clarity in lag docstring

    ### What changes were proposed in this pull request?
    This PR edits grammar in `pyspark.sql.functions.lag` docstring.
    
    ### Why are the changes needed?
    To improve the documentation.
    
    ### Does this PR introduce any user-facing change?
    No changes in behavior are introduced.
    
    ### How was this patch tested?
    Existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47236 from thomhart31/docs-lag.
    
    Authored-by: thomas.hart <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    thomash-dbx authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    ecff815 View commit details
    Browse the repository at this point in the history
  43. [SPARK-48844][SQL] USE INVALID_EMPTY_LOCATION instead of UNSUPPORTED_…

    …DATASOURCE_FOR_DIRECT_QUERY when path is empty
    
    ### What changes were proposed in this pull request?
    
    When running SQL on valid datasource files directly, if the given path is an empty string, we currently report UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY, which claims the datasource is invalid. The reason is that the `hadoop.Path` class cannot be constructed with empty strings, and we wrap the resulting `IAE` (IllegalArgumentException) with UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY.
    
    In this PR, we check the path up front to avoid this ambiguous error message.
    
    ### Why are the changes needed?
    
    Trivial bugfix. This error rarely occurs in REPL environments, but it can still happen when the query is built with string interpolation.
    
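    A hypothetical reproduction of the string-interpolation case, assuming an active SparkSession `spark`:

    ```python
    # An empty path sneaks in via string interpolation; after this change the error
    # class is INVALID_EMPTY_LOCATION instead of UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY.
    path = ""  # e.g. a variable that was accidentally left empty
    spark.sql(f"SELECT * FROM parquet.`{path}`")
    ```
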
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, different error class
    
    ### How was this patch tested?
    
    new tests
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47267 from yaooqinn/SPARK-48844.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    4d9e016 View commit details
    Browse the repository at this point in the history
  44. Revert "[SPARK-48823][DOCS] Improve clarity in lag docstring"

    This reverts commit 8ca1822.
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    dca78f1 View commit details
    Browse the repository at this point in the history
  45. [SPARK-48763][CONNECT][BUILD] Move connect server and common to built…

    …in module
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to move the connect server to builtin module.
    
    From:
    
    ```
    connector/connect/server
    connector/connect/common
    ```
    
    To:
    
    ```
    connect/server
    connect/common
    ```
    
    ### Why are the changes needed?
    
    So the end users do not have to specify `--packages` when they start the Spark Connect server. Spark Connect client remains as a separate module. This was also pointed out in apache#39928 (comment).
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, users don't have to specify `--packages` anymore.
    
    ### How was this patch tested?
    
    CI in this PR should verify them.
    Also manually tested several basic commands such as:
    
    - Maven build
    - SBT build
    - Running basic Scala client commands
       ```bash
       cd connector/connect
       bin/spark-connect
       bin/spark-connect-scala-client
       ```
    - Running basic PySpark client commands
    
       ```bash
       bin/pyspark --remote local
       ```
    - Connecting to the server launched by `./sbin/start-connect-server.sh`
    
       ```bash
       ./sbin/start-connect-server.sh
        bin/pyspark --remote "sc://localhost"
       ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47157 from HyukjinKwon/move-connect-server-builtin.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    7ac02e3 View commit details
    Browse the repository at this point in the history
  46. [SPARK-48860][TESTS] Update ui-test to use ws 8.18.0

    ### What changes were proposed in this pull request?
    
    This is a test dependency update to use `ws` 8.18.0.
    
    ### Why are the changes needed?
    
    Although the Apache Spark binary is not affected by this, this PR aims to resolve the alert below, which recommends `ws` versions 8.17.1+.
    
    - https://github.com/apache/spark/security/dependabot/95
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs with the new dependency.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47287 from dongjoon-hyun/SPARK-48860.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    6c16052 View commit details
    Browse the repository at this point in the history
  47. [SPARK-48862][PYTHON][CONNECT] Avoid calling _proto_to_string when …

    …INFO level is not enabled
    
    ### What changes were proposed in this pull request?
    
    Avoid calling `_proto_to_string` when INFO level is not enabled.
    
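    A minimal sketch of the guard pattern, with illustrative names rather than the actual PySpark internals:

    ```python
    import logging

    logger = logging.getLogger("pyspark.sql.connect")

    def proto_to_string_stub(plan) -> str:
        # stand-in for the costly text rendering of the proto plan
        return str(plan)

    def log_plan(plan) -> None:
        # only pay the rendering cost when the message will actually be emitted
        if logger.isEnabledFor(logging.INFO):
            logger.info("Executing plan:\n%s", proto_to_string_stub(plan))
    ```
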
    ### Why are the changes needed?
    
    We should avoid `_proto_to_string` as it takes a long time, and the result is not used if INFO level is not enabled.
    
    E.g.,
    
    ```py
    from functools import reduce
    
    df = createDataFrame()
    def project_schema(n=100):
        return reduce(lambda df, _: df.select(F.col("a"), F.col("b"), F.col("c"), F.col("d")), range(n), df).schema
    
    profile(project_schema)
    ```
    
    <img width="1104" alt="Screenshot 2024-07-10 at 17 24 18" src="https://github.com/apache/spark/assets/506656/66c2f50e-13b8-43f0-a46c-dcad4e7bfe89">
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    The existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47289 from ueshin/issues/SPARK-48862/logging.
    
    Authored-by: Takuya Ueshin <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    ueshin authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f5524e0 View commit details
    Browse the repository at this point in the history
  48. [SPARK-48459][CONNECT][PYTHON][FOLLOWUP] Ignore to_plan from with_origin

    ### What changes were proposed in this pull request?
    
    Ignores `connect.Column.to_plan` from `with_origin`.
    
    ### Why are the changes needed?
    
    Capturing the call site on `connect.Column.to_plan` takes a long time when creating proto plans if there are many `connect.Column` objects, even though the call sites on `connect.Column.to_plan` are not needed.
    
    E.g.,
    
    ```py
    from pyspark.sql import functions as F
    
    df = createDataFrame()
    def schema():
        return df.select(*([F.col("a"), F.col("b"), F.col("c"), F.col("d")] * 10)).schema
    
    profile(schema)
    ```
    
    <img width="1109" alt="Screenshot 2024-07-10 at 13 40 33" src="https://github.com/apache/spark/assets/506656/776978ce-bef9-47ef-b4a5-0d206683736d">
    
    The total function calls / duration for this is:
    
    - before
    
    ```
    28393570 function calls (28381720 primitive calls) in 3.450 seconds
    ```
    
    - after
    
    ```
    109970 function calls (98120 primitive calls) in 0.184 seconds
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    The existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47284 from ueshin/issues/SPARK-48459/query_context.
    
    Authored-by: Takuya Ueshin <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    ueshin authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    c88a8cd View commit details
    Browse the repository at this point in the history
  49. [SPARK-48763][FOLLOWUP] Make dev/lint-scala error message more accu…

    …rate
    
    ### What changes were proposed in this pull request?
    This PR follows up apache#47157 to make the `dev/lint-scala` error message more accurate.
    
    ### Why are the changes needed?
    After moving from `connector/connect/server` and `connector/connect/common` to `connect/server` and `connect/common`, the error message in `dev/lint-scala` should be updated accordingly.
    
    eg:
    <img width="709" alt="image" src="https://github.com/apache/spark/assets/15246973/d749e371-7621-4063-b512-279d0690d573">
    <img width="772" alt="image" src="https://github.com/apache/spark/assets/15246973/44b80571-bdb6-40cb-9571-8b34d009b5f8">
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    Manually test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47291 from panbingkun/SPARK-48763_FOLLOWUP.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    34707d8 View commit details
    Browse the repository at this point in the history
  50. [SPARK-48726][SS] Create the StateSchemaV3 file format, and write thi…

    …s out for the TransformWithStateExec operator
    
    ### What changes were proposed in this pull request?
    
    In this PR, we introduce the `StateSchemaV3` file that is used to keep track of a list of `ColumnFamilySchema` which we write from the `TransformWithState` operator. We collect the Column Family schemas from the driver, and write them out as a part of a planning rule.
    
    We will be introducing the OperatorStateMetadataV2 in the following PR: apache#47273
    This will integrate with the TransformWithState operator, and rely on the schema file.
    
    ### Why are the changes needed?
    
    These changes are needed to enable schema evolution for this operator in the future.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Added unit tests and ran existing unit tests
    ```
    [info] Run completed in 11 seconds, 673 milliseconds.
    [info] Total number of tests run: 4
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    [success] Total time: 43 s, completed Jun 26, 2024, 10:38:35 AM
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47104 from ericm-db/state-schema-tws.
    
    Lead-authored-by: Eric Marnadi <[email protected]>
    Co-authored-by: Eric Marnadi <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    7d9cab2 View commit details
    Browse the repository at this point in the history
  51. [SPARK-48529][SQL] Introduction of Labels in SQL Scripting

    ### What changes were proposed in this pull request?
    Previous [PR1](apache#46665) and [PR2](apache#46665) introduced parser and interpreter changes for SQL Scripting. This PR is a follow-up to introduce the concept of labels for the SQL Scripting language and proposes the following changes:
    
    - Changes the grammar to support labels at the start and end of compound statements (see the sketch below).
    - Updates visitor functions for compound nodes in the syntax tree in AstBuilder to check if labels are present and valid.
    
    More details can be found in [Jira item](https://issues.apache.org/jira/browse/SPARK-48529) for this task and its parent (where the design doc is uploaded as well).
    
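    A rough sketch of the labeled-compound syntax referenced above; the label style shown follows the common SQL/PSM convention (label before BEGIN, repeated after END), and the exact Spark grammar may differ:

    ```python
    # Illustrative only: a SQL script text with a labeled compound statement.
    # Executing scripts through sql() is future work per the description above.
    script = """
    lbl: BEGIN
      SELECT 1;
      SELECT 2;
    END lbl
    """
    print(script)
    ```
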
    ### Why are the changes needed?
    The intent is to add support for various SQL scripting concepts like loops, leave & iterate statements.
    
    ### Does this PR introduce any user-facing change?
    No.
    This PR is among first PRs in series of PRs that will introduce changes to sql() API to add support for SQL scripting, but for now, the API remains unchanged.
    In the future, the API will remain the same as well, but it will have new possibility to execute SQL scripts.
    
    ### How was this patch tested?
    There are tests for newly introduced parser changes:
    
    SqlScriptingParserSuite - unit tests for execution nodes.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47146 from miland-db/sql_batch_labels.
    
    Lead-authored-by: David Milicevic <[email protected]>
    Co-authored-by: Milan Dankovic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    66c71f0 View commit details
    Browse the repository at this point in the history
  52. [SPARK-48858][PYTHON] Remove deprecated setDaemon method call of `T…

    …hread` in `log_communication.py`
    
    ### What changes were proposed in this pull request?
    
    This PR aims to remove the deprecated `setDaemon` method call of `Thread` in `log_communication.py`. This is the last remaining usage.
    
    ### Why are the changes needed?
    
    Clean up deprecated APIs.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47282 from wayneguow/remove_py_dep.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    37db253 View commit details
    Browse the repository at this point in the history
  53. [SPARK-48280][SQL][FOLLOWUP] Improve collation testing surface area u…

    …sing expression walking
    
    ### What changes were proposed in this pull request?
    Followup: small correction.
    
    ### Why are the changes needed?
    UTF8_BINARY_LCASE no longer exists in Spark.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47216 from uros-db/fix-walker.
    
    Authored-by: Uros Bojanic <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    uros-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    48f5ae3 View commit details
    Browse the repository at this point in the history
  54. [SPARK-48851][SQL] Change the value of SCHEMA_NOT_FOUND from `names…

    …pace` to `catalog.namespace`
    
    ### What changes were proposed in this pull request?
    The pr aims to change the value of `SCHEMA_NOT_FOUND` from `namespace` to `catalog.namespace`.
    
    ### Why are the changes needed?
    As discussed in apache#47038 (comment), we should provide a friendlier and clearer error message.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Update existed UT & Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47276 from panbingkun/db_with_catalog.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b503837 View commit details
    Browse the repository at this point in the history
  55. [SPARK-48863][SQL] Fix ClassCastException when parsing JSON with "spa…

    …rk.sql.json.enablePartialResults" enabled
    
    ### What changes were proposed in this pull request?
    
    This PR fixes a bug in a corner case of JSON parsing when `spark.sql.json.enablePartialResults` is enabled.
    
    When running the following query with the config set to true:
    ```
    select from_json('{"a":"b","c":"d"}', 'array<struct<a:string, c:int>>')
    ```
    the code would fail with
    ```
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure:
    Lost task 0.0 in stage 4.0 (TID 4) (ip-10-110-51-101.us-west-2.compute.internal executor driver):
    java.lang.ClassCastException: class org.apache.spark.unsafe.types.UTF8String cannot be cast to class
    org.apache.spark.sql.catalyst.util.ArrayData (org.apache.spark.unsafe.types.UTF8String and
    org.apache.spark.sql.catalyst.util.ArrayData are in unnamed module of loader 'app')
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:53)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:53)
        at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:172)
        at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:831)
        at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:893)
    ```
    
    The patch fixes the issue by re-throwing PartialArrayDataResultException if parsing fails in this special case.
    
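    A hypothetical PySpark reproduction of the scenario, assuming an active SparkSession `spark` (the config name comes from the description above):

    ```python
    # Before the fix this raised ClassCastException; after it, the value is
    # partially parsed (e.g. Array([b, null])) instead of failing the task.
    spark.conf.set("spark.sql.json.enablePartialResults", "true")
    spark.sql(
        """select from_json('{"a":"b","c":"d"}', 'array<struct<a:string, c:int>>')"""
    ).show(truncate=False)
    ```
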
    ### Why are the changes needed?
    
    Fixes the bug that would prevent users from reading objects as arrays as introduced in SPARK-19595. This is more of a special case but it works with the flag off so it would be good to fix it when the flag is on.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, but it is a bug fix so it would not have worked without this patch overall.
    The parsing output will be different due to the partial results improvement:
    
    Previously, we would get `null` (when partial results are disabled). With this patch and partial results enabled, this will return `Array([b, null])`. This is not specific to this patch but rather to the partial results feature in general.
    
    ### How was this patch tested?
    
    I added a unit test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47292 from sadikovi/SPARK-48863.
    
    Authored-by: Ivan Sadikov <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    sadikovi authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    6d4d5ba View commit details
    Browse the repository at this point in the history
  56. [SPARK-48866][SQL] Fix hints of valid charset in the error message of…

    … INVALID_PARAMETER_VALUE.CHARSET
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the hints in the error message of INVALID_PARAMETER_VALUE.CHARSET. The current error message does not enumerate all valid charsets; e.g., UTF-32 is missing.
    
    This PR parameterizes it to fix this issue.
    
    ### Why are the changes needed?
    Bugfix; a hint with missing charsets is not helpful.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, the error message changes.
    
    ### How was this patch tested?
    modified tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47295 from yaooqinn/SPARK-48866.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    278f173 View commit details
    Browse the repository at this point in the history
  57. [SPARK-48773] Document config "spark.default.parallelism" by config b…

    …uilder framework
    
    ### What changes were proposed in this pull request?
    
    Document the config `spark.default.parallelism`. This config is used by Spark but was not declared through the config builder framework. It is already documented on the Spark website: https://spark.apache.org/docs/latest/configuration.html.
    
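    For reference, a small illustration of where this config takes effect (the values are illustrative):

    ```python
    # "spark.default.parallelism" controls the default number of partitions for
    # RDD operations when no partition count is given explicitly.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[4]")
        .config("spark.default.parallelism", "8")
        .getOrCreate()
    )
    print(spark.sparkContext.defaultParallelism)  # 8 with the config above
    ```
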
    ### Why are the changes needed?
    
    Document Spark's config.
    
    ### Does this PR introduce _any_ user-facing change?
    
    NO.
    
    ### How was this patch tested?
    
    N/A
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    N/A
    
    Closes apache#47171 from amaliujia/document_spark_default_paramllel.
    
    Authored-by: Rui Wang <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    amaliujia authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    2f96a37 View commit details
    Browse the repository at this point in the history
  58. [SPARK-48793][SQL][TESTS][S] Unify v1 and v2 ALTER TABLE .. DROP|RE…

    …NAME` COLUMN ...` tests
    
    ### What changes were proposed in this pull request?
    The pr aims to:
    - Move parser tests from `o.a.s.s.c.p.DDLParserSuite` and `o.a.s.s.c.p.ErrorParserSuite` to `AlterTableRenameColumnParserSuite` & `AlterTableDropColumnParserSuite`
    - Add a test for DSv2 ALTER TABLE .. `DROP|RENAME` to `v2.AlterTableDropColumnSuite` & `v2.AlterTableRenameColumnSuite`
    
    (This PR includes the unification of two commands: `DROP COLUMN` & `RENAME COLUMN`)
    
    ### Why are the changes needed?
    - To improve test coverage.
    - Align with other similar tests, eg: AlterTableRename*
    
    ### Does this PR introduce _any_ user-facing change?
    No, only tests.
    
    ### How was this patch tested?
    - Add new UT
    - Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47199 from panbingkun/alter_table_drop_column.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    3877b99 View commit details
    Browse the repository at this point in the history
  59. [SPARK-46738][PYTHON] Reenable a group of doctests

    ### What changes were proposed in this pull request?
    The `cast` issue has been resolved in apache#47249, so we can re-enable a group of doctests.
    
    ### Why are the changes needed?
    test coverage
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47302 from zhengruifeng/enable_more_doctest.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b73bbf3 View commit details
    Browse the repository at this point in the history
  60. [MINOR][SQL][TESTS] Remove a duplicate test case in CSVExprUtilsSuite

    ### What changes were proposed in this pull request?
    
    This PR aims to remove a duplicate test case in `CSVExprUtilsSuite`.
    
    ### Why are the changes needed?
    
    Clean duplicate code.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47298 from wayneguow/csv_suite.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    03aa592 View commit details
    Browse the repository at this point in the history
  61. [SPARK-48775][SQL][STS] Replace SQLContext with SparkSession in STS

    ### What changes were proposed in this pull request?
    
    Remove the exposed `SQLContext` which was added in SPARK-46575, and migrate STS-internal usages of `SQLContext` to `SparkSession`.
    
    ### Why are the changes needed?
    
    `SQLContext` has not been recommended since Spark 2.0; the suggested replacement is `SparkSession`. We should avoid exposing the deprecated class in the Developer API in new versions.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. It touches the Developer API added in SPARK-46575, but that has not been released yet.
    
    ### How was this patch tested?
    
    Pass GHA and `dev/mima` (no breaking changes involved).
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47176 from pan3793/SPARK-48775.
    
    Lead-authored-by: Cheng Pan <[email protected]>
    Co-authored-by: Cheng Pan <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    9b5c00b View commit details
    Browse the repository at this point in the history
  62. [SPARK-48623][CORE] Structured logging migrations [Part 3]

    ### What changes were proposed in this pull request?
    This PR makes additional Scala logging migrations to comply with the scala style changes in apache#46947
    
    ### Why are the changes needed?
    This makes development and PR review of the structured logging migration easier.
    
    ### Does this PR introduce any user-facing change?
    No
    
    ### How was this patch tested?
    Tested by ensuring dev/scalastyle checks pass
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47275 from asl3/formatstructuredlogmigrations.
    
    Lead-authored-by: Amanda Liu <[email protected]>
    Co-authored-by: Gengliang Wang <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    47e1d96 View commit details
    Browse the repository at this point in the history
  63. [SPARK-48850][DOCS][SS][SQL] Add documentation for new options added …

    …to State Data Source
    
    ### What changes were proposed in this pull request?
    
    In apache#46944 and apache#47188, we introduced some new options to the State Data Source. This PR aims to explain these new features in the documentation.
    
    ### Why are the changes needed?
    
    It is necessary to reflect the latest change in the documentation website.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    The API Doc website can be rendered correctly.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47274 from eason-yuchen-liu/snapshot-doc.
    
    Authored-by: Yuchen Liu <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    eason-yuchen-liu authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    16465db View commit details
    Browse the repository at this point in the history
  64. [SPARK-48717][FOLLOWUP][PYTHON][SS] Catch Cancelled Job Group wrapped…

    … by Py4JJavaError in StreamExecution
    
    ### What changes were proposed in this pull request?
    
    The previous commit apache@1581264 doesn't capture the situation when a job group is cancelled. This patches that situation.
    
    ### Why are the changes needed?
    
    Bug fix. Without this change, calling query.stop() would sometimes (when there is a Python foreachBatch function and this error is thrown) cause the query to appear as failed.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Added unit test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47307 from WweiL/SPARK-48717-job-cancel.
    
    Authored-by: Wei Liu <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    WweiL authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    c235bf5 View commit details
    Browse the repository at this point in the history
  65. [SPARK-48852][CONNECT] Fix string trim function in connect

    ### What changes were proposed in this pull request?
    
    Changed the order of arguments passed in the connect client's trim function call to match [`sql/core/src/main/scala/org/apache/spark/sql/functions.scala`](https://github.com/apache/spark/blob/f2dd0b3338a6937bbfbea6cd5fffb2bf9992a1f3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L4322)
    
    ### Why are the changes needed?
    
    This change fixes a correctness bug in Spark Connect where a query to trim characters `s` from a column would instead be replaced by a substring of `s`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Updated golden files for [`/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala`](https://github.com/apache/spark/blob/f2dd0b3338a6937bbfbea6cd5fffb2bf9992a1f3/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala#L1815) and added an additional test to verify correctness.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47277 from biruktesf-db/fix-trim-connect.
    
    Authored-by: Biruk Tesfaye <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    biruktesf-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    8115e6e View commit details
    Browse the repository at this point in the history
  66. [SPARK-48872][PYTHON] Reduce the overhead of _capture_call_site

    ### What changes were proposed in this pull request?
    
    Reduces the overhead of `inspect.stack` in `_capture_call_site` by inlining `inspect.stack` using a generator instead of a list.
    Also, specify `context=0` for `inspect.getframeinfo` to avoid unnecessary field retrievals.
    
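    A minimal sketch of the idea (illustrative, not the actual PySpark code):

    ```python
    # Walk the frame chain lazily instead of materializing inspect.stack(), and
    # pass context=0 so getframeinfo() does not read source lines for each frame.
    import inspect

    def iter_call_sites(max_depth: int = 10):
        frame = inspect.currentframe().f_back
        depth = 0
        while frame is not None and depth < max_depth:
            info = inspect.getframeinfo(frame, context=0)
            yield f"{info.filename}:{info.lineno}"
            frame = frame.f_back
            depth += 1
    ```
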
    ### Why are the changes needed?
    
    `_capture_call_site` has unavoidable overhead when `Column` operations happen a lot, but it can be reduced.
    
    E.g.,
    
    ```py
    from functools import reduce
    
    def alias_schema():
        return df.select(reduce(lambda x, y: x.alias(f"col_a_{y}"), range(20), F.col("a"))).schema
    profile(alias_schema)
    ```
    
    <img width="1106" alt="Screenshot 2024-07-11 at 15 24 31" src="https://github.com/user-attachments/assets/1c677f56-86be-4e8f-9dd2-45c4c2c167f3">
    
    The function calls and duration are:
    
    - before
    
    ```
            18013240 function calls (18012760 primitive calls) in 2.327 seconds
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    ...
          200    0.001    0.000    2.231    0.011 /.../python/pyspark/errors/utils.py:164(_capture_call_site)
    
    ```
    
    - after
    
    ```
             1421240 function calls (1420760 primitive calls) in 0.265 seconds
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    ...
          200    0.001    0.000    0.182    0.001 /.../python/pyspark/errors/utils.py:165(_capture_call_site)
    
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    The existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47308 from ueshin/issues/SPARK-48872/inspect_stack.
    
    Authored-by: Takuya Ueshin <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    ueshin authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f890152 View commit details
    Browse the repository at this point in the history
  67. [SPARK-46743][SQL][FOLLOW UP] Count bug after ScalarSubquery is folded…

    … if it has an empty relation
    
    ### What changes were proposed in this pull request?
    
    In PR apache#45125, we handled the case where an Aggregate is folded into a Project, causing a count bug. We missed cases where:
    1. The entire ScalarSubquery's plan is regarded as empty relation, and is folded completely.
    2. There are operations above the Aggregate in the subquery (such as filter and project).
    
    ### Why are the changes needed?
    
    This PR fixes that by adding the case handling in ConstantFolding and OptimizeSubqueries.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. There was a correctness error which happened when the scalar subquery was count-bug-susceptible and empty, and thus folded by `ConstantFolding`.
    
    ### How was this patch tested?
    
    Added SQL query tests in `scalar-subquery-count-bug.sql`.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47290 from andylam-db/decorr-bugs.
    
    Authored-by: Andy Lam <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    andylam-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    5bdb3f6 View commit details
    Browse the repository at this point in the history
  68. [SPARK-48841][SQL] Include collationName to sql() of Collate

    ### What changes were proposed in this pull request?
    In the PR, I propose to fix the `sql()` method of the `Collate` expression, and append the `collationName` clause.
    
    ### Why are the changes needed?
    To distinguish column names when the `collationName` argument is used by `collate`. Before the changes, columns might conflict like the example below, and that could confuse users:
    ```
    sql("CREATE TEMP VIEW tbl as (SELECT collate('A', 'UTF8_BINARY'), collate('A', 'UTF8_LCASE'))")
    ```
    - Before:
    ```
    [COLUMN_ALREADY_EXISTS] The column `collate(a)` already exists. Choose another name or rename the existing column. SQLSTATE: 42711
    org.apache.spark.sql.AnalysisException: [COLUMN_ALREADY_EXISTS] The column `collate(a)` already exists. Choose another name or rename the existing column. SQLSTATE: 42711
    	at org.apache.spark.sql.errors.QueryCompilationErrors$.columnAlreadyExistsError(QueryCompilationErrors.scala:2595)
    	at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:115)
    	at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:97)
    ```
    
    - After:
    ```
    describe extended tbl;
    +-----------------------+-------------------------+-------+
    |col_name               |data_type                |comment|
    +-----------------------+-------------------------+-------+
    |collate(A, UTF8_BINARY)|string                   |NULL   |
    |collate(A, UTF8_LCASE) |string collate UTF8_LCASE|NULL   |
    +-----------------------+-------------------------+-------+
    
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Should not.
    
    ### How was this patch tested?
    Update existed UT.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47265 from panbingkun/SPARK-48841.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    a856b47 View commit details
    Browse the repository at this point in the history
  69. [SPARK-46654][DOCS][FOLLOW-UP] Remove obsolete TODO item

    ### What changes were proposed in this pull request?
    Remove obsolete TODO item
    
    ### Why are the changes needed?
    The `Example 2` test has already been enabled.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47312 from zhengruifeng/simple_folloup.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    9316401 View commit details
    Browse the repository at this point in the history
  70. [SPARK-48760][SQL] Fix CatalogV2Util.applyClusterByChanges

    ### What changes were proposed in this pull request?
    
    apache#47156 introduced a bug in `CatalogV2Util.applyClusterByChanges`: it removes the existing `ClusterByTransform` first, regardless of whether there is a `ClusterBy` table change. This means any table change will remove the clustering columns from the table.
    
    This PR fixes the bug by removing the `ClusterByTransform` only when there is a `ClusterBy` table change.
    
    ### Why are the changes needed?
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    ### How was this patch tested?
    
    Amend existing test to catch this bug.
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47288 from zedtang/fix-apply-cluster-by-changes.
    
    Authored-by: Jiaheng Tang <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    zedtang authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    226758d View commit details
    Browse the repository at this point in the history
  71. [SPARK-48874][SQL][DOCKER][BUILD][TESTS] Upgrade MySQL docker image…

    … version to `9.0.0`
    
    ### What changes were proposed in this pull request?
    The PR aims to upgrade the `MySQL` docker image version from `8.4.0` to `9.0.0`.
    
    ### Why are the changes needed?
    After https://issues.apache.org/jira/browse/SPARK-48795, we have upgraded the `mysql jdbc driver` version to `9.0.0` for testing, so I propose that the corresponding `mysql server docker image` should also be upgraded to `9.0.0`
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47311 from panbingkun/mysql_image_9.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    1b63009 View commit details
    Browse the repository at this point in the history
  72. Configuration menu
    Copy the full SHA
    fd34224 View commit details
    Browse the repository at this point in the history
  73. Revert "Remove unused test jar (apache#47309)"

    This reverts commit b560e4e.
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    a4a20e0 View commit details
    Browse the repository at this point in the history
  74. [SPARK-48842][DOCS] Document non-determinism of max_by and min_by

    ### What changes were proposed in this pull request?
    Document non-determinism of max_by and min_by
    
    ### Why are the changes needed?
    I have been confused by this non-determinism twice; it looked like a correctness bug to me.
    So I think we need to document it.
    
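    A small illustration of the caveat being documented, assuming an active SparkSession `spark`:

    ```python
    # When several rows tie on the ordering column, max_by may return any of the
    # associated values, so the result is not deterministic across runs or plans.
    df = spark.createDataFrame([("a", 10), ("b", 10)], "name STRING, score INT")
    df.selectExpr("max_by(name, score)").show()  # may show either 'a' or 'b'
    ```
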
    ### Does this PR introduce _any_ user-facing change?
    doc change only
    
    ### How was this patch tested?
    ci
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47266 from zhengruifeng/py_doc_max_by.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    c57f87f View commit details
    Browse the repository at this point in the history
  75. [SPARK-48871] Fix INVALID_NON_DETERMINISTIC_EXPRESSIONS validation in…

    … CheckAnalysis
    
    ### What changes were proposed in this pull request?
    
    The PR adds a trait that logical plans can extend to implement a method deciding whether non-deterministic expressions are allowed for the operator, and checks this method in CheckAnalysis.
    
    ### Why are the changes needed?
    
    I encountered the `INVALID_NON_DETERMINISTIC_EXPRESSIONS` exception when attempting to use a non-deterministic UDF in my query. The non-deterministic expression can be safely allowed for my custom LogicalPlan, but it is rejected in the checkAnalysis phase. The CheckAnalysis rule is too strict, so reasonable use cases of non-deterministic expressions are also disallowed.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    The test case `"SPARK-48871: AllowsNonDeterministicExpression allow lists non-deterministic expressions"` is added.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47304 from zhipengmao-db/zhipengmao-db/SPARK-48871-check-analysis.
    
    Lead-authored-by: zhipeng.mao <[email protected]>
    Co-authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    e42ea5c View commit details
    Browse the repository at this point in the history
  76. [SPARK-48864][SQL][TESTS] Refactor HiveQuerySuite and fix bug

    ### What changes were proposed in this pull request?
    The PR aims to refactor `HiveQuerySuite` and fix bugs, including:
    - use `getWorkspaceFilePath` to enable `HiveQuerySuite` to run successfully in the IDE.
    - make the test `lookup hive UDF in another thread` independent, without relying on the previous UT `current_database with multiple sessions`.
    - enable two tests: `non-boolean conditions in a CaseWhen are illegal` and `Dynamic partition folder layout`.
    
    ### Why are the changes needed?
    - Run successfully in the `IDE`
      Before:
      <img width="1288" alt="image" src="https://github.com/apache/spark/assets/15246973/005fd49c-3edf-4e51-8223-097fd7a485bf">
    
      After:
      <img width="1276" alt="image" src="https://github.com/apache/spark/assets/15246973/caedec72-be0c-4bb5-bc06-26cceef8b4b8">
    
    - Make the UT `lookup hive UDF in another thread` independent
      When running only this test, it actually failed with the following error:
      <img width="1318" alt="image" src="https://github.com/apache/spark/assets/15246973/ef9c260f-8c0d-4821-8233-d4d7ae13802a">
    
      **Why?**
      Because the previous UT `current_database with multiple sessions` changed the current database and did not restore it after it finished running.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    - Manually test
    - Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47293 from panbingkun/refactor_HiveQuerySuite.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    6437433 View commit details
    Browse the repository at this point in the history
  77. [SPARK-48877][PYTHON][DOCS] Test the default column name of array fun…

    …ctions
    
    ### What changes were proposed in this pull request?
    Test the default column name of array functions
    
    ### Why are the changes needed?
    For test coverage: sometimes the default column name is a problem.
    
    ### Does this PR introduce _any_ user-facing change?
    doc changes
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47318 from zhengruifeng/py_avoid_alias_array_func.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    05f2614 View commit details
    Browse the repository at this point in the history
  78. [MINOR][TESTS] Remove unused test jar (udf_noA.jar)

    ### What changes were proposed in this pull request?
    
    This jar was added in apache#42069 but moved in apache#43735.
    
    ### Why are the changes needed?
    
    To clean up an unused jar.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing tests should cover this.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47315 from HyukjinKwon/minor-cleanup-jar-2.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Martin Grund <[email protected]>
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b5a78a4 View commit details
    Browse the repository at this point in the history
  79. [SPARK-48794][CONNECT] df.mergeInto support for Spark Connect (Scala …

    …and Python)
    
    ### What changes were proposed in this pull request?
    
    This PR introduces `df.mergeInto` support for Spark Connect Scala and Python clients.
    
    This work contains four components:
    
    1. New Protobuf messages: command `MergeIntoTableCommand` and expression `MergeAction`.
    2. Spark Connect planner change: translate proto messages into real `MergeIntoCommand`s.
    3. Connect Scala client: a `MergeIntoWriter` that allows users to build merges.
    4. Connect Python client: a `MergeIntoWriter` that allows users to build merges.
    
    Components 3 and 4 are independent of each other. They both depend on Component 1.
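
    A minimal sketch of the builder-style usage from the Scala client. The session `spark`, the target table name, and the builder method names (`whenMatched`, `whenNotMatched`, `merge`) are illustrative assumptions, not necessarily the exact final API:

    ```scala
    import org.apache.spark.sql.functions.col

    // Assumes an active SparkSession `spark` and an existing table `target` with an `id` column.
    val source = spark.range(10).toDF("id")
    source.mergeInto("target", col("target.id") === source.col("id"))
      .whenMatched().updateAll()      // update matched target rows with the source values
      .whenNotMatched().insertAll()   // insert source rows that have no match in the target
      .merge()                        // execute the merge
    ```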
    
    ### Why are the changes needed?
    
    We need to bring the functionality of Spark Connect on par with Classic.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, new DataFrame APIs are introduced.
    
    ### How was this patch tested?
    
    Added new tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#46960 from xupefei/merge-builder.
    
    Authored-by: Paddy Xu <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    xupefei authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    5337e52 View commit details
    Browse the repository at this point in the history
  80. [SPARK-48876][BUILD] Upgrade Guava used by the connect module to 33.2…

    ….1-jre
    
    ### What changes were proposed in this pull request?
    The PR aims to upgrade Guava used by the `connect` module to `33.2.1-jre`.
    
    ### Why are the changes needed?
    The new version brings some fixes and changes, as follows:
    - Changed InetAddress-String conversion methods to preserve the IPv6 scope ID if present. The scope ID can be necessary for IPv6-capable devices with multiple network interfaces.
    - Added HttpHeaders constants Ad-Auction-Allowed, Permissions-Policy-Report-Only, and Sec-GPC.
    - Fixed a potential NullPointerException in ImmutableMap.Builder on a rare code path.
    
    The full release notes:
    - https://github.com/google/guava/releases/tag/v33.2.0
    - https://github.com/google/guava/releases/tag/v33.2.1
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47296 from LuciferYang/connect-guava-33.2.1.
    
    Authored-by: yangjie01 <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    LuciferYang authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    5a03410 View commit details
    Browse the repository at this point in the history
  81. [SPARK-48878][PYTHON][DOCS] Add doctests for options in json functions

    ### What changes were proposed in this pull request?
    Add doctests for `options` in json functions
    
    ### Why are the changes needed?
    Test coverage: we never tested `options` in `from_json` and `to_json` before.

    Since there is a new underlying implementation in Spark Connect, we should test them explicitly.
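
    A small Scala-side sketch of the behaviour these doctests cover: passing reader options to `from_json` (the option name below is a standard JSON data source option; the DataFrame is illustrative):

    ```scala
    // Assumes an active SparkSession `spark`.
    import spark.implicits._
    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val schema = StructType(Seq(StructField("a", StringType)))
    val df = Seq("""{a: "hello"}""").toDF("json")

    // The unquoted field name fails to parse with the defaults; the option makes it parse.
    df.select(from_json(col("json"), schema, Map("allowUnquotedFieldNames" -> "true"))).show()
    ```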
    
    ### Does this PR introduce _any_ user-facing change?
    doc changes
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47319 from zhengruifeng/from_json_option.
    
    Lead-authored-by: Kent Yao <[email protected]>
    Co-authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    4927f63 View commit details
    Browse the repository at this point in the history
  82. [SPARK-48666][SQL] Do not push down filter if it contains PythonUDFs

    ### What changes were proposed in this pull request?
    
    This PR proposes to prevent pushing down filters that contain Python UDFs. It uses the same approach as apache#47033 (therefore the author is added as a co-author) but simplifies the change.
    
    Extracting filters to push down happens first
    
    https://github.com/apache/spark/blob/cbe6846c477bc8b6d94385ddd0097c4e97b05d41/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala#L46
    
    https://github.com/apache/spark/blob/cbe6846c477bc8b6d94385ddd0097c4e97b05d41/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L211
    
    https://github.com/apache/spark/blob/cbe6846c477bc8b6d94385ddd0097c4e97b05d41/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala#L51
    
    Before extracting Python UDFs
    
    https://github.com/apache/spark/blob/cbe6846c477bc8b6d94385ddd0097c4e97b05d41/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala#L80
    
    Here is the full stacktrace:
    
    ```
    [INTERNAL_ERROR] Cannot evaluate expression: pyUDF(cast(input[0, bigint, true] as string)) SQLSTATE: XX000
    org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: pyUDF(cast(input[0, bigint, true] as string)) SQLSTATE: XX000
    	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
    	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
    	at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotEvaluateExpressionError(QueryExecutionErrors.scala:65)
    	at org.apache.spark.sql.catalyst.expressions.FoldableUnevaluable.eval(Expression.scala:387)
    	at org.apache.spark.sql.catalyst.expressions.FoldableUnevaluable.eval$(Expression.scala:386)
    	at org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:72)
    	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:563)
    	at org.apache.spark.sql.catalyst.expressions.IsNotNull.eval(nullExpressions.scala:403)
    	at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:53)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1(ExternalCatalogUtils.scala:189)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1$adapted(ExternalCatalogUtils.scala:188)
    	at scala.collection.immutable.List.filter(List.scala:516)
    	at scala.collection.immutable.List.filter(List.scala:79)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:188)
    	at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.listPartitionsByFilter(InMemoryCatalog.scala:604)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
    	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:1358)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.listPartitionsByFilter(ExternalCatalogUtils.scala:168)
    	at org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:74)
    	at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:72)
    	at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:50)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470)
    	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:470)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:37)
    	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:330)
    	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:326)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:475)
    	at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1251)
    	at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1250)
    	at org.apache.spark.sql.catalyst.plans.logical.Join.mapChildren(basicLogicalOperators.scala:552)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:475)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:37)
    	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:330)
    	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:326)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:446)
    	at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:50)
    	at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:35)
    	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:226)
    	at scala.collection.LinearSeqOps.foldLeft(LinearSeq.scala:183)
    	at scala.collection.LinearSeqOps.foldLeft$(LinearSeq.scala:179)
    	at scala.collection.immutable.List.foldLeft(List.scala:79)
    	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:223)
    	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:215)
    	at scala.collection.immutable.List.foreach(List.scala:334)
    	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:215)
    	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:186)
    	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:89)
    	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:186)
    	at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:167)
    	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
    	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:234)
    	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:608)
    	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:234)
    	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)
    	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:233)
    	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:163)
    	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:159)
    	at org.apache.spark.sql.execution.python.PythonUDFSuite.$anonfun$new$19(PythonUDFSuite.scala:136)
    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
    	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
    	at org.apache.spark.sql.test.SQLTestUtilsBase.withTable(SQLTestUtils.scala:307)
    	at org.apache.spark.sql.test.SQLTestUtilsBase.withTable$(SQLTestUtils.scala:305)
    	at org.apache.spark.sql.execution.python.PythonUDFSuite.withTable(PythonUDFSuite.scala:25)
    	at org.apache.spark.sql.execution.python.PythonUDFSuite.$anonfun$new$18(PythonUDFSuite.scala:130)
    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
    	at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
    	at org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
    	at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
    	at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
    	at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
    	at org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
    	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
    	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
    	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    	at org.scalatest.Transformer.apply(Transformer.scala:22)
    	at org.scalatest.Transformer.apply(Transformer.scala:20)
    	at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
    	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227)
    	at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
    	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
    	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
    	at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
    	at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
    	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:69)
    	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
    	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
    	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:69)
    	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
    	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
    	at scala.collection.immutable.List.foreach(List.scala:334)
    	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
    	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
    	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
    	at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
    	at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
    	at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
    	at org.scalatest.Suite.run(Suite.scala:1114)
    	at org.scalatest.Suite.run$(Suite.scala:1096)
    	at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
    	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
    	at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
    	at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
    	at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
    	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:69)
    	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
    	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
    	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
    	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:69)
    	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:47)
    	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1321)
    	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1315)
    	at scala.collection.immutable.List.foreach(List.scala:334)
    	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1315)
    	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:992)
    	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:970)
    	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1481)
    	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:970)
    	at org.scalatest.tools.Runner$.run(Runner.scala:798)
    	at org.scalatest.tools.Runner.run(Runner.scala)
    	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2or3(ScalaTestRunner.java:43)
    	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:26)
    ```
    
    ### Why are the changes needed?
    
    In order for end users to use Python UDFs against partitioned columns.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, this fixes a bug: this PR allows using Python UDFs against partitioned columns.
    
    ### How was this patch tested?
    
    Unittest added.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47033
    
    Closes apache#47313 from HyukjinKwon/SPARK-48666.
    
    Lead-authored-by: Hyukjin Kwon <[email protected]>
    Co-authored-by: Wei Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    32e5f39 View commit details
    Browse the repository at this point in the history
  83. [SPARK-48845][SQL] GenericUDF catch exceptions from children

    ### What changes were proposed in this pull request?
    This PR tries to fix behaviour issues with GenericUDF that exist since 3.5.0. The problem arose from DeferredObject currently passing a value instead of a function, which prevents users from catching exceptions in GenericUDF, resulting in semantic differences.
    
    Here is an example case we encountered. Originally, the semantics were that udf_exception would throw an exception, while udf_catch_exception could catch the exception and return a null value. However, currently, any exception encountered by udf_exception will cause the program to fail.
    ```
    select udf_catch_exception(udf_exception(col1)) from table
    ```
    
    ### Why are the changes needed?
    Before Spark 3.5, we made GenericUDF's DeferredObject lazy and evaluated the children inside `function.evaluate(deferredObjects)`.
    Now, we run the children's code first; if an exception is thrown, we defer it into GenericUDF's DeferredObject.
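
    A standalone sketch of that idea (no Hive classes; `DeferredObject`, `defer`, and `catchException` here are simplified stand-ins for illustration):

    ```scala
    import scala.util.Try

    // Simplified stand-in for Hive's GenericUDF.DeferredObject.
    trait DeferredObject { def get(): Any }

    // Evaluate the child eagerly, but defer any exception into the DeferredObject so the
    // outer UDF can still catch it when it calls get(), as it could before Spark 3.5.
    def defer(child: => Any): DeferredObject = {
      val result = Try(child)          // run the child now, capturing success or failure
      new DeferredObject {
        def get(): Any = result.get    // rethrow the child's exception only on access
      }
    }

    // Mimics udf_catch_exception: turns a failure from its input into a null value.
    def catchException(input: DeferredObject): Any =
      Try(input.get()).getOrElse(null)

    println(catchException(defer(throw new RuntimeException("boom"))))  // prints null
    ```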
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Newly added UT.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47268 from jackylee-ch/generic_udf_catch_exception_from_child_func.
    
    Lead-authored-by: jackylee-ch <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b95d6bc View commit details
    Browse the repository at this point in the history
  84. [SPARK-47307][SQL] Add a config to optionally chunk base64 strings

    Follow-up of apache#45408.
    
    ### What changes were proposed in this pull request?
    [[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings
    
    ### Why are the changes needed?
    In apache#35110, it was incorrectly asserted that:
    
    > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt
    
    This is not true as the previous code called:
    
    ```java
    public static byte[] encodeBase64(byte[] binaryData)
    ```
    
    Which states:
    
    > Encodes binary data using the base64 algorithm but does not chunk the output.
    
    However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark-encoded base64 strings cannot be decoded by decoders that do not implement RFC 2045. The docs state RFC 4648.
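
    A small JDK illustration of the difference the new config toggles, using the standard `java.util.Base64` encoders (not Spark's internal code path):

    ```scala
    import java.util.Base64

    val data = Array.fill[Byte](100)(1)
    val rfc4648 = Base64.getEncoder.encodeToString(data)      // single line, no chunking
    val rfc2045 = Base64.getMimeEncoder.encodeToString(data)  // MIME: chunked into 76-char lines

    println(rfc4648.contains("\r\n"))  // false
    println(rfc2045.contains("\r\n"))  // true: RFC 2045 output contains CRLF line breaks
    ```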
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing test suite.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47303 from wForget/SPARK-47307.
    
    Lead-authored-by: Ted Jenks <[email protected]>
    Co-authored-by: wforget <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Co-authored-by: Ted Chester Jenks <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    3 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    bec6844 View commit details
    Browse the repository at this point in the history
  85. [SPARK-48510][2/2] Support UDAF toColumn API in Spark Connect

    ### What changes were proposed in this pull request?
    
    This PR follows apache#46245 to add support for the `udaf.toColumn` API in Spark Connect.
    
    Here we introduce a new Protobuf message, `proto.TypedAggregateExpression`, that includes a serialized UDF packet. On the server, we unpack it into an `Aggregator` object and generate a real `TypedAggregateExpression` instance with the encoder information passed along with the UDF.
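
    For reference, the public API that this change makes work over Connect is the existing `Aggregator.toColumn`; a minimal usage sketch (assumes an active SparkSession `spark`):

    ```scala
    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    object SumLong extends Aggregator[Long, Long, Long] {
      def zero: Long = 0L
      def reduce(buf: Long, in: Long): Long = buf + in
      def merge(b1: Long, b2: Long): Long = b1 + b2
      def finish(buf: Long): Long = buf
      def bufferEncoder: Encoder[Long] = Encoders.scalaLong
      def outputEncoder: Encoder[Long] = Encoders.scalaLong
    }

    import spark.implicits._
    val ds = spark.range(1, 5).as[Long]   // 1, 2, 3, 4
    ds.select(SumLong.toColumn).show()    // sums to 10
    ```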
    
    ### Why are the changes needed?
    
    Because the `toColumn` API was not supported in the previous PR.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, from now on users can create a typed UDAF using the `udaf.toColumn` API.
    
    ### How was this patch tested?
    
    New tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Nope.
    
    Closes apache#46849 from xupefei/connect-udaf-tocolumn.
    
    Authored-by: Paddy Xu <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    xupefei authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    bc3133c View commit details
    Browse the repository at this point in the history
  86. [SPARK-48883][ML][R] Replace RDD read / write API invocation with Dat…

    …aframe read / write API
    
    ### What changes were proposed in this pull request?
    
    Replace RDD read / write API invocation with Dataframe read / write API
    
    ### Why are the changes needed?
    
    In the Databricks runtime, the RDD read / write API has issues with certain storage types that require the account key, but the DataFrame read / write API works.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47328 from WeichenXu123/ml-df-writer-save-2.
    
    Authored-by: Weichen Xu <[email protected]>
    Signed-off-by: Weichen Xu <[email protected]>
    WeichenXu123 authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    8704625 View commit details
    Browse the repository at this point in the history
  87. [SPARK-48440][SQL] Fix StringTranslate behaviour for non-UTF8_BINARY …

    …collations
    
    ### What changes were proposed in this pull request?
    String searching in UTF8_LCASE now works at the character level, rather than at the byte level. For example: `translate("İ", "i")` now returns `"İ"`, because there exists no **single character** in `"İ"` whose lowercased version equals `"i"`. Note, however, that there _is_ a byte subsequence of `"İ"` whose lowercased UTF-8 form equals `"i"` (so the new behaviour is different from the old behaviour).
    
    Also, translation for ICU collations works by repeatedly translating the longest possible substring that matches a key in the dictionary (under the specified collation), starting from the left side of the input string, until the entire string is translated.
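
    An illustrative SQL snippet of the character-level behaviour (it assumes the `collate` expression and the `UTF8_LCASE` collation name from this line of work; results depend on the collation implementation):

    ```scala
    // Assumes an active SparkSession `spark`.
    // No single character of 'İ' lowercases to 'i', so nothing is replaced and 'İ' is returned.
    spark.sql("SELECT translate(collate('İ', 'UTF8_LCASE'), 'i', 'x')").show()
    ```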
    
    ### Why are the changes needed?
    Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_BINARY_LCASE (see example above).
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, behaviour of `translate` expression is changed for edge cases with one-to-many case mapping.
    
    ### How was this patch tested?
    New unit tests in `CollationStringExpressionsSuite`.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#46761 from uros-db/alter-translate.
    
    Authored-by: Uros Bojanic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    uros-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    d055894 View commit details
    Browse the repository at this point in the history
  88. [SPARK-47911][SQL][FOLLOWUP] Rename UTF8 to UTF-8 in spark.sql.binary…

    …OutputStyle
    
    ### What changes were proposed in this pull request?
    
    This is a follow-up for SPARK-47911 that renames UTF8 to UTF-8 in `spark.sql.binaryOutputStyle`, so that the name is consistent with `org.apache.spark.sql.catalyst.util.CharsetProvider.VALID_CHARSETS` and `java.nio.charset.StandardCharsets.UTF_8`.
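
    A hypothetical usage sketch after the rename (the conf is part of the unreleased 4.0 line; the value below simply uses the new charset-style spelling):

    ```scala
    // Assumes an active SparkSession `spark`.
    spark.conf.set("spark.sql.binaryOutputStyle", "UTF-8")
    spark.sql("SELECT CAST('abc' AS BINARY)").show()
    ```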
    
    ### Why are the changes needed?
    
    reduce cognitive cost for users
    
    ### Does this PR introduce _any_ user-facing change?
    no, unreleased feature
    
    ### How was this patch tested?
    existing tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47322 from yaooqinn/SPARK-47911-FF.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    16b616c View commit details
    Browse the repository at this point in the history
  89. [SPARK-48887][K8S] Enable `spark.kubernetes.executor.checkAllContaine…

    …rs` by default
    
    ### What changes were proposed in this pull request?
    
    This PR aims to enable `spark.kubernetes.executor.checkAllContainers` by default from Apache Spark 4.0.0.
    
    ### Why are the changes needed?
    
    Since Apache Spark 3.1.0, `spark.kubernetes.executor.checkAllContainers` has been supported and is useful because the [sidecar pattern](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/) is used in many cases. It also prevents user mistakes, where sidecar failures would otherwise be forgotten or ignored, by always reporting sidecar failures via the executor status.
    - apache#29924
    
    ### Does this PR introduce _any_ user-facing change?
    
    - This configuration is a no-op when there is no other container.
    - This will report user containers' errors correctly when other user-provided containers exist.
    
    ### How was this patch tested?
    
    Both `true` and `false` are covered by our CI test coverage since Apache Spark 3.1.0.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47337 from dongjoon-hyun/SPARK-48887.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    1b8d135 View commit details
    Browse the repository at this point in the history
  90. [SPARK-48495][SQL][DOCS] Describe shredding scheme for Variant

    ### What changes were proposed in this pull request?
    
    For the Variant data type, we plan to add support for columnar storage formats (e.g. Parquet) to write the data shredded across multiple physical columns, and read only the data required for a given query. This PR merges a document describing the approach we plan to take. We can continue to update it as the implementation progresses.
    
    ### Why are the changes needed?
    
    When implemented, this can allow much better performance when reading from columnar storage. More detail is given in the document.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    It is internal documentation; no testing should be needed.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#46831 from cashmand/SPARK-45891.
    
    Authored-by: cashmand <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    cashmand authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b5e4aec View commit details
    Browse the repository at this point in the history
  91. Revert "[SPARK-48883][ML][R] Replace RDD read / write API invocation …

    …with Dataframe read / write API"
    
    This reverts commit 0fa5787.
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    c404244 View commit details
    Browse the repository at this point in the history
  92. [SPARK-48895][R][INFRA] Use R 4.4.1 in windows R GitHub Action job

    ### What changes were proposed in this pull request?
    
    This PR aims to use R 4.4.1 in `windows` R GitHub Action job.
    
    ### Why are the changes needed?
    
    R 4.4.1 is the latest release, released on 2024-06-14.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47346 from dongjoon-hyun/SPARK-48895.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    2549087 View commit details
    Browse the repository at this point in the history
  93. [SPARK-48714][SPARK-48794][FOLLOW-UP][PYTHON][DOCS] Add mergeInto t…

    …o API reference
    
    ### What changes were proposed in this pull request?
    Add `mergeInto` to API reference
    
    ### Why are the changes needed?
    This feature was missing in the doc.
    
    ### Does this PR introduce _any_ user-facing change?
    yes, doc change
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47329 from zhengruifeng/py_doc_merge_into.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    6c41d75 View commit details
    Browse the repository at this point in the history
  94. [SPARK-48613][SQL] SPJ: Support auto-shuffle one side + less join key…

    …s than partition keys
    
    ### What changes were proposed in this pull request?
    
    This is the final planned SPJ scenario: auto-shuffle one side + fewer join keys than partition keys. Background:
    
    - Auto-shuffle works by creating a ShuffleExchange for the non-partitioned side, with a clone of the partitioned side's KeyGroupedPartitioning.
    - "Fewer join keys than partition keys" works by 'projecting' all partition values by join keys (i.e., keeping only partition columns that are join columns).  It makes a target KeyGroupedShuffleSpec with 'projected' partition values, and then pushes this down to BatchScanExec.  The BatchScanExec then 'groups' its projected partition values (except in the skew case, but that's a different story).
    
    This combination is hard because the SPJ planning calls are spread across several places in this scenario.  Given two sides, a non-partitioned side and a partitioned side, where the join keys are only a subset of the partition keys:
    
    1.  EnsureRequirements creates the target KeyGroupedShuffleSpec from the join's required distribution (i.e., using only the join keys, not all partition keys).
    2.  EnsureRequirements copies this to the non-partitioned side's KeyGroupedPartitioning (for the auto-shuffle case).
    3.  BatchScanExec groups the partitions (for the partitioned side), including by join keys (if they differ from partition keys).
    
    Take the example partition columns (id, name), and partition values: (1, "bob"), (2, "alice"), (2, "sam").
    Projection leaves us with (1, 2, 2), and the final grouped partition values are (1, 2) (see the standalone sketch after the fix description below).

    The problem is that the two sides of the join do not match at all times.  After steps 1 and 2, the partitioned side has the 'projected' partition values (1, 2, 2), and the non-partitioned side creates a matching KeyGroupedPartitioning (1, 2, 2) for the ShuffleExchange.  But in step 3, the BatchScanExec for the partitioned side groups the partitions to become (1, 2), while the non-partitioned side does not group and still retains (1, 2, 2) partitions.  This leads to the following assert error from the join:
    
    ```
    requirement failed: PartitioningCollection requires all of its partitionings have the same numPartitions.
    java.lang.IllegalArgumentException: requirement failed: PartitioningCollection requires all of its partitionings have the same numPartitions.
    	at scala.Predef$.require(Predef.scala:337)
    	at org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.<init>(partitioning.scala:550)
    	at org.apache.spark.sql.execution.joins.ShuffledJoin.outputPartitioning(ShuffledJoin.scala:49)
    	at org.apache.spark.sql.execution.joins.ShuffledJoin.outputPartitioning$(ShuffledJoin.scala:47)
    	at org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:39)
    	at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$ensureDistributionAndOrdering$1(EnsureRequirements.scala:66)
    	at scala.collection.immutable.Vector1.map(Vector.scala:2140)
    	at scala.collection.immutable.Vector1.map(Vector.scala:385)
    	at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:65)
    	at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:657)
    	at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:632)
    ```
    
    The fix is to do the de-duplication in the first pass:

    1. Push the join keys down to the BatchScanExec so that it returns a de-duped outputPartitioning (partitioned side).
    2. Create the non-partitioned side's KeyGroupedPartitioning with de-duped partition keys (non-partitioned side).
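
    The standalone sketch below (plain Scala collections, not Spark code) replays the example above: project the partition values down to the join keys, then de-duplicate so both sides agree on the number of partitions:

    ```scala
    val partitionCols = Seq("id", "name")
    val joinKeys      = Seq("id")
    val partitionVals = Seq(Seq[Any](1, "bob"), Seq[Any](2, "alice"), Seq[Any](2, "sam"))

    val keyIdx    = joinKeys.map(k => partitionCols.indexOf(k))        // Seq(0)
    val projected = partitionVals.map(row => keyIdx.map(i => row(i)))  // Seq(Seq(1), Seq(2), Seq(2))
    val grouped   = projected.distinct                                 // Seq(Seq(1), Seq(2))

    println(s"projected = $projected, grouped = $grouped")
    ```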
    
    ### Why are the changes needed?

    This is the last planned scenario for SPJ that is not yet supported.

    ### How was this patch tested?
    Updated an existing unit test in KeyGroupedPartitionSuite.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47064 from szehon-ho/spj_less_join_key_auto_shuffle.
    
    Authored-by: Szehon Ho <[email protected]>
    Signed-off-by: Chao Sun <[email protected]>
    szehon-ho authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b80ee03 View commit details
    Browse the repository at this point in the history
  95. [SPARK-48834][SQL] Disable variant input/output to python scalar UDFs…

    …, UDTFs, UDAFs during query compilation
    
    ### What changes were proposed in this pull request?
    
    Throws an exception if a variant is the input/output type to/from python UDF, UDAF, UDTF
    
    ### Why are the changes needed?
    
    Currently, variant input/output types to scalar UDFs will fail during execution or return a `net.razorvine.pickle.objects.ClassDictConstructor` to the user Python code. For a better UX, we should fail during query compilation instead, and block returning `ClassDictConstructor` to user code, as we one day want to actually return `VariantVal`s to the user code.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes - attempting to use variants in Python UDFs will now throw an exception rather than returning a `ClassDictConstructor` as before. However, we want to make this change now because we one day want to be able to return `VariantVal`s to the user code and do not want users relying on the current behavior.
    
    ### How was this patch tested?
    
    added UTs
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47253 from richardc-db/variant_scalar_udfs.
    
    Authored-by: Richard Chen <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    richardc-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    ca38071 View commit details
    Browse the repository at this point in the history
  96. [SPARK-48888][SS] Remove snapshot creation based on changelog ops size

    ### What changes were proposed in this pull request?
    Remove snapshot creation based on changelog ops size
    
    ### Why are the changes needed?
    The current mechanism to create a snapshot is based on the number of batches or the number of ops in the changelog. However, the latter is not configurable and might not correspond to large snapshot sizes in all cases, leading to variance in e2e latency. Hence, we remove this condition for now.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Augmented unit tests
    
    ```
    ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.RocksDBSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-8 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), ForkJoinPool.common...
    [info] Run completed in 5 minutes, 7 seconds.
    [info] Total number of tests run: 176
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 176, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    [success] Total time: 332 s (05:32), completed Jul 12, 2024, 2:46:44 PM
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47338 from anishshri-db/task/SPARK-48888.
    
    Authored-by: Anish Shrigondekar <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    anishshri-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    01886a0 View commit details
    Browse the repository at this point in the history
  97. [SPARK-48880][CORE] Avoid throw NullPointerException if driver plugin…

    … fails to initialize
    
    ### What changes were proposed in this pull request?
    
    This PR skips clearing the memoryStore if the memoryManager is null. This could happen if the driver plugin fails to initialize, since we initialize the MemoryManager after the DriverPlugin.
    
    ### Why are the changes needed?
    
    Before this change, it would throw:
    ```
    {"class":"java.lang.NullPointerException","msg":"Cannot invoke \"org.apache.spark.memory.MemoryManager.maxOnHeapStorageMemory()\" because \"this.memoryManager\" is null","stacktrace":[{"class":"org.apache.spark.storage.memory.MemoryStore","method":"maxMemory","file":"MemoryStore.scala","line":110},
    {"class":"org.apache.spark.storage.memory.MemoryStore","method":"<init>","file":"MemoryStore.scala","line":113},
    {"class":"org.apache.spark.storage.BlockManager","method":"memoryStore$lzycompute","file":"BlockManager.scala","line":234},
    {"class":"org.apache.spark.storage.BlockManager","method":"memoryStore","file":"BlockManager.scala","line":233},
    {"class":"org.apache.spark.storage.BlockManager","method":"stop","file":"BlockManager.scala","line":2167},
    {"class":"org.apache.spark.SparkEnv","method":"stop","file":"SparkEnv.scala","line":118},
    {"class":"org.apache.spark.SparkContext","method":"$anonfun$stop$25","file":"SparkContext.scala","line":2369},
    {"class":"org.apache.spark.util.Utils$","method":"tryLogNonFatalError","file":"Utils.scala","line":1299},
    {"class":"org.apache.spark.SparkContext","method":"stop","file":"SparkContext.scala","line":2369}
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    manually test
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47321 from ulysses-you/minor.
    
    Authored-by: ulysses-you <[email protected]>
    Signed-off-by: youxiduo <[email protected]>
    ulysses-you authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f52a87b View commit details
    Browse the repository at this point in the history
  98. [SPARK-48894][TESTS] Upgrade docker-java to 3.4.0

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade `docker-java` to 3.4.0.
    
    ### Why are the changes needed?
    
    There are some improvements, such as:
    
    - Enhancements
    
    Enable protocol configuration of SSLContext (docker-java/docker-java#2337)
    
    - Bug Fixes
    
    Consider already existing images as successful pulls (docker-java/docker-java#2335)
    
    Full release notes:
    https://github.com/docker-java/docker-java/releases/tag/3.4.0
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47344 from wayneguow/SPARK-48894.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f34ec9f View commit details
    Browse the repository at this point in the history
  99. [SPARK-48463][ML] Make StringIndexer supporting nested input columns

    ### What changes were proposed in this pull request?
    
    Make StringIndexer support nested input columns.
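
    A minimal usage sketch; the DataFrame layout and the nested column path are illustrative assumptions:

    ```scala
    import org.apache.spark.ml.feature.StringIndexer

    val indexer = new StringIndexer()
      .setInputCol("address.city")   // nested field path, now supported as an input column
      .setOutputCol("cityIndex")
    // indexer.fit(df).transform(df) // assumes df has a struct column `address` with a `city` field
    ```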
    
    ### Why are the changes needed?
    
    User demand.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes.
    
    ### How was this patch tested?
    
    Unit tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Closes apache#47283 from WeichenXu123/SPARK-48463.
    
    Lead-authored-by: Weichen Xu <[email protected]>
    Co-authored-by: WeichenXu <[email protected]>
    Signed-off-by: Weichen Xu <[email protected]>
    WeichenXu123 authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    e438890 View commit details
    Browse the repository at this point in the history
  100. [SPARK-48441][SQL] Fix StringTrim behaviour for non-UTF8_BINARY colla…

    …tions
    
    ### What changes were proposed in this pull request?
    String searching in UTF8_LCASE now works at the character level, rather than at the byte level. For example: `ltrim("İ", "i")` now returns `"İ"`, because there exist **no characters** in `"İ"`, starting from the left, whose lowercased versions equal `"i"`. Note, however, that there is a byte subsequence of `"İ"` such that the lowercased version of that UTF-8 byte sequence equals `"i"` (so the new behaviour is different from the old behaviour).

    Also, trimming for ICU collations works by repeatedly trimming the longest possible substring that matches a character in the trim string, starting from the left side of the input string, until trimming is done.
    
    ### Why are the changes needed?
    Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_LCASE (see example above).
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, behaviour of `trim*` expressions is changed for collated strings for edge cases with one-to-many case mapping.
    
    ### How was this patch tested?
    New unit tests in `CollationSupportSuite` and new e2e sql tests in `CollationStringExpressionsSuite`.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#46762 from uros-db/alter-trim.
    
    Authored-by: Uros Bojanic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    uros-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    424a9d8 View commit details
    Browse the repository at this point in the history
  101. [MINOR][SQL][TESTS] Fix compilation warning `adaptation of an empty a…

    …rgument list by inserting () is deprecated`
    
    ### What changes were proposed in this pull request?
    The PR aims to fix the compilation warning: `adaptation of an empty argument list by inserting () is deprecated`.
    
    ### Why are the changes needed?
    Fix compilation warning.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manually check.
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47350 from panbingkun/ParquetCommitterSuite_deprecated.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    c812281 View commit details
    Browse the repository at this point in the history
  102. [SPARK-48899][K8S] Fix ENV key value format in K8s Dockerfiles

    ### What changes were proposed in this pull request?
    
    This PR aims to fix `ENV` key value format in K8s Dockerfiles.
    
    ### Why are the changes needed?
    
    To follow the Docker guidelines and fix the following legacy format.
    - https://docs.docker.com/reference/build-checks/legacy-key-value-format/
    ```
    - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47357 from dongjoon-hyun/SPARK-48899.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    1f1a2f6 View commit details
    Browse the repository at this point in the history
  103. [SPARK-48886][SS] Add version info to changelog v2 to allow for easie…

    …r evolution
    
    ### What changes were proposed in this pull request?
    Add version info to changelog v2 to allow for easier evolution
    
    ### Why are the changes needed?
    Currently the changelog file format does not add the version info. With format v2, we propose to add this to the changelog file itself to make future evolution easier.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Augmented unit tests
    ```
    ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.RocksDBSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-8 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), ForkJoinPool.common...
    [info] Run completed in 4 minutes, 23 seconds.
    [info] Total number of tests run: 176
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 176, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47336 from anishshri-db/task/SPARK-48886.
    
    Authored-by: Anish Shrigondekar <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    anishshri-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    9847ceb View commit details
    Browse the repository at this point in the history
  104. [SPARK-48350][SQL] Introduction of Custom Exceptions for Sql Scripting

    ### What changes were proposed in this pull request?
    Previous PRs introduced basic changes for SQL Scripting. This PR is a follow-up that introduces custom exceptions that can arise while using the SQL Scripting language.
    
    ### Why are the changes needed?
    The intent is to add precise errors for various SQL scripting concepts.
    
    ### Does this PR introduce any user-facing change?
    Users will now see specific SQL Scripting language errors.
    
    ### How was this patch tested?
    There are tests for newly introduced parser changes:
    
    SqlScriptingParserSuite - unit tests for execution nodes.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47147 from miland-db/sql_batch_custom_errors.
    
    Lead-authored-by: Milan Dankovic <[email protected]>
    Co-authored-by: David Milicevic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    1a365b5 View commit details
    Browse the repository at this point in the history
  105. [SPARK-45155][CONNECT] Add API Docs for Spark Connect JVM/Scala Client

    This PR is based on apache#42911.
    
    ### What changes were proposed in this pull request?
    
    - Enables Scala and Java Unidoc generation for the `connectClient` project.
    - Generates docs and moves them to the `docs/api/connect` folder.
    
    Some methods' documentation in the connect directory had to be modified to remove references to avoid javadoc generation failures. **References API docs in the main index page and the global floating header will be added in a later PR.**
    
    ### Why are the changes needed?
    
    Increasing scope of documentation for the Spark Connect JVM/Scala Client project.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Nope.
    
    ### How was this patch tested?
    
    Manual test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47332 from xupefei/connnect-doc-web.
    
    Authored-by: Paddy Xu <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    xupefei authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    9a42009 View commit details
    Browse the repository at this point in the history
  106. [SPARK-47172][DOCS][FOLLOWUP] Fix spark.network.crypto.cipher since ve…

    …rsion field on security page
    
    ### What changes were proposed in this pull request?
    
    Given that SPARK-47172 was an improvement but got merged into 3.4/3.5, we need to fix the since version to eliminate misunderstandings.
    
    ### Why are the changes needed?
    
    doc fix
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    doc build
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47353 from yaooqinn/SPARK-47172.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    edeefaa View commit details
    Browse the repository at this point in the history
  107. [SPARK-48885][SQL] Make some subclasses of RuntimeReplaceable overrid…

    …e replacement to lazy val
    
    ### What changes were proposed in this pull request?
    
    This PR makes 8 subclasses of RuntimeReplaceable override `replacement` as a lazy val, to align with the other 60+ members and to avoid re-creating replacement expressions on every access.
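
    A standalone illustration (not Spark code) of why `lazy val` is preferred over `def` here:

    ```scala
    class WithDef  { def replacement: AnyRef = new Object }       // builds a new instance per call
    class WithLazy { lazy val replacement: AnyRef = new Object }  // builds once, then caches it

    val d = new WithDef
    val l = new WithLazy
    println(d.replacement eq d.replacement)  // false: recreated on every access
    println(l.replacement eq l.replacement)  // true: computed once and reused
    ```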
    
    ```scala
    Value read  (51 usages found)
                spark-catalyst_2.13  (50 usages found)
                    AnyValue.scala  (1 usage found)
                        54 override lazy val replacement: Expression = First(child, ignoreNulls)
                    arithmetic.scala  (1 usage found)
                        127 override lazy val replacement: Expression = child
                    bitmapExpressions.scala  (3 usages found)
                        52 override lazy val replacement: Expression = StaticInvoke(
                        85 override lazy val replacement: Expression = StaticInvoke(
                        134 override lazy val replacement: Expression = StaticInvoke(
                    boolAggregates.scala  (2 usages found)
                        39 override lazy val replacement: Expression = Min(child)
                        61 override lazy val replacement: Expression = Max(child)
                    collationExpressions.scala  (1 usage found)
                        123 override def replacement: Expression = {
                    collectionOperations.scala  (5 usages found)
                        168 override lazy val replacement: Expression = Size(child, legacySizeOfNull = false)
                        231 override lazy val replacement: Expression = ArrayContains(MapKeys(left), right)
                        1596 override lazy val replacement: Expression = new ArrayInsert(left, Literal(1), right)
                        1631 override lazy val replacement: Expression = new ArrayInsert(left, Literal(-1), right)
                        5203 override lazy val replacement: Expression = ArrayFilter(child, lambda)
                    CountIf.scala  (1 usage found)
                        42 override lazy val replacement: Expression = Count(new NullIf(child, Literal.FalseLiteral))
                    datetimeExpressions.scala  (2 usages found)
                        2070 override lazy val replacement: Expression = format.map { f =>
                        2145 override lazy val replacement: Expression = format.map { f =>
                    linearRegression.scala  (5 usages found)
                        45 override lazy val replacement: Expression = Count(Seq(left, right))
                        79 override lazy val replacement: Expression =
                        114 override lazy val replacement: Expression =
                        176 override lazy val replacement: Expression =
                        232 override lazy val replacement: Expression =
                    misc.scala  (3 usages found)
                        294 override lazy val replacement: Expression = StaticInvoke(
                        397 override lazy val replacement: Expression = StaticInvoke(
                        475 override lazy val replacement: Expression = StaticInvoke(
                    percentiles.scala  (2 usages found)
                        346 override def replacement: Expression = percentile
                        365 override def replacement: Expression = percentile
                    regexpExpressions.scala  (3 usages found)
                        262 override lazy val replacement: Expression = Like(Lower(left), Lower(right), escapeChar)
                        1034 override lazy val replacement: Expression =
                        1072 override lazy val replacement: Expression =
                    stringExpressions.scala  (14 usages found)
                        561 override lazy val replacement =
                        723 override lazy val replacement: Expression = Invoke(input, "isValid", BooleanType)
                        770 override lazy val replacement: Expression = Invoke(input, "makeValid", input.dataType)
                        810 override lazy val replacement: Expression = StaticInvoke(
                        859 override lazy val replacement: Expression = StaticInvoke(
                        1854 override lazy val replacement: Expression = StaticInvoke(
                        2246 override lazy val replacement: Expression = If(
                        2284 override lazy val replacement: Expression = Substring(str, Literal(1), len)
                        2713 override def replacement: Expression = StaticInvoke(
                        2940 override def replacement: Expression = StaticInvoke(
                        3004 override val replacement: Expression = StaticInvoke(
                        3075 override lazy val replacement: Expression = if (fmt == null) {
                        3473 override lazy val replacement: Expression =
                        3533 override lazy val replacement: Expression = StaticInvoke(
                    toFromAvroSqlFunctions.scala  (2 usages found)
                        96 override def replacement: Expression = {
                        168 override def replacement: Expression = {
                    urlExpressions.scala  (2 usages found)
                        55 override def replacement: Expression =
                        92 override def replacement: Expression =
                    variantExpressions.scala  (3 usages found)
                        58 override lazy val replacement: Expression = StaticInvoke(
                        100 override lazy val replacement: Expression = StaticInvoke(
                        635 override lazy val replacement: Expression = StaticInvoke(
                spark-examples_2.13  (1 usage found)
                    AgeExample.scala  (1 usage found)
                        27 override lazy val replacement: Expression = SubtractDates(CurrentDate(), birthday)
    ```
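    
    For context, here is a minimal, self-contained sketch of why a `lazy val` helps here; it uses hypothetical stand-in types rather than Spark's actual `RuntimeReplaceable`/`Expression` classes.
    
    ```scala
    // Simplified sketch (hypothetical types, not Spark's real trait):
    // a `def` rebuilds the replacement on every access, a `lazy val` builds it once and caches it.
    object LazyReplacementSketch {
      trait Replaceable {
        def replacement: String // stand-in for Expression
      }
    
      class WithDef extends Replaceable {
        // re-evaluated (and re-allocated) on every call
        override def replacement: String = { println("building replacement"); "expr" }
      }
    
      class WithLazyVal extends Replaceable {
        // evaluated once, cached afterwards
        override lazy val replacement: String = { println("building replacement"); "expr" }
      }
    
      def main(args: Array[String]): Unit = {
        val d = new WithDef
        d.replacement; d.replacement // prints twice
        val l = new WithLazyVal
        l.replacement; l.replacement // prints once
      }
    }
    ```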
    
    ### Why are the changes needed?
    
    Improve RuntimeReplaceable implementations
    
    ### Does this PR introduce _any_ user-facing change?
    NO
    
    ### How was this patch tested?
    existing tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47333 from yaooqinn/SPARK-48885.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b751eed View commit details
    Browse the repository at this point in the history
  108. [SPARK-48884][PYTHON] Remove unused helper function `PythonSQLUtils.m…

    …akeInterval`
    
    ### What changes were proposed in this pull request?
    Remove unused helper function `PythonSQLUtils.makeInterval`
    
    ### Why are the changes needed?
    As a followup cleanup of apache@bd14d64
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    NO
    
    Closes apache#47330 from zhengruifeng/py_sql_utils_cleanup.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    0afb112 View commit details
    Browse the repository at this point in the history
  109. [SPARK-45190][SPARK-48897][PYTHON][CONNECT] Make from_xml support S…

    …tructType schema
    
    ### What changes were proposed in this pull request?
    Make `from_xml` support StructType schema
    
    ### Why are the changes needed?
    StructType schema was supported in Spark Classic, but not in Spark Connect
    
    to address apache#43680 (comment)
    
    ### Does this PR introduce _any_ user-facing change?
    
    before:
    ```
    from pyspark.sql.types import StructType, LongType
    import pyspark.sql.functions as sf
    data = [(1, '''<p><a>1</a></p>''')]
    df = spark.createDataFrame(data, ("key", "value"))
    
    schema = StructType().add("a", LongType())
    df.select(sf.from_xml(df.value, schema)).show()
    
    ---------------------------------------------------------------------------
    AnalysisException                         Traceback (most recent call last)
    Cell In[1], line 7
    ...
    AnalysisException: [PARSE_SYNTAX_ERROR] Syntax error at or near '{'. SQLSTATE: 42601
    
    JVM stacktrace:
    org.apache.spark.sql.AnalysisException
    	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:278)
    	at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:98)
    	at org.apache.spark.sql.catalyst.parser.AbstractParser.parseDataType(parsers.scala:40)
    	at org.apache.spark.sql.types.DataType$.$anonfun$fromDDL$1(DataType.scala:126)
    	at org.apache.spark.sql.types.DataType$.parseTypeWithFallback(DataType.scala:145)
    	at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:127)
    ```
    
    after:
    ```
    +---------------+
    |from_xml(value)|
    +---------------+
    |            {1}|
    +---------------+
    
    ```
    
    ### How was this patch tested?
    added doctest and enabled unit tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47355 from zhengruifeng/from_xml_struct.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    7122c0b View commit details
    Browse the repository at this point in the history
  110. [SPARK-48846][PYTHON][DOCS][FOLLOWUP] Add a missing param doc in pyth…

    …on api `partitioning` functions docs
    
    ### What changes were proposed in this pull request?
    
    Add a missing param in func docs of `partitioning.py`.
    
    ### Why are the changes needed?
    
    - Make python api docs better.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA and docs check.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47345 from wayneguow/py_f_docs.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    ac58556 View commit details
    Browse the repository at this point in the history
  111. [SPARK-48902][BUILD] Upgrade commons-codec to 1.17.1

    ### What changes were proposed in this pull request?
    The pr aims to upgrade `commons-codec` from `1.17.0` to `1.17.1`.
    
    ### Why are the changes needed?
    The full release notes: https://commons.apache.org/proper/commons-codec/changes-report.html#a1.17.1
    This version fixes some bugs from the previous version, e.g.:
    - Md5Crypt now throws IllegalArgumentException on an invalid prefix
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47362 from panbingkun/SPARK-48902.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    09a66c7 View commit details
    Browse the repository at this point in the history
  112. [SPARK-48873][SQL] Use UnsafeRow in JSON parser

    ### What changes were proposed in this pull request?
    
    It uses `UnsafeRow` to represent struct results in the JSON parser. It saves memory compared to the current `GenericInternalRow`. The change is guarded by a flag and disabled by default.
    
    The benchmark shows that enabling the flag brings ~10% slowdown. This is basically expected because converting to `UnsafeRow` requires some work. The purpose of the PR is to provide an alternative to save memory.
    
    I did the following experiment. It generates a big `.gz` JSON file containing a single large array. Each array element is a struct with 50 string fields and will be parsed into a row by the JSON reader.
    
    ```
    s = b'{"field00":null,"field01":"field01_<v>","field02":"field02_<v>","field03":"field03_<v>","field04":"field04_<v>","field05":"field05_<v>","field06":"field06_<v>","field07":"field07_<v>","field08":"field08_<v>","field09":"field09_<v>","field10":null,"field11":"field11_<v>","field12":"field12_<v>","field13":"field13_<v>","field14":"field14_<v>","field15":"field15_<v>","field16":"field16_<v>","field17":"field17_<v>","field18":"field18_<v>","field19":"field19_<v>","field20":null,"field21":"field21_<v>","field22":"field22_<v>","field23":"field23_<v>","field24":"field24_<v>","field25":"field25_<v>","field26":"field26_<v>","field27":"field27_<v>","field28":"field28_<v>","field29":"field29_<v>","field30":null,"field31":"field31_<v>","field32":"field32_<v>","field33":"field33_<v>","field34":"field34_<v>","field35":"field35_<v>","field36":"field36_<v>","field37":"field37_<v>","field38":"field38_<v>","field39":"field39_<v>","field40":null,"field41":"field41_<v>","field42":"field42_<v>","field43":"field43_<v>","field44":"field44_<v>","field45":"field45_<v>","field46":"field46_<v>","field47":"field47_<v>","field48":"field48_<v>","field49":"field49_<v>"}'
    
    import gzip
    
    def write(n):
      with gzip.open(f'json{n}.gz', 'w') as f:
        f.write(b'[')
        for i in range(n):
            if i != 0:
                f.write(b',')
            f.write(s.replace(b'<v>', str(i).encode('ascii')))
        f.write(b']')
    
    write(100000)
    ```
    
    Then it processes the file in Spark shell with the following command:
    
    ```
    ./bin/spark-shell --conf spark.driver.memory=1g --conf spark.executor.memory=1g  --master "local[1]"
    
    > val schema = "field00 string, field01 string, field02 string, field03 string, field04 string, field05 string, field06 string, field07 string, field08 string, field09 string, field10 string, field11 string, field12 string, field13 string, field14 string, field15 string, field16 string, field17 string, field18 string, field19 string, field20 string, field21 string, field22 string, field23 string, field24 string, field25 string, field26 string, field27 string, field28 string, field29 string, field30 string, field31 string, field32 string, field33 string, field34 string, field35 string, field36 string, field37 string, field38 string, field39 string, field40 string, field41 string, field42 string, field43 string, field44 string, field45 string, field46 string, field47 string, field48 string, field49 string"
    > spark.conf.set("spark.sql.json.useUnsafeRow", "false")
    > spark.read.schema(schema).option("multiline", "true").json("json100000.gz").selectExpr("sum(hash(struct(*)))").collect()
    ```
    
    When the flag is off (the current behavior), the query can process 2.5e5 rows but fails to process 3e5 rows. When the flag is on, the query can process 8e5 rows but fails to process 9e5 rows. We can say this change reduces the memory consumption to about 1/3.
    
    ### Why are the changes needed?
    
    It reduces the memory requirement of JSON-related queries.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    A new JSON unit test with the config flag on.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47310 from chenhao-db/json_unsafe_row.
    
    Authored-by: Chenhao Li <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    chenhao-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    8510760 View commit details
    Browse the repository at this point in the history
  113. [SPARK-48896][ML][MLLIB] Avoid repartition when writing out the metadata

    ### What changes were proposed in this pull request?
    
    This PR proposes to remove `repartition(1)` when writing metadata in ML/MLlib. It already writes one file.
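    
    As an illustration, here is a minimal, self-contained sketch of the idea (the output path and metadata content are illustrative, and this is not the actual ML writer code): a single-row local Dataset already yields a single partition, so `repartition(1)` adds nothing but a shuffle.
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    object MetadataWriteSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("metadata-sketch").getOrCreate()
        import spark.implicits._
    
        val metadataJson = """{"class":"org.apache.spark.ml.feature.Tokenizer","sparkVersion":"4.0.0"}"""
        // Before: Seq(metadataJson).toDF("value").repartition(1).write.text(path)
        // After: the single-row local relation is already a single partition.
        val df = Seq(metadataJson).toDF("value")
        println(df.rdd.getNumPartitions) // 1
        df.write.mode("overwrite").text("/tmp/metadata-sketch")
    
        spark.stop()
      }
    }
    ```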
    
    ### Why are the changes needed?
    
    In order to remove unnecessary shuffle, see also apache#47341
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests should verify them.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47347 from HyukjinKwon/SPARK-48896.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    bb5267c View commit details
    Browse the repository at this point in the history
  114. [SPARK-48909][ML][MLLIB] Uses SparkSession over SparkContext when wri…

    …ting metadata
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to use SparkSession over SparkContext when writing metadata
    
    ### Why are the changes needed?
    
    See apache#47347 (comment)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests should cover it.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47366 from HyukjinKwon/SPARK-48909.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    a402f51 View commit details
    Browse the repository at this point in the history
  115. [SPARK-48892][ML] Avoid per-row param read in Tokenizer

    ### What changes were proposed in this pull request?
    Inspired by apache#47258, I checked other ML implementations and found that we can also optimize `Tokenizer` in the same way.
    
    ### Why are the changes needed?
    The function `createTransformFunc` builds the UDF used by `UnaryTransformer.transform`:
    https://github.com/apache/spark/blob/d679dabdd1b5ad04b8c7deb1c06ce886a154a928/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L118
    
    The existing implementation reads the params for each row.
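    
    Below is a simplified, hypothetical sketch of the optimization (not the actual `Tokenizer` code): the per-row function closes over param values that are read once up front.
    
    ```scala
    // Hypothetical stand-in types; the point is where the param lookup happens.
    object ParamReadSketch {
      // stand-in for an ML Param lookup, which has non-trivial per-call cost
      final class Params(private val pattern: String) {
        def getPattern: String = pattern
      }
    
      // Before: the lambda re-reads the param for every input row.
      def transformFuncBefore(p: Params): String => Array[String] =
        (row: String) => row.split(p.getPattern)
    
      // After: the param is read once and the lambda closes over the plain value.
      def transformFuncAfter(p: Params): String => Array[String] = {
        val pattern = p.getPattern
        (row: String) => row.split(pattern)
      }
    
      def main(args: Array[String]): Unit = {
        val f = transformFuncAfter(new Params("-"))
        println(f("a-b-c").mkString(",")) // prints: a,b,c
      }
    }
    ```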
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    CI and manual tests:
    
    create test dataset
    ```
    spark.range(1000000).select(uuid().as("uuid")).write.mode("overwrite").parquet("/tmp/regex_tokenizer.parquet")
    ```
    
    duration
    ```
    val df = spark.read.parquet("/tmp/regex_tokenizer.parquet")
    import org.apache.spark.ml.feature._
    val tokenizer = new RegexTokenizer().setPattern("-").setInputCol("uuid")
    Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()) // warm up
    val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic
    ```
    
    result (before this PR)
    ```
    scala> val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic
    val tic: Long = 1720613235068
    val res5: Long = 50397
    ```
    
    result (after this PR)
    ```
    scala> val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic
    val tic: Long = 1720612871256
    val res5: Long = 43748
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47342 from zhengruifeng/opt_tokenizer.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    62c217a View commit details
    Browse the repository at this point in the history
  116. [SPARK-48883][ML][R] Replace RDD read / write API invocation with Dat…

    …aframe read / write API
    
    ### What changes were proposed in this pull request?
    
    This PR is a retry of apache#47328, which replaces the RDD API with the Dataset API for writing SparkR metadata; this PR additionally removes `repartition(1)`. We don't actually need it when the input is a single row, since that creates only a single partition:
    
    https://github.com/apache/spark/blob/e5e751b98f9ef5b8640079c07a9a342ef471d75d/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala#L49-L57
    
    ### Why are the changes needed?
    
    In order to leverage the Catalyst optimizer and SQL engine. For example, we now use UTF-8 encoding instead of plain JDK serialization/deserialization for strings. We have made similar changes in the past, e.g., apache#29063, apache#15813, apache#17255 and SPARK-19918.
    
    Also, we remove `repartition(1)` to avoid an unnecessary shuffle.
    
    With `repartition(1)`:
    
    ```
    == Physical Plan ==
    AdaptiveSparkPlan isFinalPlan=false
    +- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=6]
       +- LocalTableScan [_1#0]
    ```
    
    Without `repartition(1)`:
    
    ```
    == Physical Plan ==
    LocalTableScan [_1#2]
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    CI in this PR should verify the change
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47341 from HyukjinKwon/SPARK-48883-followup.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    c9c0ab2 View commit details
    Browse the repository at this point in the history
  117. [SPARK-48903][SS] Set the RocksDB last snapshot version correctly on …

    …remote load
    
    ### What changes were proposed in this pull request?
    Set the RocksDB last snapshot version correctly on remote load
    
    ### Why are the changes needed?
    Avoid creating a full snapshot on the first batch after every restart, and also reset a snapshot that is likely no longer valid.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added unit tests
    ```
    ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.RocksDBSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-8 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), ForkJoinPool.common...
    [info] Run completed in 4 minutes, 40 seconds.
    [info] Total number of tests run: 176
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 176, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47363 from anishshri-db/task/SPARK-48903.
    
    Authored-by: Anish Shrigondekar <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    anishshri-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b9254d0 View commit details
    Browse the repository at this point in the history
  118. [SPARK-48510][CONNECT][FOLLOW-UP] Fix for UDAF toColumn API when ru…

    …nning tests in Maven
    
    ### What changes were proposed in this pull request?
    
    This PR fixes an issue where the TypeTag lookup during `udaf.toColumn` failed in the Maven test environment with the following error:
    
    >   java.lang.IllegalArgumentException: Type tag defined in [JavaMirror with jdk.internal.loader.ClassLoaders$AppClassLoader1dbd16a6 of type class jdk.internal.loader.ClassLoaders$AppClassLoader with classpath [<unknown>] and parent being jdk.internal.loader.ClassLoaders$PlatformClassLoader6bd61f98 of type class jdk.internal.loader.ClassLoaders$PlatformClassLoader with classpath [<unknown>] and parent being primordial classloader with boot classpath [<unknown>]] cannot be migrated to another mirror [JavaMirror <ins>with java.net.URLClassLoader5a4041cc of type class java.net.URLClassLoader with classpath [file:/\<redacted\>/spark/connector/connect/client/jvm/target/scala-2.13/classes/,file:/\<redacted\>/spark/connector/connect/client/jvm/target/scala-2.13/test-classes/]</ins> and parent being jdk.internal.loader.ClassLoaders$AppClassLoader1dbd16a6 of type class jdk.internal.loader.ClassLoaders$AppClassLoader with classpath [<unknown>] and parent being jdk.internal.loader.ClassLoaders$PlatformClassLoader6bd61f98 of type class jdk.internal.loader.ClassLoaders$PlatformClassLoader with classpath [<unknown>] and parent being primordial classloader with boot classpath [<unknown>]].
    
    The problem is caused by Maven adding a `URLClassLoader` on top of the original `AppClassLoader` (see the underlined text in the above error message).
    
    This PR changes the mirror-matching logic from `eq` to `hasCommonAncestors`.
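    
    For illustration, here is a hypothetical sketch of the idea behind matching mirrors by common classloader ancestry (this is not the actual Spark code, and the helper names are made up):
    
    ```scala
    // Accept two mirrors if their classloaders' parent chains share any loader,
    // instead of requiring the loaders to be the very same instance (`eq`).
    object ClassLoaderAncestrySketch {
      private def chain(cl: ClassLoader): List[ClassLoader] =
        Iterator.iterate(cl)(_.getParent).takeWhile(_ != null).toList
    
      def hasCommonAncestor(a: ClassLoader, b: ClassLoader): Boolean = {
        val ancestorsOfA = chain(a).toSet
        chain(b).exists(ancestorsOfA.contains)
      }
    
      def main(args: Array[String]): Unit = {
        val app = getClass.getClassLoader
        val child = new java.net.URLClassLoader(Array.empty[java.net.URL], app)
        println(hasCommonAncestor(child, app)) // true: both chains contain `app`
      }
    }
    ```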
    
    ### Why are the changes needed?
    
    The previous logic fails in the Maven test environment.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47368 from xupefei/udaf-tocolumn-fixup.
    
    Authored-by: Paddy Xu <[email protected]>
    Signed-off-by: Haejoon Lee <[email protected]>
    xupefei authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    66eaa7f View commit details
    Browse the repository at this point in the history
  119. [SPARK-47307][DOCS][FOLLOWUP] Add a migration guide for the behavior …

    …change of base64 function
    
    ### What changes were proposed in this pull request?
    
    Follow up to apache#47303
    
    Add a migration guide for the behavior change of `base64` function
    
    ### Why are the changes needed?
    Users need a migration guide entry describing the behavior change of the `base64` function introduced in apache#47303.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    doc change
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47371 from wForget/SPARK-47307_doc.
    
    Authored-by: wforget <[email protected]>
    Signed-off-by: allisonwang-db <[email protected]>
    wForget authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    e9c1474 View commit details
    Browse the repository at this point in the history
  120. [SPARK-48889][SS] testStream to unload state stores before finishing

    ### What changes were proposed in this pull request?
    At the end of each testStream() call, unload all state stores from the executor.
    
    ### Why are the changes needed?
    Currently, after a test, we don't unload state stores or disable the maintenance task. So after a test, the maintenance task can run and fail because the checkpoint directory has already been deleted. This might cause an issue and fail the next test.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Verified by existing tests passing.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47339 from siying/SPARK-48889.
    
    Authored-by: Siying Dong <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    siying authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    cc9c851 View commit details
    Browse the repository at this point in the history
  121. [SPARK-48865][SQL] Add try_url_decode function

    ### What changes were proposed in this pull request?
    
    Add a `try_url_decode` function that performs the same operation as `url_decode`, but returns a NULL value instead of raising an error if the decoding cannot be performed.
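    
    A minimal usage sketch in spark-shell, assuming a build that includes this change and an active SparkSession `spark`:
    
    ```scala
    spark.sql("SELECT url_decode('https%3A%2F%2Fspark.apache.org')").show(truncate = false)
    // decodes to: https://spark.apache.org
    
    spark.sql("SELECT try_url_decode('test%1')").show(truncate = false)
    // 'test%1' is not valid percent-encoding, so the result is NULL instead of an error
    ```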
    
    ### Why are the changes needed?
    
    In Hive, we usually do URL decoding like `reflect('java.net.URLDecoder', 'decode', 'test%1')`, which returns a `NULL` value instead of raising an error if the decoding cannot be performed.
    
    Although Spark provides a `try_reflect` function for this, as commented in apache#34023 (comment), the `reflect` function may prevent partition pruning from taking effect. So I propose to add a new `try_url_decode` function.
    
    ### Does this PR introduce _any_ user-facing change?
    
    add a new function
    
    ### How was this patch tested?
    
    added tests and did manual testing
    
    spark-sql:
    ![image](https://github.com/apache/spark/assets/17894939/0ffd3aa2-98f7-4af4-b478-67002b8b0d4b)
    
    pyspark:
    ![image](https://github.com/apache/spark/assets/17894939/d2c1926b-f9a0-422c-abc9-5f224d822811)
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47294 from wForget/try_url_decode.
    
    Lead-authored-by: wforget <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f99e678 View commit details
    Browse the repository at this point in the history
  122. [SPARK-48923][SQL][TESTS] Fix the incorrect logic of `CollationFactor…

    …ySuite`
    
    ### What changes were proposed in this pull request?
    The pr aims to fix the incorrect logic of `CollationFactorySuite`.
    
    ### Why are the changes needed?
    This only fixes the incorrect test logic in `CollationFactorySuite`.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Update existed UT.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47382 from panbingkun/fix_CollationFactorySuite.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    8ad09d4 View commit details
    Browse the repository at this point in the history
  123. [SPARK-48907][SQL] Fix the value explicitTypes in `COLLATION_MISMAT…

    …CH.EXPLICIT`
    
    ### What changes were proposed in this pull request?
    The pr aims to
    - fix the value `explicitTypes` in `COLLATION_MISMATCH.EXPLICIT`.
    - use `checkError` to check exception in `CollationSQLExpressionsSuite` and `CollationStringExpressionsSuite`.
    
    ### Why are the changes needed?
    This only fixes a bug, e.g.:
    ```
    SELECT concat_ws(' ', collate('Spark', 'UTF8_LCASE'), collate('SQL', 'UNICODE'))
    ```
    
    - Before:
      ```
      [COLLATION_MISMATCH.EXPLICIT] Could not determine which collation to use for string functions and operators. Error occurred due to the mismatch between explicit collations: `string collate UTF8_LCASE`.`string collate UNICODE`. Decide on a single explicit collation and remove others. SQLSTATE: 42P21
      ```
      <img width="747" alt="image" src="https://github.com/user-attachments/assets/4e026cb5-2875-4370-9bb9-878f0b607f41">
    
    - After:
      ```
      [COLLATION_MISMATCH.EXPLICIT] Could not determine which collation to use for string functions and operators. Error occurred due to the mismatch between explicit collations: [`string collate UTF8_LCASE`, `string collate UNICODE`]. Decide on a single explicit collation and remove others. SQLSTATE: 42P21
      ```
      <img width="738" alt="image" src="https://github.com/user-attachments/assets/86f489a2-9f2d-4f59-bdb1-95c051a93ee8">
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Updated existed UT.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47365 from panbingkun/SPARK-48907.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    aa5d0e0 View commit details
    Browse the repository at this point in the history
  124. [SPARK-48927][CORE] Show the number of cached RDDs in StoragePage

    ### What changes were proposed in this pull request?
    
    This PR aims to show the number of cached RDDs in `StoragePage` like the other `Jobs` page or `Stages` page.
    
    ### Why are the changes needed?
    
    To improve the UX by providing additional summary information in a consistent way.
    
    **BEFORE**
    
    ![Screenshot 2024-07-17 at 09 46 44](https://github.com/user-attachments/assets/3e57bf91-e97d-404d-aeda-159ab9cb65e3)
    
    **AFTER**
    
    ![Screenshot 2024-07-17 at 09 46 01](https://github.com/user-attachments/assets/d416ea16-8255-48d8-ade4-624dcac8f46e)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manual review.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47390 from dongjoon-hyun/SPARK-48927.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    47de346 View commit details
    Browse the repository at this point in the history
  125. [SPARK-48930][CORE] Redact awsAccessKeyId by including accesskey

    …pattern
    
    ### What changes were proposed in this pull request?
    
    This PR aims to redact `awsAccessKeyId` by including `accesskey` pattern.
    
    - **Apache Spark 4.0.0-preview1**
    There is no point in redacting `fs.s3a.access.key` because the same value is exposed via `fs.s3.awsAccessKeyId`, as shown below. We need to redact them all.
    
    ```
    $ AWS_ACCESS_KEY_ID=A AWS_SECRET_ACCESS_KEY=B bin/spark-shell
    ```
    
    ![Screenshot 2024-07-17 at 12 45 44](https://github.com/user-attachments/assets/e3040c5d-3eb9-4944-a6d6-5179b7647426)
    
    ### Why are the changes needed?
    
    Since Apache Spark 1.1.0, `AWS_ACCESS_KEY_ID` has been propagated as shown below. However, Apache Spark does not redact all of these configurations consistently.
    - apache#450
    
    https://github.com/apache/spark/blob/5d16c3134c442a5546251fd7c42b1da9fdf3969e/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L481-L486
    
    ### Does this PR introduce _any_ user-facing change?
    
    Users may see more redactions on configurations whose name contains `accesskey` case-insensitively. However, those configurations are highly likely to be related to the credentials.
    
    ### How was this patch tested?
    
    Pass the CIs with the newly added test cases.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47392 from dongjoon-hyun/SPARK-48930.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    3f95864 View commit details
    Browse the repository at this point in the history
  126. [SPARK-48924][PS] Add a pandas-like make_interval helper function

    ### What changes were proposed in this pull request?
    Add a pandas-like `make_interval` helper function
    
    ### Why are the changes needed?
    factor it out as a helper function to be reusable
    
    ### Does this PR introduce _any_ user-facing change?
    No, internal change only
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47385 from zhengruifeng/ps_simplify_make_interval.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    425ada9 View commit details
    Browse the repository at this point in the history
  127. [SPARK-48510][CONNECT][FOLLOW-UP-MK2] Fix for UDAF toColumn API whe…

    …n running tests in Maven
    
    ### What changes were proposed in this pull request?
    
    This PR follows apache#47368 as another attempt to fix the broken tests. The previous attempt failed due to an NPE, caused by `Iterator.iterate` generating an **infinite** flow of values.
    
    I can't reproduce the previous issue locally, so my fix is purely based on the error message: https://github.com/apache/spark/actions/runs/9974746135/job/27562881993.
    
    ### Why are the changes needed?
    
    Because previous one failed.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Locally.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47387 from xupefei/udaf-tocolumn-fixup-mk2.
    
    Authored-by: Paddy Xu <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    xupefei authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    45aca76 View commit details
    Browse the repository at this point in the history
  128. [SPARK-48926][SQL][TESTS] Use checkError method to optimize excepti…

    …on check logic related to `UNRESOLVED_COLUMN` error classes
    
    ### What changes were proposed in this pull request?
    
    This PR aims to use the `checkError` method to optimize the exception-checking logic related to the `UNRESOLVED_COLUMN` error classes.
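    
    For illustration, a hedged sketch of what such a test can look like (the suite name, table, and expected parameter values are made up for this example; only the `checkError` pattern is the point):
    
    ```scala
    import org.apache.spark.sql.{AnalysisException, QueryTest}
    import org.apache.spark.sql.test.SharedSparkSession
    
    class UnresolvedColumnCheckSketch extends QueryTest with SharedSparkSession {
      test("unresolved column is reported via the UNRESOLVED_COLUMN error class") {
        withTable("t") {
          sql("CREATE TABLE t(a INT, b INT) USING parquet")
          // assert on the error class and its parameters instead of matching raw messages
          checkError(
            exception = intercept[AnalysisException] { sql("SELECT c FROM t") },
            errorClass = "UNRESOLVED_COLUMN.WITH_SUGGESTION",
            parameters = Map("objectName" -> "`c`", "proposal" -> "`a`, `b`"))
        }
      }
    }
    ```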
    
    ### Why are the changes needed?
    
    Unify the error class checks by using the `checkError` method.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass related test cases.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47389 from wayneguow/op_un_col.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    d7605f5 View commit details
    Browse the repository at this point in the history
  129. [SPARK-48932][BUILD] Upgrade commons-lang3 to 3.15.0

    ### What changes were proposed in this pull request?
    The pr aims to upgrade `commons-lang3` from `3.14.0` to `3.15.0`
    
    ### Why are the changes needed?
    - v3.14.0 VS v3.15.0
      apache/commons-lang@rel/commons-lang-3.14.0...rel/commons-lang-3.15.0
    
    - The new version brings some bug fixes, e.g.:
      apache/commons-lang#1140
      apache/commons-lang#1151
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47396 from panbingkun/SPARK-48932.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b0e535b View commit details
    Browse the repository at this point in the history
  130. [SPARK-48915][SQL][TESTS] Add some uncovered predicates(!=, <=, >, >=…

    …) in test cases of `GeneratedSubquerySuite`
    
    ### What changes were proposed in this pull request?
    
    This PR aims to add some predicates (!=, <=, >, >=) that are not covered in the test cases of `GeneratedSubquerySuite`.
    
    ### Why are the changes needed?
    
    Better coverage of current subquery tests in `GeneratedSubquerySuite`.
    For more information about subqueries in `postgresql`, refer to:
    https://www.postgresql.org/docs/current/functions-subquery.html#FUNCTIONS-SUBQUERY
    https://www.postgresql.org/docs/current/functions-comparisons.html#ROW-WISE-COMPARISON
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA and Manual testing with `GeneratedSubquerySuite`.
    ![image](https://github.com/user-attachments/assets/4b265def-a7a9-405e-94ce-e9902efb79fa)
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47386 from wayneguow/SPARK-48915.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    605b2f6 View commit details
    Browse the repository at this point in the history
  131. [SPARK-48900] Add reason field for cancelJobGroup and `cancelJobs…

    …WithTag`
    
    ### What changes were proposed in this pull request?
    
    This PR introduces the optional `reason` field for `cancelJobGroup` and `cancelJobsWithTag` in `SparkContext.scala`, while keeping the old APIs without the `reason`, similar to how `cancelJob` is implemented currently.
    
    ### Why are the changes needed?
    
    Today it is difficult to determine why a job, stage, or job group was canceled. We should leverage existing Spark functionality to provide a reason string explaining the cancellation cause, and should add new APIs to let us provide this reason when canceling job groups.
    
    **Details:**
    
    Since [SPARK-19549](https://issues.apache.org/jira/browse/SPARK-19549) "Allow providing reasons for stage/job cancelling" (Spark 2.2.0), Spark’s `cancelJob` and `cancelStage` methods accept an optional `reason: String` that is added to logging output and user-facing error messages when jobs or stages are canceled. In our internal calls to these methods, we should always supply a reason. For example, we should set an appropriate reason when the “kill” links are clicked in the Spark UI (see [code](https://github.com/apache/spark/blob/b14c1f036f8f394ad1903998128c05d04dd584a9/core/src/main/scala/org/apache/spark/ui/jobs/JobsTab.scala#L54C1-L55)).
    Other APIs currently lack a reason field. For example, `cancelJobGroup` and `cancelJobsWithTag` don’t provide any way to specify a reason, so we only see generic logs like “asked to cancel job group <group name>”. We should add the ability to pass in a group cancellation reason and thread it through into the scheduler’s logging and job failure reasons.
    
    This feature can be implemented in two PRs:
    
    1. Modify the current SparkContext and its downstream APIs to add the reason string, such as cancelJobGroup and cancelJobsWithTag
    
    2. Add reasons for all internal calls to these methods.
    
    **Note: This is the first of the two PRs to implement this new feature**
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it modifies the SparkContext API, allowing users to add an optional `reason: String` to `cancelJobsWithTag` and `cancelJobGroup`, while the old methods without the `reason` are also kept. This creates a more uniform interface where the user can supply an optional reason for all job/stage cancellation calls.
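    
    A short usage sketch of the new overloads, assuming a running `SparkContext` `sc` and the signatures added by this PR (the group name, tag, and reason strings are illustrative):
    
    ```scala
    sc.setJobGroup("nightly-etl", "nightly ETL jobs")
    // ... trigger jobs in this group ...
    sc.cancelJobGroup("nightly-etl", "superseded by an ad-hoc backfill") // new: with a reason
    sc.cancelJobGroup("nightly-etl")                                     // old API still works
    
    sc.addJobTag("adhoc-report")
    // ... trigger jobs carrying this tag ...
    sc.cancelJobsWithTag("adhoc-report", "requested by user")            // new: with a reason
    ```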
    
    ### How was this patch tested?
    
    New tests are added to `JobCancellationSuite` to test the reason fields for these calls.
    
    For the API changes in R and PySpark, tests are added to these files:
    - R/pkg/tests/fulltests/test_context.R
    - python/pyspark/tests/test_pin_thread.py
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
     No
    
    Closes apache#47361 from mingkangli-db/reason_job_cancellation.
    
    Authored-by: Mingkang Li <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    mingkangli-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    cb76d36 View commit details
    Browse the repository at this point in the history
  132. [SPARK-48623][CORE] Migrate FileAppender logs to structured logging

    ### What changes were proposed in this pull request?
    This PR migrates `src/main/scala/org/apache/spark/util/logging/FileAppender.scala` to comply with the scala style changes in apache#46947
    
    ### Why are the changes needed?
    This makes development and PR review of the structured logging migration easier.
    
    ### Does this PR introduce any user-facing change?
    No
    
    ### How was this patch tested?
    Tested by ensuring dev/scalastyle checks pass
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47394 from asl3/asl3/migratenewfiles.
    
    Authored-by: Amanda Liu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    asl3 authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    9f5f42a View commit details
    Browse the repository at this point in the history
  133. [SPARK-48752][PYTHON][CONNECT][DOCS] Introduce pyspark.logger for i…

    …mproved structured logging for PySpark
    
    ### What changes were proposed in this pull request?
    
    This PR introduces the `pyspark.logger` module to facilitate structured client-side logging for PySpark users.
    
    This module includes a `PySparkLogger` class that provides several methods for logging messages at different levels in a structured JSON format:
    - `PySparkLogger.info`
    - `PySparkLogger.warning`
    - `PySparkLogger.error`
    
    The logger can be easily configured to write logs to either the console or a specified file.
    
    ## DataFrame error log improvement
    
    This PR also improves the DataFrame API error logs by leveraging this new logging framework:
    
    ### **Before**
    
    We introduced structured logging in apache#45729, but the PySpark log is still hard to locate in the current structured logs, because it is hidden and mixed within a bunch of complex JVM stacktraces, and it's also not very Python-friendly:
    
    ```json
    {
      "ts": "2024-06-28T10:53:48.528Z",
      "level": "ERROR",
      "msg": "Exception in task 7.0 in stage 0.0 (TID 7)",
      "context": {
        "task_name": "task 7.0 in stage 0.0 (TID 7)"
      },
      "exception": {
        "class": "org.apache.spark.SparkArithmeticException",
        "msg": "[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"__truediv__\" was called from\n/.../spark/python/test_error_context.py:17\n",
        "stacktrace": [
          {
            "class": "org.apache.spark.sql.errors.QueryExecutionErrors$",
            "method": "divideByZeroError",
            "file": "QueryExecutionErrors.scala",
            "line": 203
          },
          {
            "class": "org.apache.spark.sql.errors.QueryExecutionErrors",
            "method": "divideByZeroError",
            "file": "QueryExecutionErrors.scala",
            "line": -1
          },
          {
            "class": "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1",
            "method": "project_doConsume_0$",
            "file": null,
            "line": -1
          },
          {
            "class": "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1",
            "method": "processNext",
            "file": null,
            "line": -1
          },
          {
            "class": "org.apache.spark.sql.execution.BufferedRowIterator",
            "method": "hasNext",
            "file": "BufferedRowIterator.java",
            "line": 43
          },
          {
            "class": "org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1",
            "method": "hasNext",
            "file": "WholeStageCodegenEvaluatorFactory.scala",
            "line": 50
          },
          {
            "class": "org.apache.spark.sql.execution.SparkPlan",
            "method": "$anonfun$getByteArrayRdd$1",
            "file": "SparkPlan.scala",
            "line": 388
          },
          {
            "class": "org.apache.spark.rdd.RDD",
            "method": "$anonfun$mapPartitionsInternal$2",
            "file": "RDD.scala",
            "line": 896
          },
          {
            "class": "org.apache.spark.rdd.RDD",
            "method": "$anonfun$mapPartitionsInternal$2$adapted",
            "file": "RDD.scala",
            "line": 896
          },
          {
            "class": "org.apache.spark.rdd.MapPartitionsRDD",
            "method": "compute",
            "file": "MapPartitionsRDD.scala",
            "line": 52
          },
          {
            "class": "org.apache.spark.rdd.RDD",
            "method": "computeOrReadCheckpoint",
            "file": "RDD.scala",
            "line": 369
          },
          {
            "class": "org.apache.spark.rdd.RDD",
            "method": "iterator",
            "file": "RDD.scala",
            "line": 333
          },
          {
            "class": "org.apache.spark.scheduler.ResultTask",
            "method": "runTask",
            "file": "ResultTask.scala",
            "line": 93
          },
          {
            "class": "org.apache.spark.TaskContext",
            "method": "runTaskWithListeners",
            "file": "TaskContext.scala",
            "line": 171
          },
          {
            "class": "org.apache.spark.scheduler.Task",
            "method": "run",
            "file": "Task.scala",
            "line": 146
          },
          {
            "class": "org.apache.spark.executor.Executor$TaskRunner",
            "method": "$anonfun$run$5",
            "file": "Executor.scala",
            "line": 644
          },
          {
            "class": "org.apache.spark.util.SparkErrorUtils",
            "method": "tryWithSafeFinally",
            "file": "SparkErrorUtils.scala",
            "line": 64
          },
          {
            "class": "org.apache.spark.util.SparkErrorUtils",
            "method": "tryWithSafeFinally$",
            "file": "SparkErrorUtils.scala",
            "line": 61
          },
          {
            "class": "org.apache.spark.util.Utils$",
            "method": "tryWithSafeFinally",
            "file": "Utils.scala",
            "line": 99
          },
          {
            "class": "org.apache.spark.executor.Executor$TaskRunner",
            "method": "run",
            "file": "Executor.scala",
            "line": 647
          },
          {
            "class": "java.util.concurrent.ThreadPoolExecutor",
            "method": "runWorker",
            "file": "ThreadPoolExecutor.java",
            "line": 1136
          },
          {
            "class": "java.util.concurrent.ThreadPoolExecutor$Worker",
            "method": "run",
            "file": "ThreadPoolExecutor.java",
            "line": 635
          },
          {
            "class": "java.lang.Thread",
            "method": "run",
            "file": "Thread.java",
            "line": 840
          }
        ]
      },
      "logger": "Executor"
    }
    
    ```
    
    ### **After**
    
    Now we can get an improved, simplified, and Python-friendly error log for DataFrame errors:
    
    ```json
    {
      "ts": "2024-06-28 19:53:48,563",
      "level": "ERROR",
      "logger": "DataFrameQueryContextLogger",
      "msg": "[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"__truediv__\" was called from\n/.../spark/python/test_error_context.py:17\n",
      "context": {
        "file": "/.../spark/python/test_error_context.py",
        "line_no": "17",
        "fragment": "__truediv__"
        "error_class": "DIVIDE_BY_ZERO"
      },
      "exception": {
        "class": "Py4JJavaError",
        "msg": "An error occurred while calling o52.showString.\n: org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"__truediv__\" was called from\n/Users/haejoon.lee/Desktop/git_repos/spark/python/test_error_context.py:22\n\n\tat org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:203)\n\tat org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)\n\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)\n\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)\n\tat org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)\n\tat org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)\n\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896)\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:896)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:369)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:333)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:146)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n\tat org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1007)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2479)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2498)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2523)\n\tat org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1052)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n\tat org.apache.spark.rdd.RDD.withScope(RDD.scala:412)\n\tat org.apache.spark.rdd.RDD.collect(RDD.scala:1051)\n\tat org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:448)\n\tat org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4449)\n\tat org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3393)\n\tat org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4439)\n\tat 
org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:599)\n\tat org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4437)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:154)\n\tat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:263)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:118)\n\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:74)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:218)\n\tat org.apache.spark.sql.Dataset.withAction(Dataset.scala:4437)\n\tat org.apache.spark.sql.Dataset.head(Dataset.scala:3393)\n\tat org.apache.spark.sql.Dataset.take(Dataset.scala:3626)\n\tat org.apache.spark.sql.Dataset.getRows(Dataset.scala:294)\n\tat org.apache.spark.sql.Dataset.showString(Dataset.scala:330)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:568)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)\n\tat py4j.ClientServerConnection.run(ClientServerConnection.java:106)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n",
        "stacktrace": ["Traceback (most recent call last):", "  File \"/Users/haejoon.lee/Desktop/git_repos/spark/python/pyspark/errors/exceptions/captured.py\", line 272, in deco", "    return f(*a, **kw)", "  File \"/Users/haejoon.lee/anaconda3/envs/pyspark-dev-env/lib/python3.9/site-packages/py4j/protocol.py\", line 326, in get_return_value", "    raise Py4JJavaError(", "py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString.", ": org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012", "== DataFrame ==", "\"__truediv__\" was called from", "/Users/haejoon.lee/Desktop/git_repos/spark/python/test_error_context.py:22", "", "\tat org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:203)", "\tat org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)", "\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)", "\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)", "\tat org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)", "\tat org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)", "\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)", "\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896)", "\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:896)", "\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)", "\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:369)", "\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:333)", "\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)", "\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)", "\tat org.apache.spark.scheduler.Task.run(Task.scala:146)", "\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)", "\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)", "\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)", "\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)", "\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)", "\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)", "\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)", "\tat java.base/java.lang.Thread.run(Thread.java:840)", "\tat org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1007)", "\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)", "\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2479)", "\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2498)", "\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2523)", "\tat org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1052)", "\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)", "\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)", "\tat 
org.apache.spark.rdd.RDD.withScope(RDD.scala:412)", "\tat org.apache.spark.rdd.RDD.collect(RDD.scala:1051)", "\tat org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:448)", "\tat org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4449)", "\tat org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3393)", "\tat org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4439)", "\tat org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:599)", "\tat org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4437)", "\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:154)", "\tat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:263)", "\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:118)", "\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)", "\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:74)", "\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:218)", "\tat org.apache.spark.sql.Dataset.withAction(Dataset.scala:4437)", "\tat org.apache.spark.sql.Dataset.head(Dataset.scala:3393)", "\tat org.apache.spark.sql.Dataset.take(Dataset.scala:3626)", "\tat org.apache.spark.sql.Dataset.getRows(Dataset.scala:294)", "\tat org.apache.spark.sql.Dataset.showString(Dataset.scala:330)", "\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)", "\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)", "\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)", "\tat java.base/java.lang.reflect.Method.invoke(Method.java:568)", "\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)", "\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)", "\tat py4j.Gateway.invoke(Gateway.java:282)", "\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)", "\tat py4j.commands.CallCommand.execute(CallCommand.java:79)", "\tat py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)", "\tat py4j.ClientServerConnection.run(ClientServerConnection.java:106)", "\tat java.base/java.lang.Thread.run(Thread.java:840)"]
      }
    }
    ```
    
    ### Why are the changes needed?
    
    **Before**
    
    Currently we don't have a PySpark-dedicated logging module, so we have to manually set up and customize the Python logging module, for example:
    
    ```python
    logger = logging.getLogger("TestLogger")
    user = "test_user"
    action = "test_action"
    logger.info(f"User {user} takes an {action}")
    ```
    
    This logs the information as a plain, unstructured string:
    
    ```
    INFO:TestLogger:User test_user takes an test_action
    ```
    
    This is not very actionable, and it is hard to analyze since it is not well-structured.
    
    Alternatively, we can use Log4j from the JVM, which results in excessively detailed logs as shown in the example above, and that approach cannot even be applied to Spark Connect.
    
    **After**
    
    We can simply import and use `PySparkLogger` with minimal setup:
    
    ```python
    from pyspark.logger import PySparkLogger
    logger = PySparkLogger.getLogger("TestLogger")
    user = "test_user"
    action = "test_action"
    logger.info(f"User {user} takes an {action}", user=user, action=action)
    ```
    
    This logs the information in the following JSON format:
    
    ```json
    {
      "ts": "2024-06-28 19:44:19,030",
      "level": "WARNING",
      "logger": "TestLogger",
      "msg": "User test_user takes an test_action",
      "context": {
        "user": "test_user",
        "action": "test_action"
      }
    }
    ```
    
    **NOTE:** we can pass as many keyword arguments as we want to each logging method. These keyword arguments, such as `user` and `action` in the example, are included within the `"context"` field of the JSON log. This structure makes it easy to track and analyze the logs, for example as sketched below.
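
    Since each record is a single JSON object, the structured log can be loaded back into Spark and queried. A minimal, hedged sketch follows (the log path is a hypothetical JSON-lines file; the field names come from the example above):

    ```python
    # Hedged sketch: load the structured log back and filter on the "context" fields.
    logs = spark.read.json("/tmp/pyspark_app.log")  # path is an assumption
    logs.filter(logs.context.user == "test_user") \
        .select("ts", "level", "msg", "context.action") \
        .show(truncate=False)
    ```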
    
    ### Does this PR introduce _any_ user-facing change?
    
    No API changes, but the PySpark client-side logging is improved.
    
    Also added user-facing documentation "Logging in PySpark":
    
    <img width="1395" alt="Screenshot 2024-07-16 at 5 40 41 PM" src="https://github.com/user-attachments/assets/c77236aa-1c6f-4b5b-ad14-26ccdc474f59">
    
    Also added API reference:
    
    <img width="1417" alt="Screenshot 2024-07-16 at 5 40 58 PM" src="https://github.com/user-attachments/assets/6bb3fb23-6847-4086-8f4b-bcf9f4242724">
    
    ### How was this patch tested?
    
    Added UTs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47145 from itholic/pyspark_logger.
    
    Authored-by: Haejoon Lee <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    itholic authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    729ee6a View commit details
    Browse the repository at this point in the history
  134. [MINOR][SQL][TESTS] Enable test case testOrcAPI in `JavaDataFrameRe…

    …aderWriterSuite`
    
    ### What changes were proposed in this pull request?
    This PR enables the test case `testOrcAPI` in `JavaDataFrameReaderWriterSuite`. Since this test no longer depends on Hive classes, it can be run like the other test cases in this suite.
    
    ### Why are the changes needed?
    Enable test case `testOrcAPI` in `JavaDataFrameReaderWriterSuite`
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Pass GitHub Actions
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47400 from LuciferYang/minor-testOrcAPI.
    
    Authored-by: yangjie01 <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    LuciferYang authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    82a6708 View commit details
    Browse the repository at this point in the history
  135. [SPARK-36680][SQL][FOLLOWUP] Files with options should be put into re…

    …solveDataSource function
    
    ### What changes were proposed in this pull request?
    
    When reading CSV, JSON and other files, pass the options parameter to the rule's `resolveDataSource` method so that the options take effect.
    
    This is a bug fix for [apache#46707](apache#46707) szehon-ho
    
    ### Why are the changes needed?
    
    For the following SQL, the options passed in do not take effect. This is because the rule's `resolveDataSource` method does not pass the options parameter when constructing the data source.
    
    ```sql
    SELECT * FROM csv.`/test/data.csv` WITH (`header` = true, 'delimiter' = '|')
    ```
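
    For comparison, a minimal hedged sketch (not part of this PR) of the same options passed through the DataFrame reader; the path is the one from the SQL above:

    ```python
    # Reader-side equivalent of the WITH (`header` = true, 'delimiter' = '|') clause.
    spark.read.options(header=True, delimiter="|").csv("/test/data.csv").show()
    ```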
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Unit test in SQLQuerySuite
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47370 from logze/hint-options.
    
    Authored-by: lizongze <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    logze authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b42e9b6 View commit details
    Browse the repository at this point in the history
  136. [SPARK-48388][SQL] Fix SET statement behavior for SQL Scripts

    ### What changes were proposed in this pull request?
    The `SET` statement is used to set config values, and it has a poorly designed grammar rule `#setConfiguration` that matches everything after `SET` - `SET .*?`. This conflicts with the usage of `SET` for setting session variables, and we needed to introduce the `SET (VAR | VARIABLE)` grammar rule to distinguish between setting config values and session variables - [SET VAR pull request](apache#40474).
    
    However, this is not SQL-standard behavior, so for SQL scripting ([JIRA](https://issues.apache.org/jira/browse/SPARK-48338)) we are opting to disable `SET` for configs and use it only for session variables. This enables us to use plain `SET` for assigning values to session variables. Config values can still be set from SQL scripts using `EXECUTE IMMEDIATE`, as sketched below.
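
    A minimal, hedged sketch of the resulting behavior inside a SQL script (illustrative only, not taken from the PR; it assumes SQL scripting is enabled in this build and that `counter` is declared as a session variable):

    ```python
    # Inside a script, SET assigns to session variables; configs go through EXECUTE IMMEDIATE.
    spark.sql("""
    BEGIN
      DECLARE counter INT DEFAULT 0;
      SET counter = counter + 1;                             -- session variable
      EXECUTE IMMEDIATE 'SET spark.sql.ansi.enabled = true'; -- config value
    END
    """)
    ```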
    
    This change simply reorders grammar rules to achieve the above behavior, and alters only the visitor functions where the name of a rule had to be changed or a completely new rule was added.
    
    ### Why are the changes needed?
    These changes are meant to resolve the issues of the poorly designed `SET` statement for the case of SQL scripts.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    This PR is in a series of PRs that will introduce changes to sql() API to add support for SQL scripting, but for now, the API remains unchanged.
    In the future, the API will remain the same as well, but it will have new possibility to execute SQL scripts.
    
    ### How was this patch tested?
    Already existing tests should cover the changes.
    New tests for SQL scripts were added to:
    - `SqlScriptingParserSuite`
    - `SqlScriptingInterpreterSuite`
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Closes apache#47272 from davidm-db/sql_scripting_set_statement.
    
    Authored-by: David Milicevic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    davidm-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    7fe5a1a View commit details
    Browse the repository at this point in the history
  137. [SPARK-48890][CORE][SS] Add Structured Streaming related fields to lo…

    …g4j ThreadContext
    
    ### What changes were proposed in this pull request?
    
    There is some special information needed for structured streaming queries. Specifically, each query has a query_id and run_id. Also, if using MicroBatchExecution (the default), there is a batch_id.
    
    A (query_id, run_id, batch_id) tuple identifies the microbatch a streaming query runs. Adding these fields to the ThreadContext helps especially when there are multiple queries running.
    
    ### Why are the changes needed?
    
    Logging improvement
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Run a streaming query through spark-submit, here are sample logs (search for query_id, run_id, or batch_id):
    
    ```
    {"ts":"2024-07-15T19:56:01.577Z","level":"INFO","msg":"Starting new streaming query.","context":{"query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4"},"logger":"MicroBatchExecution"}
    {"ts":"2024-07-15T19:56:01.579Z","level":"INFO","msg":"Stream started from {}","context":{"query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4","streaming_offsets_start":"{}"},"logger":"MicroBatchExecution"}
    {"ts":"2024-07-15T19:56:01.602Z","level":"INFO","msg":"Writing atomically to file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/0 using temp file file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/.0.566e3ae0-a15e-438c-82c1-26cc109746b3.tmp","context":{"final_path":"file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/0","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4","temp_path":"file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/.0.566e3ae0-a15e-438c-82c1-26cc109746b3.tmp"},"logger":"CheckpointFileManager"}
    {"ts":"2024-07-15T19:56:01.675Z","level":"INFO","msg":"Renamed temp file file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/.0.566e3ae0-a15e-438c-82c1-26cc109746b3.tmp to file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/0","context":{"final_path":"file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/0","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4","temp_path":"file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/.0.566e3ae0-a15e-438c-82c1-26cc109746b3.tmp"},"logger":"CheckpointFileManager"}
    {"ts":"2024-07-15T19:56:01.676Z","level":"INFO","msg":"Committed offsets for batch 0. Metadata OffsetSeqMetadata(0,1721073361582,HashMap(spark.sql.streaming.stateStore.providerClass -> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider, spark.sql.streaming.stateStore.rocksdb.formatVersion -> 5, spark.sql.streaming.statefulOperator.useStrictDistribution -> true, spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion -> 2, spark.sql.streaming.multipleWatermarkPolicy -> min, spark.sql.streaming.aggregation.stateFormatVersion -> 2, spark.sql.shuffle.partitions -> 200, spark.sql.streaming.join.stateFormatVersion -> 2, spark.sql.streaming.stateStore.compression.codec -> lz4))","context":{"batch_id":"0","offset_sequence_metadata":"OffsetSeqMetadata(0,1721073361582,HashMap(spark.sql.streaming.stateStore.providerClass -> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider, spark.sql.streaming.stateStore.rocksdb.formatVersion -> 5, spark.sql.streaming.statefulOperator.useStrictDistribution -> true, spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion -> 2, spark.sql.streaming.multipleWatermarkPolicy -> min, spark.sql.streaming.aggregation.stateFormatVersion -> 2, spark.sql.shuffle.partitions -> 200, spark.sql.streaming.join.stateFormatVersion -> 2, spark.sql.streaming.stateStore.compression.codec -> lz4))","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4"},"logger":"MicroBatchExecution"}
    {"ts":"2024-07-15T19:56:02.074Z","level":"INFO","msg":"Code generated in 97.122375 ms","context":{"batch_id":"0","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4","total_time":"97.122375"},"logger":"CodeGenerator"}
    {"ts":"2024-07-15T19:56:02.125Z","level":"INFO","msg":"Start processing data source write support: MicroBatchWrite[epoch: 0, writer: org.apache.spark.sql.execution.datasources.noop.NoopStreamingWrite$20ba1e29]. The input RDD has 1} partitions.","context":{"batch_id":"0","batch_write":"MicroBatchWrite[epoch: 0, writer: org.apache.spark.sql.execution.datasources.noop.NoopStreamingWrite$20ba1e29]","count":"1","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4"},"logger":"WriteToDataSourceV2Exec"}
    {"ts":"2024-07-15T19:56:02.129Z","level":"INFO","msg":"Starting job: start at NativeMethodAccessorImpl.java:0","context":{"batch_id":"0","call_site_short_form":"start at NativeMethodAccessorImpl.java:0","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4"},"logger":"SparkContext"}
    {"ts":"2024-07-15T19:56:02.135Z","level":"INFO","msg":"Got job 0 (start at NativeMethodAccessorImpl.java:0) with 1 output partitions","context":{"call_site_short_form":"start at NativeMethodAccessorImpl.java:0","job_id":"0","num_partitions":"1"},"logger":"DAGScheduler"}
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47340 from WweiL/structured-logging-streaming-id-aware.
    
    Authored-by: Wei Liu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    WweiL authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    5ec284f View commit details
    Browse the repository at this point in the history
  138. [SPARK-48921][SQL] ScalaUDF encoders in subquery should be resolved f…

    …or MergeInto
    
    ### What changes were proposed in this pull request?
    
    We got a customer issue where a `MergeInto` query on an Iceberg table worked earlier but fails after upgrading to Spark 3.4.
    
    The error looks like
    
    ```
    Caused by: org.apache.spark.SparkRuntimeException: Error while decoding: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to nullable on unresolved object
    upcast(getcolumnbyordinal(0, StringType), StringType, - root class: java.lang.String).toString.
    ```
    
    The source table of `MergeInto` uses `ScalaUDF`. The error happens when Spark invokes the deserializer of input encoder of the `ScalaUDF` and the deserializer is not resolved yet.
    
    The encoders of ScalaUDF are resolved by the rule `ResolveEncodersInUDF` which will be applied at the end of analysis phase.
    
    During rewriting `MergeInto` to `ReplaceData` query, Spark creates an `Exists` subquery and `ScalaUDF` is part of the plan of the subquery. Note that the `ScalaUDF` is already resolved by the analyzer.
    
    Then, the `ResolveSubquery` rule, which resolves subqueries, only resolves the subquery plan if it is not resolved yet. Because the subquery containing the `ScalaUDF` is already resolved, the rule skips it, so `ResolveEncodersInUDF` won't be applied on it. As a result, the analyzed `ReplaceData` query contains a `ScalaUDF` with unresolved encoders, which causes the error.
    
    This patch modifies `ResolveSubquery` so it resolves the subquery plan if it is not fully analyzed, making sure the subquery plan is fully analyzed.
    
    This patch moves `ResolveEncodersInUDF` rule before rewriting `MergeInto` to make sure the `ScalaUDF` in the subquery plan is fully analyzed.
    
    ### Why are the changes needed?
    
    Fixing production query error.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, fixing user-facing issue.
    
    ### How was this patch tested?
    
    Manually tested with a `MergeInto` query and added a unit test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47380 from viirya/fix_subquery_resolve.
    
    Lead-authored-by: Liang-Chi Hsieh <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    bba6cea View commit details
    Browse the repository at this point in the history
  139. [SPARK-48934][SS] Python datetime types converted incorrectly for set…

    …ting timeout in applyInPandasWithState
    
    ### What changes were proposed in this pull request?
    Fix the way applyInPandasWithState's setTimeoutTimestamp() handles a datetime argument.
    
    ### Why are the changes needed?
    In applyInPandasWithState(), when state.setTimeoutTimestamp() is passed a value of datetime.datetime type, it doesn't function as expected. This change fixes it.
    It also fixes another bug where VALUE_NOT_POSITIVE is reported when the converted value is 0.
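
    A minimal sketch of the affected API, assuming a hypothetical streaming DataFrame `sdf` keyed by `id` with a watermark defined (required for `EventTimeTimeout`):

    ```python
    import datetime

    import pandas as pd
    from pyspark.sql.streaming.state import GroupStateTimeout

    def func(key, pdf_iter, state):
        # Passing a datetime.datetime here is the case fixed by this change.
        state.setTimeoutTimestamp(datetime.datetime(2024, 7, 22, 0, 0))
        total = sum(len(pdf) for pdf in pdf_iter)
        yield pd.DataFrame({"id": [key[0]], "count": [total]})

    out = sdf.groupBy("id").applyInPandasWithState(
        func,
        outputStructType="id string, count long",
        stateStructType="count long",
        outputMode="update",
        timeoutConf=GroupStateTimeout.EventTimeTimeout,
    )
    ```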
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Added unit test coverage for this scenario.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47398 from siying/state_set_timeout.
    
    Lead-authored-by: Siying Dong <[email protected]>
    Co-authored-by: Siying Dong <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    106c8d4 View commit details
    Browse the repository at this point in the history
  140. [SPARK-48495][DOCS][FOLLOW-UP] Fix Table Markdown in Shredding.md

    Minor change that shouldn't require a JIRA: fix the unbalanced row in the example table of Shredding.md.
    
    Closes apache#47407 from RussellSpitzer/patch-1.
    
    Authored-by: Russell Spitzer <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    RussellSpitzer authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    fceb916 View commit details
    Browse the repository at this point in the history
  141. [SPARK-48915][SQL][TESTS][FOLLOWUP] Add some uncovered predicates(!=,…

    … <, <=, >, >=) for correlation in `GeneratedSubquerySuite`
    
    ### What changes were proposed in this pull request?
    
    In PR apache#47386, we improved coverage of the predicate types of scalar subqueries in the WHERE clause.
    As a follow-up, this PR aims to add some uncovered predicates (!=, <, <=, >, >=) for correlation in `GeneratedSubquerySuite`.
    
    ### Why are the changes needed?
    
    Better coverage of current subquery tests with correlation in `GeneratedSubquerySuite`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47399 from wayneguow/SPARK-48915_follow_up.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    16b076e View commit details
    Browse the repository at this point in the history
  142. Revert "[SPARK-47307][DOCS][FOLLOWUP] Add a migration guide for the b…

    …ehavior change of base64 function"
    
    This reverts commit b2e0a4d.
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    9460db8 View commit details
    Browse the repository at this point in the history
  143. [SPARK-48933][BUILD] Upgrade protobuf-java to 3.25.3

    ### What changes were proposed in this pull request?
    The pr aims to upgrade `protobuf-java` from `3.25.1` to `3.25.3`.
    
    ### Why are the changes needed?
    - v3.25.1 vs v3.25.3:
      protocolbuffers/protobuf@v3.25.1...v3.25.3
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47397 from panbingkun/SPARK-48933.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    aa36786 View commit details
    Browse the repository at this point in the history
  144. [SPARK-48940][BUILD] Upgrade Arrow to 17.0.0

    ### What changes were proposed in this pull request?
    The pr aims to upgrade `arrow` from `16.1.0` to `17.0.0`.
    
    ### Why are the changes needed?
    The full release notes: https://arrow.apache.org/release/17.0.0.html
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47409 from panbingkun/SPARK-48940.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    e73b7f9 View commit details
    Browse the repository at this point in the history
  145. [SPARK-48498][SQL][FOLLOWUP] do padding for char-char comparison

    ### What changes were proposed in this pull request?
    
    This is a followup of apache#46832 to handle a missing case: char-char comparison. We should pad both sides if `READ_SIDE_CHAR_PADDING` is not enabled.
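
    A hedged illustration of the scenario (table names are hypothetical; `READ_SIDE_CHAR_PADDING` is assumed to correspond to the `spark.sql.readSideCharPadding` config):

    ```python
    spark.conf.set("spark.sql.readSideCharPadding", "false")
    spark.sql("CREATE TABLE t1 (c CHAR(5)) USING parquet")
    spark.sql("CREATE TABLE t2 (c CHAR(8)) USING parquet")
    # With this follow-up, both sides of the CHAR-CHAR comparison are padded before comparing.
    spark.sql("SELECT * FROM t1 JOIN t2 ON t1.c = t2.c").show()
    ```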
    
    ### Why are the changes needed?
    
    Bug fix for the case where people disable read-side char padding.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, because it's a follow-up and the original PR is not released yet.
    
    ### How was this patch tested?
    
    new tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47412 from cloud-fan/char.
    
    Authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    cloud-fan authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    16b32ed View commit details
    Browse the repository at this point in the history
  146. [SPARK-47307][SQL][FOLLOWUP] Promote spark.sql.legacy.chunkBase64Stri…

    …ng.enabled from a legacy/internal config to a regular/public one
    
    ### What changes were proposed in this pull request?
    
    + Promote spark.sql.legacy.chunkBase64String.enabled from a legacy/internal config to a regular/public one.
    + Add test cases for unbase64
    
    ### Why are the changes needed?
    
    Keep the same behavior as before. More details: apache#47303 (comment)
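
    A hedged sketch of the restored behavior (the config name is taken from this PR; the output shape is an assumption based on the MIME-style chunking discussed in the linked comment):

    ```python
    # With chunking enabled, base64 output longer than 76 characters is split across lines.
    spark.conf.set("spark.sql.legacy.chunkBase64String.enabled", "true")
    spark.sql("SELECT base64(repeat('a', 100)) AS encoded").show(truncate=False)
    ```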
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it reverts the behavior change introduced in apache#47303.
    
    ### How was this patch tested?
    
    existing unit test
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47410 from wForget/SPARK-47307_followup.
    
    Lead-authored-by: wforget <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    cfe2526 View commit details
    Browse the repository at this point in the history
  147. [SPARK-48946][SQL] NPE in redact method when session is null

    ### What changes were proposed in this pull request?
    
    If we call the DataSourceV2ScanExecBase redact method from a thread that doesn't have a session in its thread local, we get an NPE. Getting stringRedactionPattern from conf prevents this problem, as conf checks whether the session is null. We already use this approach in the DataSourceScanExec trait.
    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L93-L95
    
    ### Why are the changes needed?
    
    To prevent NPE when session is null.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing UTs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47419 from mikoszilard/SPARK-48946.
    
    Authored-by: Szilard Miko <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    mikoszilard authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    e7a1105 View commit details
    Browse the repository at this point in the history
  148. [SPARK-48836] Integrate SQL schema with state schema/metadata

    ### What changes were proposed in this pull request?
    
    This PR makes the TWS operator write the "real" schema of the state variables, initialized on the executors, into the `StateSchemaV3` file written by the driver. We'll integrate the SQL schema of the state variables with this [StateSchemaV3 implementation PR](apache#47104).
    
    ### Why are the changes needed?
    
    When reloading the state after query restart, we'll need the schema/encoder of the state variables before restart.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Integration tests in `TransformWithStateSuite` and `TransformWithStateTTLSuite` that tests with all state variable types to have correct schema. Existing integration tests in `TransformWith*State(TTL)Suite` for verifying SQL serialization is correct.
    Existing unit test suites & newly added unit suites in `ValueStateSuite`, `ListStateSuite`, `MapStateSuite`, `TimerSuite` for non-primitive types.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47257 from jingz-db/metadata-schema-compatible.
    
    Authored-by: jingz-db <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    36a3f96 View commit details
    Browse the repository at this point in the history
  149. [SPARK-48945][PYTHON] Simplify regex functions with lit

    ### What changes were proposed in this pull request?
    Simplify a group of functions with `lit`.
    
    ### Why are the changes needed?
    Code cleanup; these branchings are not necessary.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47417 from zhengruifeng/py_func_simplity_lit.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    887c645 View commit details
    Browse the repository at this point in the history
  150. [SPARK-48944][CONNECT] Unify the JSON-format schema handling in Conne…

    …ct Server
    
    ### What changes were proposed in this pull request?
    Simplify the JSON-format schema handling in Connect Server, by introducing a helper function `extractDataTypeFromJSON`
    
    ### Why are the changes needed?
    to unify the schema handling
    
    ### Does this PR introduce _any_ user-facing change?
    No, minor refactoring
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47415 from zhengruifeng/simplfy_from_json.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    4d691c9 View commit details
    Browse the repository at this point in the history
  151. [SPARK-48938][PYTHON] Improve error messages when registering Python …

    …UDTFs
    
    ### What changes were proposed in this pull request?
    
    This PR improves the error messages when registering Python UDTFs.
    Before this PR:
    ```python
    class TestUDTF:
       ...
    
    spark.udtf.register("test_udtf", TestUDTF)
    ```
    This fails with
    ```
    AttributeError: type object "TestUDTF" has no attribute "evalType"
    ```
    After this PR:
    ```python
    spark.udtf.register("test_udtf", TestUDTF)
    ```
    Now we have a nicer error:
    ```
    [CANNOT_REGISTER_UDTF] Cannot register the UDTF 'test_udtf': expected a 'UserDefinedTableFunction'. Please make sure the UDTF is correctly defined as a class, and then either wrap it in the `udtf()` function or annotate it with `udtf(...)`.`
    ```
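
    A minimal sketch of the intended usage (the class body here is illustrative): wrap the class with `udtf()` before registering it.

    ```python
    from pyspark.sql.functions import udtf

    @udtf(returnType="word: string")
    class TestUDTF:
        def eval(self, text: str):
            for word in text.split(" "):
                yield (word,)

    spark.udtf.register("test_udtf", TestUDTF)
    spark.sql("SELECT * FROM test_udtf('hello world')").show()
    ```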
    
    ### Why are the changes needed?
    
    To improve usability.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing and new unit tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47408 from allisonwang-db/spark-48938-udtf-register-err-msg.
    
    Authored-by: allisonwang-db <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    allisonwang-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    adba41e View commit details
    Browse the repository at this point in the history
  152. [SPARK-48592][INFRA] Add structured logging style script and GitHub w…

    …orkflow
    
    ### What changes were proposed in this pull request?
    This PR adds a check for Scala logging messages that use logInfo, logWarning, or logError and contain variables without the MDC wrapper.
    
    Example error output:
    ```
    [error] spark/mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:225:4
    [error] Logging message should use Structured Logging Framework style, such as log"...${MDC(TASK_ID, taskId)..."
                    Refer to the guidelines in the file `internal/Logging.scala`.
    ```
    
    ### Why are the changes needed?
    This makes development and PR review of the structured logging migration easier.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Manual test, verified it will throw errors on invalid logging messages.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47239 from asl3/structuredlogstylescript.
    
    Authored-by: Amanda Liu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    asl3 authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    4d7d1d9 View commit details
    Browse the repository at this point in the history
  153. [SPARK-48954] try_mod() replaces try_remainder()

    ### What changes were proposed in this pull request?
    
    For consistency, try_remainder() gets renamed to try_mod().
    This change is Spark 4.0.0 only, so no config is needed.
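
    A minimal usage sketch (the expected output is an assumption based on the try_* semantics: a zero divisor returns NULL instead of failing):

    ```python
    # try_mod(dividend, divisor): NULL on a zero divisor, the remainder otherwise.
    spark.sql("SELECT try_mod(7, 3) AS a, try_mod(7, 0) AS b").show()
    # +---+----+
    # |  a|   b|
    # +---+----+
    # |  1|NULL|
    # +---+----+
    ```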
    
    ### Why are the changes needed?
    
    To keep consistent naming.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, replaces try_remainder() with try_mod()
    
    ### How was this patch tested?
    
    Existing try_remainder() tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47427 from srielau/SPARK-48954-try-mod.
    
    Authored-by: Serge Rielau <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    srielau authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    a243921 View commit details
    Browse the repository at this point in the history
  154. [SPARK-48891][SS] Refactor StateSchemaCompatibilityChecker to unify a…

    …ll state schema formats
    
    ### What changes were proposed in this pull request?
    Refactor StateSchemaCompatibilityChecker to unify all state schema formats
    
    ### Why are the changes needed?
    Needed to integrate future changes around the state data source reader and schema evolution, and to consolidate these changes:
    
    - Consolidates all state schema reader/writers in one place
    - Consolidates all validation logic through the same API
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added unit tests
    
    ```
    12:38:45.481 WARN org.apache.spark.sql.execution.streaming.state.StateSchemaCompatibilityCheckerSuite:
    
    ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.StateSchemaCompatibilityCheckerSuite, threads: rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), ForkJoinPool.commonPool-worker-2 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true) =====
    [info] Run completed in 12 seconds, 565 milliseconds.
    [info] Total number of tests run: 30
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 30, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47359 from anishshri-db/task/SPARK-48891.
    
    Authored-by: Anish Shrigondekar <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    anishshri-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b8ce429 View commit details
    Browse the repository at this point in the history
  155. [SPARK-48849][SS]Create OperatorStateMetadataV2 for the TransformWith…

    …StateExec operator
    ericm-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    1130d69 View commit details
    Browse the repository at this point in the history
  156. [SPARK-48955][SQL] ArrayCompact's datatype should be `containsNull …

    …= false`
    
    ### What changes were proposed in this pull request?
    `ArrayCompact`'s datatype should be `containsNull = false`
    
    ### Why are the changes needed?
    `ArrayCompact` - Removes null values from the array
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added test
    
    before:
    ```
    scala> val df = spark.range(1).select(lit(Array(1,2,3)).alias("a"))
    val df: org.apache.spark.sql.DataFrame = [a: array<int>]
    
    scala> df.printSchema
    warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
    root
     |-- a: array (nullable = false)
     |    |-- element: integer (containsNull = true)
    
    scala> df.select(array_compact(col("a"))).printSchema
    warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
    root
     |-- array_compact(a): array (nullable = false)
     |    |-- element: integer (containsNull = true)
    ```
    
    after
    ```
    scala> df.select(array_compact(col("a"))).printSchema
    warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
    root
     |-- array_compact(a): array (nullable = false)
     |    |-- element: integer (containsNull = false)
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47430 from zhengruifeng/sql_array_compact_data_type.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    8a13602 View commit details
    Browse the repository at this point in the history
  157. [MINOR][PYTHON] Fix type hint for from_utc_timestamp and `to_utc_ti…

    …mestamp`
    
    ### What changes were proposed in this pull request?
    Fix type hint for `from_utc_timestamp` and `to_utc_timestamp`
    
    ### Why are the changes needed?
    The str type input should be treated as a literal string instead of a column name, as sketched below.
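
    A short sketch of the clarified behavior: the second argument is a literal timezone string, while the first string argument names a column.

    ```python
    from pyspark.sql import functions as F

    df = spark.createDataFrame([("1997-02-28 10:30:00",)], ["t"])
    # "t" is resolved as a column name; "America/Los_Angeles" is a literal timezone string.
    df.select(F.from_utc_timestamp("t", "America/Los_Angeles")).show()
    ```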
    
    ### Does this PR introduce _any_ user-facing change?
    doc change
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47429 from zhengruifeng/py_fix_hint_202407.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    039148f View commit details
    Browse the repository at this point in the history
  158. [MINOR][DOCS] Fix some typos in LZFBenchmark

    ### What changes were proposed in this pull request?
    
    This PR aims to fix some typos in `LZFBenchmark`.
    
    ### Why are the changes needed?
    
    Fix typos and avoid confusion.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47435 from wayneguow/lzf.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    d3b85eb View commit details
    Browse the repository at this point in the history
  159. [SPARK-48962][INFRA] Make the input parameters of `workflows/benchmar…

    …k` selectable
    
    ### What changes were proposed in this pull request?
    The pr aims to make the `input parameters` of `workflows/benchmark` selectable.
    
    ### Why are the changes needed?
    - Before:
      <img width="311" alt="image" src="https://github.com/user-attachments/assets/da93ea8f-8791-4816-a5d9-f82c018fa819">
    
    - After:
      https://github.com/panbingkun/spark/actions/workflows/benchmark.yml
      <img width="318" alt="image" src="https://github.com/user-attachments/assets/0b9b01a0-96f6-4630-98d9-7d2709aafcd0">
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. It is convenient for developers running `workflows/benchmark`: the input values change from free `text` to `selectable values`.
    
    ### How was this patch tested?
    Manually test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47438 from panbingkun/improve_workflow_dispatch.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    0fd23bf View commit details
    Browse the repository at this point in the history
  160. [SPARK-48941][PYTHON][ML] Replace RDD read / write API invocation wit…

    …h Dataframe read / write API
    
    ### What changes were proposed in this pull request?
    
    PySpark ML: replace RDD read/write API invocations with the DataFrame read/write API.
    
    ### Why are the changes needed?
    
    Follow-up of apache#47341
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47411 from WeichenXu123/SPARK-48909-follow-up.
    
    Authored-by: Weichen Xu <[email protected]>
    Signed-off-by: Weichen Xu <[email protected]>
    WeichenXu123 authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    20064b8 View commit details
    Browse the repository at this point in the history
  161. [SPARK-48959][SQL] Make NoSuchNamespaceException extend `NoSuchData…

    …baseException` to restore the exception handling
    
    ### What changes were proposed in this pull request?
    Make `NoSuchNamespaceException` extend `NoSuchDatabaseException`.
    
    ### Why are the changes needed?
    1, apache#47276 made many SQL commands throw `NoSuchNamespaceException` instead of `NoSuchDatabaseException`. This is more than an end-user-facing change; it is a breaking change that breaks exception handling in third-party libraries in the ecosystem.
    
    2, `NoSuchNamespaceException` and `NoSuchDatabaseException` actually share the same error class `SCHEMA_NOT_FOUND`
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47433 from zhengruifeng/make_nons_nodb.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    df3885d View commit details
    Browse the repository at this point in the history
  162. [SPARK-48963][INFRA] Support JIRA_ACCESS_TOKEN in translate-contrib…

    …utors.py
    
    ### What changes were proposed in this pull request?
    
    Support JIRA_ACCESS_TOKEN in translate-contributors.py
    
    ### Why are the changes needed?
    
    Remove the plaintext password in the JIRA_PASSWORD environment variable to prevent password leakage.

    ### Does this PR introduce _any_ user-facing change?
    No, infra only.
    
    ### How was this patch tested?
    
    Ran translate-contributors.py with 3.5.2 RC
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47440 from yaooqinn/SPARK-48963.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    4b60813 View commit details
    Browse the repository at this point in the history
  163. [SPARK-48958][BUILD] Upgrade zstd-jni to 1.5.6-4

    ### What changes were proposed in this pull request?
    The pr aims to upgrade `zstd-jni` from `1.5.6-3` to `1.5.6-4`.
    
    ### Why are the changes needed?
    - v1.5.6-3 vs v1.5.6-4:
    luben/zstd-jni@v1.5.6-3...v1.5.6-4
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47432 from panbingkun/SPARK-48958.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f042b7d View commit details
    Browse the repository at this point in the history
  164. a working version, draft

    jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    7898b72 View commit details
    Browse the repository at this point in the history