State source value #4

Closed
wants to merge 169 commits into from
Conversation

jingz-db
Owner

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

anishshri-db and others added 30 commits July 5, 2024 11:24
…urces-parquet.md` doc

### What changes were proposed in this pull request?

This PR aims to update parquet version in `sql-data-sources-parquet.md` doc.

### Why are the changes needed?

To keep the doc consistent with the Parquet version in the project dependencies.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA and manually confirmed that the new links can be opened.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47242 from wayneguow/SPARK-48177.

Authored-by: Wei Guo <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
…TIES ...` in v1 and v2

### What changes were proposed in this pull request?
The PR aims to:
- align the command `ALTER TABLE ... UNSET TBLPROPERTIES ...` in v1 and v2.
(This means that in v1, regardless of whether `IF EXISTS` is specified, unsetting a `non-existent` property is now `ignored` and no longer `fails`.)
- update the description of `ALTER TABLE ... UNSET TBLPROPERTIES ...` in the doc `docs/sql-ref-syntax-ddl-alter-table.md`.
- unify the v1 and v2 `ALTER TABLE ... UNSET TBLPROPERTIES ...` tests.
- add the following `scenarios` for `ALTER TABLE ... SET TBLPROPERTIES ...` testing:
A. `table to alter does not exist`
B. `alter table set reserved properties`

### Why are the changes needed?
- align the command `ALTER TABLE ... UNSET TBLPROPERTIES ...` in v1 and v2, to avoid confusing end-users.
- improve test coverage.
- align with other similar tests, e.g. `AlterTableSetTblProperties*`.

### Does this PR introduce _any_ user-facing change?
Yes. In v1, regardless of whether `IF EXISTS` is specified, unsetting a `non-existent` property is now `ignored` and no longer `fails`.
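
For illustration, a minimal sketch of the aligned behavior (table and property names are made up):

```scala
// Minimal sketch of the aligned v1 behavior; table/property names are made up.
spark.sql("CREATE TABLE t (id INT) USING parquet TBLPROPERTIES ('k1' = 'v1')")
// Before this change, v1 failed here without IF EXISTS because 'missing' does not exist;
// after this change the non-existent key is simply ignored, matching v2.
spark.sql("ALTER TABLE t UNSET TBLPROPERTIES ('k1', 'missing')")
```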

### How was this patch tested?
Update some UT & Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47097 from panbingkun/alter_unset_table.

Authored-by: panbingkun <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
### What changes were proposed in this pull request?

The listener test in `ClientStreamingQuerySuite` is flaky.

For client side listeners, the terminated events might take a while to arrive at the client. This test is currently flaky; example: https://github.com/anishshri-db/spark/actions/runs/9785389228/job/27018350836

This PR tries to deflake it by waiting for a longer time.

### Why are the changes needed?

Deflake test

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Test only change

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47205 from WweiL/deflake-listener-client-scala.

Authored-by: Wei Liu <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
…causes batch with no files to be processed

### What changes were proposed in this pull request?

This is a follow-up to a bug identified in apache#45362. When setting `maxCachedFiles` to 0 (to force a full relisting of files for each batch, see https://issues.apache.org/jira/browse/SPARK-44924), subsequent batches of files would be skipped due to a logic error that carried forward an empty array of `unreadFiles`, which was only being null-checked. This update adds checks to verify that `unreadFiles` is also non-empty as a guard condition to prevent batches executing with no files, and checks to ensure that `unreadFiles` is only set if (a) there are files remaining in the listing and (b) `maxCachedFiles` is greater than 0.
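
A minimal sketch of the guard condition described above, with illustrative names (not the actual `FileStreamSource` code):

```scala
// Hedged sketch of the guard described above; names are illustrative, not the actual
// FileStreamSource implementation.
def filesForBatch(
    unreadFiles: Option[Seq[String]],
    maxCachedFiles: Int,
    listAllFiles: () => Seq[String]): (Seq[String], Option[Seq[String]]) = {
  // Relist when there is no cached leftover OR the leftover is empty; the original code
  // only null-checked, so an empty carried-over array produced a batch with no files.
  val candidates = unreadFiles.filter(_.nonEmpty).getOrElse(listAllFiles())
  val (batch, remaining) = candidates.splitAt(candidates.length min 1000) // illustrative batch cap
  // Cache leftovers only when something remains AND caching is enabled (maxCachedFiles > 0).
  (batch, Some(remaining).filter(r => r.nonEmpty && maxCachedFiles > 0))
}
```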

### Why are the changes needed?

Setting the `maxCachedFiles` configuration to 0 would inadvertently cause every other batch to contain 0 files, which is an unexpected behavior for users.

### Does this PR introduce _any_ user-facing change?

Fixes the case where users may want to always perform a full listing of files each batch by setting `maxCachedFiles` to 0

### How was this patch tested?

New test added to verify `maxCachedFiles` set to 0 would perform a file listing each batch

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47195 from ragnarok56/filestreamsource-maxcachedfiles-edgecase.

Lead-authored-by: ragnarok56 <[email protected]>
Co-authored-by: Kevin Nacios <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
…t fail if the session is already closed by the server

### What changes were proposed in this pull request?

Improve the error handling of the `stop()` API in the `SparkSession`
class to not throw if there is any error related to releasing a session or
closing the underlying gRPC channel. Both are best effort.

In the case of PySpark, do not fail if the local Spark Connect service
cannot be stopped.
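
A hedged sketch of the best-effort pattern described above (illustrative names; not the actual `SparkSession.stop()` implementation):

```scala
// Hedged sketch of the best-effort pattern; names are illustrative, not the actual
// SparkSession.stop() implementation.
import scala.util.control.NonFatal

def bestEffort(label: String)(action: => Unit): Unit =
  try action catch {
    case NonFatal(e) => println(s"Ignoring failure while $label: ${e.getMessage}")
  }

def stop(releaseSession: () => Unit, closeChannel: () => Unit): Unit = {
  bestEffort("releasing the session")(releaseSession())   // server may have already closed it
  bestEffort("closing the gRPC channel")(closeChannel())  // swallowing errors keeps stop() idempotent
}
```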

### Why are the changes needed?

In some cases, the Spark Connect service will terminate the session, usually
because the underlying cluster or driver has restarted.
In these cases, calling stop() throws an error which is not actionable. However,
stop() still needs to be called in order to reset the active session.

Further, the stop() API should be idempotent.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Attached unit tests.

Confirmed that removing the code changes results in the
tests failing (as expected).

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47215 from nija-at/session-stop.

Authored-by: Niranjan Jayakar <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
 Simplify `percentile` functions

### Why are the changes needed?
The existing implementations are unnecessarily complicated.

### Does this PR introduce _any_ user-facing change?
No, minor refactoring

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47225 from zhengruifeng/func_refactor_1.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…Spark docstrings

### What changes were proposed in this pull request?

This PR unifies the 'See Also' section formatting across PySpark docstrings and fixes some invalid references.

### Why are the changes needed?

To improve PySpark documentation

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

doctest

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47240 from allisonwang-db/spark-48825-also-see-docs.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?
A previous [PR](apache#46665) introduced parser changes for SQL scripting. This PR is a follow-up that introduces the interpreter for the SQL scripting language and proposes the following changes:
- `SqlScriptingExecutionNode` - introduces execution nodes for SQL scripting, used during interpretation phase:
  - `SingleStatementExec` - executable node for `SingleStatement` logical node; wraps logical plan of the single statement.
  - `CompoundNestedStatementIteratorExec` - implements base recursive iterator logic for all nesting statements.
  - `CompoundBodyExec` - concrete implementation of `CompoundNestedStatementIteratorExec` for `CompoundBody` logical node.
- `SqlScriptingInterpreter` - introduces the interpreter for SQL scripts. The product of interpretation is an iterator over the statements that should be executed.

Follow-up PRs will introduce further statements, support for exceptions thrown from parser/interpreter, exception handling in SQL, etc.
More details can be found in [Jira item](https://issues.apache.org/jira/browse/SPARK-48343) for this task and its parent (where the design doc is uploaded as well).

### Why are the changes needed?
The intent is to add support for SQL scripting (and stored procedures down the line). It gives users the ability to develop complex logic and ETL entirely in SQL.

Until now, users had to write verbose SQL statements or combine SQL + Python to efficiently express the logic. This is an effort to bridge that gap and enable complex logic to be written entirely in SQL.

### Does this PR introduce _any_ user-facing change?
No.
This PR is the second in a series of PRs that will introduce changes to the sql() API to add support for SQL scripting, but for now, the API remains unchanged.
In the future, the API will remain the same as well, but it will additionally be able to execute SQL scripts.

### How was this patch tested?
There are tests for the newly introduced changes:
- `SqlScriptingExecutionNodeSuite` - unit tests for execution nodes.
- `SqlScriptingInterpreterSuite` - tests for interpreter (with parser integration).

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47026 from davidm-db/sql_scripting_interpreter.

Authored-by: David Milicevic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

In this pull request I propose to change the default ISO pattern we use for formatting timestamps when writing to JSON, XML and/or CSV, as well as when `to_(json|xml|csv)` is used.

Older timestamps sometimes have offsets that contain a seconds part as well. The current default formatting omits those seconds, hence producing wrong results.

e.g.
```
sql("SET spark.sql.session.timeZone=America/Los_Angeles")
sql("SELECT to_json(struct(CAST('1800-01-01T00:00:00+00:00' AS TIMESTAMP) AS ts))").show(false)
{"ts":"1799-12-31T16:07:02.000-07:52"}
```

### Why are the changes needed?

This is a correctness issue.

### Does this PR introduce _any_ user-facing change?

Yes, users will now see different results for older timestamps (correct ones).

### How was this patch tested?

Tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47177 from milastdbx/dev/milast/fixJsonTimestampHandling.

Authored-by: milastdbx <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…on-based profiling

### What changes were proposed in this pull request?

Introduces `spark.profile.render` for SparkSession-based profiling.

It uses [`flameprof`](https://github.com/baverman/flameprof/) for the default renderer.

```
$ pip install flameprof
```

Run `pyspark` in a Jupyter notebook:

```py
from pyspark.sql.functions import pandas_udf

spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

df = spark.range(10)

@pandas_udf("long")
def add1(x):
    return x + 1

added = df.select(add1("id"))
added.show()

spark.profile.render(id=2)
```

<img width="1103" alt="pyspark-udf-profile" src="https://github.com/apache/spark/assets/506656/795972e8-f7eb-4b89-89fc-3d8d18b86541">

On the CLI, it will return the SVG source string.

```py
'<?xml version="1.0" standalone="no"?>\n<!DOCTYPE svg  ...
```

Currently only `renderer="flameprof"` for `type="perf"` is supported as a builtin renderer.

You can also pass an arbitrary renderer.

```py
def render_perf(stats):
    ...
spark.profile.render(id=2, type="perf", renderer=render_perf)

def render_memory(codemap):
    ...
spark.profile.render(id=2, type="memory", renderer=render_memory)
```

### Why are the changes needed?

Better debuggability.

### Does this PR introduce _any_ user-facing change?

Yes, `spark.profile.render` will be available.

### How was this patch tested?

Added/updated the related tests, and manually.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47202 from ueshin/issues/SPARK-48798/render.

Authored-by: Takuya Ueshin <[email protected]>
Signed-off-by: Takuya Ueshin <[email protected]>
### What changes were proposed in this pull request?
The PR aims to eliminate warnings for pandas: `<string>:5: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.`

### Why are the changes needed?
Only to eliminate warnings for pandas.
https://github.com/panbingkun/spark/actions/runs/9795675050/job/27048513673
<img width="856" alt="image" src="https://github.com/apache/spark/assets/15246973/ea70e922-897e-450f-b150-3d38d7f20930">

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47222 from panbingkun/remove_pandas_warning.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request?

We can eliminate the use of mutable.ArrayBuffer by using `flatMap`.
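
A minimal illustration of the simplification described above (not the actual CTE resolution code):

```scala
// Before: accumulate results into a mutable buffer.
val buffer = scala.collection.mutable.ArrayBuffer.empty[Int]
Seq(Seq(1, 2), Seq(3)).foreach(buffer ++= _)

// After: the same result with flatMap, with no mutable state.
val flattened = Seq(Seq(1, 2), Seq(3)).flatMap(identity)
```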

### Why are the changes needed?

Code simplification and optimization.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47185 from amaliujia/followup_cte.

Lead-authored-by: Rui Wang <[email protected]>
Co-authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

Introduces virtual column families to RocksDB. We attach a 2-byte id prefix as the column family identifier to each key row that is put into RocksDB. The encoding and decoding of the virtual column family prefix happens at the `RocksDBKeyEncoder` layer, where we can pre-allocate the extra 2 bytes and avoid an additional memcpy.

- Remove the physical column family related code, as it becomes potentially dead code until some caller starts using it.
- Remove `useColumnFamilies` from the `StateStoreChangelogV2` API.
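
A hedged sketch of the 2-byte prefix scheme described above (illustrative only; the real logic lives in the `RocksDBKeyEncoder` layer):

```scala
// Illustrative encoding of a virtual column family id in front of the key bytes.
def encodeWithColumnFamilyId(cfId: Short, key: Array[Byte]): Array[Byte] = {
  val out = new Array[Byte](2 + key.length)        // pre-allocate 2 extra bytes for the id
  out(0) = ((cfId >> 8) & 0xFF).toByte
  out(1) = (cfId & 0xFF).toByte
  System.arraycopy(key, 0, out, 2, key.length)     // single copy of the original key bytes
  out
}
```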

### Why are the changes needed?

Currently, within the scope of the arbitrary stateful API v2 (transformWithState) project, each state variable is stored inside one [physical column family](https://github.com/facebook/rocksdb/wiki/Column-Families) within the RocksDB state store instance. Column families are also used to implement secondary indexes for various features. Each physical column family has its own memtables, creates its own SST files, and handles compaction independently on those independent SST files.

When the number of operations to RocksDB is relatively small and the number of column families is relatively large, the overhead of handling small SST files becomes high, especially since all of these have to be uploaded to the snapshot dir and referenced in the metadata file for the uploaded RocksDB snapshot. Using a prefix to manage the different key spaces / virtual column families reduces such overhead.

### Does this PR introduce _any_ user-facing change?

No. If `useColumnFamilies` is set to true in `StateStore.init()`, virtual column families will be used.

### How was this patch tested?

Unit tests in `RocksDBStateStoreSuite`, and integration tests in `TransformWithStateSuite`.
Moved some test suites from `RocksDBSuite` into `RocksDBStateStoreSuite` because some previous verification functions have now moved into `RocksDBStateProvider`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47107 from jingz-db/virtual-col-family.

Lead-authored-by: jingz-db <[email protected]>
Co-authored-by: Jing Zhan <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
### What changes were proposed in this pull request?
The PR aims to upgrade `kubernetes-client` from `6.13.0` to `6.13.1`.

### Why are the changes needed?
- The full release notes: https://github.com/fabric8io/kubernetes-client/releases/tag/v6.13.1
- The newest version fixes some bugs, e.g.:
  Fix fabric8io/kubernetes-client#6059: Swallow rejected execution from internal usage of the informer executor
  Fix fabric8io/kubernetes-client#6068: KubernetesMockServer provides incomplete Configuration while creating test Config for KubernetesClient
  Fix fabric8io/kubernetes-client#6085: model getters have same annotations as fields (breaks native)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47206 from panbingkun/SPARK-48801.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… per transform, not once per row

### What changes were proposed in this pull request?

apache#11536 added a `binary` toggle parameter to `CountVectorizer`, but the parameter evaluation occurs inside the vectorizer UDF itself: this causes expensive parameter reading to occur once per row instead of once per transform.

This PR addresses this issue by updating the code to only read the parameter once, similar to what was already being done for the `minTf` parameter.
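
A hedged sketch of the pattern described above (illustrative names; not the actual `CountVectorizerModel` code):

```scala
// Read the parameter once per transform and let the per-row function close over the
// resulting plain value, instead of re-reading it for every row.
def makeVectorizer(readBinaryParam: () => Boolean): Seq[String] => Map[String, Double] = {
  val isBinary = readBinaryParam()                  // evaluated once per transform
  (tokens: Seq[String]) => {
    val counts = tokens.groupBy(identity).map { case (t, ts) => t -> ts.size.toDouble }
    if (isBinary) counts.map { case (t, _) => t -> 1.0 } else counts
  }
}
```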

### Why are the changes needed?

Address a performance issue.

I spotted this issue when I saw the stack

```scala
[...]
at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:204)
at scala.collection.IndexedSeqOptimized.exists(IndexedSeqOptimized.scala:49)
at scala.collection.IndexedSeqOptimized.exists$(IndexedSeqOptimized.scala:49)
at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:198)
at org.apache.spark.ml.param.Params.hasParam(params.scala:701)
at org.apache.spark.ml.param.Params.hasParam$(params.scala:700)
at org.apache.spark.ml.PipelineStage.hasParam(Pipeline.scala:42)
at org.apache.spark.ml.param.Params.shouldOwn(params.scala:856)
at org.apache.spark.ml.param.Params.get(params.scala:739)
at org.apache.spark.ml.param.Params.get$(params.scala:738)
at org.apache.spark.ml.PipelineStage.get(Pipeline.scala:42)
at org.apache.spark.ml.param.Params.getOrDefault(params.scala:759)
at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:757)
at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
at org.apache.spark.ml.param.Params.$(params.scala:766)
at org.apache.spark.ml.param.Params.$$(params.scala:766)
at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
at org.apache.spark.ml.feature.CountVectorizerModel.$anonfun$transform$1(CountVectorizer.scala:326)
at org.apache.spark.ml.feature.CountVectorizerModel$$Lambda$12153/1200761496.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Source)
[...]
```

while investigating an unrelated issue.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47258 from JoshRosen/CountVectorizer-conf.

Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… with ParquetWriteSupport

### What changes were proposed in this pull request?

As a kind of follow-up to apache#44275, this PR aligns two similar code paths with different error messages into one.

```java
24/07/03 16:29:01 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: [INTERNAL_ERROR] Unsupported data type VarcharType(64). SQLSTATE: XX000
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
```

```java
org.apache.spark.SparkUnsupportedOperationException: VarcharType(64) is not supported yet.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.dataTypeUnsupportedYetError(QueryExecutionErrors.scala:993)
	at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.newConverter(OrcSerializer.scala:209)
	at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.$anonfun$converters$2(OrcSerializer.scala:35)
	at scala.collection.immutable.List.map(List.scala:247)
```

### Why are the changes needed?

improvement

### Does this PR introduce _any_ user-facing change?

No, users shouldn't face such errors in regular cases.
### How was this patch tested?

passing existing tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#47208 from yaooqinn/SPARK-48803.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This PR makes additional Scala logging migrations to comply with the scala style changes in apache#46947

### Why are the changes needed?
This makes development and PR review of the structured logging migration easier.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Tested by ensuring dev/scalastyle checks pass

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47256 from asl3/morestructuredloggingmigrations.

Authored-by: Amanda Liu <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
…From check for output committer class configurations

### What changes were proposed in this pull request?

This pull request proposes adding a checker for the class values provided by users in `spark.sql.sources.outputCommitterClass` and `spark.sql.parquet.output.committer.class`, to make sure the given class is visible on the classpath and is a subclass of `org.apache.hadoop.mapreduce.OutputCommitter`.
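
A hedged sketch of the kind of eager validation described above (illustrative; the actual check is wired into the SQL configuration handling, not a standalone helper):

```scala
import org.apache.hadoop.mapreduce.OutputCommitter

def validateCommitterClass(className: String): Unit = {
  val cls = Class.forName(className)                // fails fast if the class is not visible
  require(classOf[OutputCommitter].isAssignableFrom(cls),
    s"$className must be a subclass of org.apache.hadoop.mapreduce.OutputCommitter")
}
```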

### Why are the changes needed?

Ensure that an invalid configuration results in immediate application or query failure rather than failing late during `setupJob`.

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#47209 from yaooqinn/SPARK-48804.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
…eness` for large query plans

### What changes were proposed in this pull request?

This PR rewrites `LogicalPlanIntegrity.hasUniqueExprIdsForOutput` to traverse the query plan only once and to avoid expensive Scala collection operations like `.flatten`, `.groupBy`, and `.distinct`.
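
A hedged sketch of the single-pass idea (illustrative; not the actual `LogicalPlanIntegrity` code):

```scala
// Track ids in one mutable set instead of building intermediate collections with
// flatten/groupBy/distinct; stops at the first duplicate.
def allIdsUnique(idsPerOperator: Seq[Seq[Long]]): Boolean = {
  val seen = scala.collection.mutable.HashSet.empty[Long]
  idsPerOperator.forall(_.forall(seen.add))
}
```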

### Why are the changes needed?

Speeds up query compilation when plan validation is enabled.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Made sure existing UTs pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47170 from kelvinjian-db/SPARK-48771-speed-up.

Authored-by: Kelvin Jiang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

Directly call `IntervalUtils.castStringToDTInterval/castStringToYMInterval` instead of creating Cast expressions to evaluate.

- Benchmarks indicated a 10% time-saving.
- Bad record recording might not work if the cast handles the exceptions early

### Why are the changes needed?

- perf improvement
- Bugfix for bad record recording

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

passing existing tests and benchmark tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#47227 from yaooqinn/SPARK-48816.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
### What changes were proposed in this pull request?

This PR aims to upgrade `fasterxml.jackson` from 2.17.1 to 2.17.2.

### Why are the changes needed?

There are some bug fixes about [Databind](https://github.com/FasterXML/jackson-databind):
[apache#4561](FasterXML/jackson-databind#4561): Issues using jackson-databind 2.17.1 with Reactor (wrt DeserializerCache and ReentrantLock)
[apache#4575](FasterXML/jackson-databind#4575): StdDelegatingSerializer does not consider a Converter that may return null for a non-null input
[apache#4577](FasterXML/jackson-databind#4577): Cannot deserialize value of type java.math.BigDecimal from String "3." (not a valid representation)
[apache#4595](FasterXML/jackson-databind#4595): No way to explicitly disable wrapping in custom annotation processor
[apache#4607](FasterXML/jackson-databind#4607): MismatchedInput: No Object Id found for an instance of X to assign to property 'id'
[apache#4610](FasterXML/jackson-databind#4610): DeserializationFeature.FAIL_ON_UNRESOLVED_OBJECT_IDS does not work when used with Polymorphic type handling

The full release note of 2.17.2:
https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.17.2

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47241 from wayneguow/upgrade_jackson.

Authored-by: Wei Guo <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
…ee_disk_space_container`

### What changes were proposed in this pull request?
This PR removes the check for the existence of `./dev/free_disk_space_container` before execution, because `./dev/free_disk_space_container` has already been backported to branch-3.4 and branch-3.5 through apache#45624 and apache#43381, so there is no need to check for its existence before execution.

### Why are the changes needed?
Remove unnecessary existence check for `./dev/free_disk_space_container`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47263 from LuciferYang/SPARK-48840.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…en getStruct returns null

### What changes were proposed in this pull request?

The getStruct() method used in `MergingSessionsIterator.initialize` could return a null value. When that happens, the copy() called on it throws a NullPointerException.

We see an exception thrown there:
```
java.lang.NullPointerException: <Redacted Exception Message>
	at org.apache.spark.sql.execution.aggregate.MergingSessionsIterator.initialize(MergingSessionsIterator.scala:121)
	at org.apache.spark.sql.execution.aggregate.MergingSessionsIterator.<init>(MergingSessionsIterator.scala:130)
	at org.apache.spark.sql.execution.aggregate.MergingSessionsExec.$anonfun$doExecute$1(MergingSessionsExec.scala:93)
	at org.apache.spark.sql.execution.aggregate.MergingSessionsExec.$anonfun$doExecute$1$adapted(MergingSessionsExec.scala:72)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:920)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:920)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:189)
	at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:154)
	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:129)
	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:148)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.scheduler.Task.run(Task.scala:101)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:984)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:105)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:987)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:879)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

```

It is still not clear why that field could be null, but in general Spark should not throw NPEs. So this PR proposes to wrap it with SparkException.internalError with more details.
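
A minimal sketch of the null guard (illustrative names; the PR itself wraps the error via `SparkException.internalError` inside `MergingSessionsIterator.initialize`):

```scala
def requireNonNullGroupKey[T <: AnyRef](value: T): T = {
  if (value == null) {
    // Surface a descriptive internal error instead of letting copy() throw a bare NPE later.
    throw new IllegalStateException(
      "getStruct returned null while initializing the merging sessions iterator")
  }
  value
}
```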

### Why are the changes needed?

Improvements.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This is a hard-to-repro issue. The change should not cause any harm.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47134 from WweiL/SPARK-48743-mergingSessionIterator-null-init.

Authored-by: Wei Liu <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
The PR aims to fix some typos in several docs, including `docs/sql-ref-syntax-qry-star.md`, `docs/running-on-kubernetes.md` and `connector/profiler/README.md`.

### Why are the changes needed?
https://spark.apache.org/docs/4.0.0-preview1/sql-ref-syntax-qry-star.html
In some `sql examples` in the doc `docs/sql-ref-syntax-qry-star.md`, the Unicode character 'SINGLE QUOTATION MARK' was used, which left end-users unable to execute them successfully after copy-paste, e.g.:
<img width="660" alt="image" src="https://github.com/apache/spark/assets/15246973/055aa0a8-602e-4ea7-a065-c8e0353c6fb3">

### Does this PR introduce _any_ user-facing change?
Yes, end-users will get more user-friendly docs.

### How was this patch tested?
Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47261 from panbingkun/fix_typo_docs.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…with Spark Classic

### What changes were proposed in this pull request?

I think there are two issues regarding the default column name of `cast`:
1. It seems unclear when the name is the input column versus `CAST(...)`, e.g. in Spark Classic:
```
scala> spark.range(1).select(col("id").cast("string"), lit(1).cast("string"), col("id").cast("long"), lit(1).cast("long")).printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
 |-- id: string (nullable = false)
 |-- CAST(1 AS STRING): string (nullable = false)
 |-- id: long (nullable = false)
 |-- CAST(1 AS BIGINT): long (nullable = false)
```

2. The column name is not consistent between Spark Connect and Spark Classic.

This PR aims to resolve the second issue, that is, making the default column name of `cast` compatible with Spark Classic, by comparing with the classic implementation:
https://github.com/apache/spark/blob/9cf6dc873ff34412df6256cdc7613eed40716570/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L1208-L1212

### Why are the changes needed?
The default column name is not consistent with Spark Classic.

### Does this PR introduce _any_ user-facing change?
yes,

spark classic:
```
In [2]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
+-------------------------+-------------------+-------------------+-------------------+
|CAST(X'313233' AS STRING)|CAST(123 AS STRING)|CAST(123 AS BIGINT)|CAST(123 AS DOUBLE)|
+-------------------------+-------------------+-------------------+-------------------+
|                      123|                123|                123|              123.0|
+-------------------------+-------------------+-------------------+-------------------+
```

spark connect (before):
```
In [3]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
+---------+---+---+-----+
|X'313233'|123|123|  123|
+---------+---+---+-----+
|      123|123|123|123.0|
+---------+---+---+-----+
```

spark connect (after):
```
In [2]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
+-------------------------+-------------------+-------------------+-------------------+
|CAST(X'313233' AS STRING)|CAST(123 AS STRING)|CAST(123 AS BIGINT)|CAST(123 AS DOUBLE)|
+-------------------------+-------------------+-------------------+-------------------+
|                      123|                123|                123|              123.0|
+-------------------------+-------------------+-------------------+-------------------+
```

### How was this patch tested?
added test

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#47249 from zhengruifeng/py_fix_cast.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
anishshri-db and others added 11 commits July 22, 2024 10:06
…ll state schema formats

### What changes were proposed in this pull request?
Refactor StateSchemaCompatibilityChecker to unify all state schema formats

### Why are the changes needed?
Needed to integrate future changes around the state data source reader and schema evolution, and to consolidate these changes:

- Consolidates all state schema reader/writers in one place
- Consolidates all validation logic through the same API

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests

```
12:38:45.481 WARN org.apache.spark.sql.execution.streaming.state.StateSchemaCompatibilityCheckerSuite:

===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.StateSchemaCompatibilityCheckerSuite, threads: rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), ForkJoinPool.commonPool-worker-2 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true) =====
[info] Run completed in 12 seconds, 565 milliseconds.
[info] Total number of tests run: 30
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 30, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47359 from anishshri-db/task/SPARK-48891.

Authored-by: Anish Shrigondekar <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
…= false`

### What changes were proposed in this pull request?
`ArrayCompact`'s data type should have `containsNull = false`.

### Why are the changes needed?
`ArrayCompact` - Removes null values from the array

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added test

before:
```
scala> val df = spark.range(1).select(lit(Array(1,2,3)).alias("a"))
val df: org.apache.spark.sql.DataFrame = [a: array<int>]

scala> df.printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
 |-- a: array (nullable = false)
 |    |-- element: integer (containsNull = true)

scala> df.select(array_compact(col("a"))).printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
 |-- array_compact(a): array (nullable = false)
 |    |-- element: integer (containsNull = true)
```

after
```
scala> df.select(array_compact(col("a"))).printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
 |-- array_compact(a): array (nullable = false)
 |    |-- element: integer (containsNull = false)
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47430 from zhengruifeng/sql_array_compact_data_type.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…mestamp`

### What changes were proposed in this pull request?
Fix type hint for `from_utc_timestamp` and `to_utc_timestamp`

### Why are the changes needed?
The `str` type input should be treated as a literal string, instead of a column name.

### Does this PR introduce _any_ user-facing change?
doc change

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47429 from zhengruifeng/py_fix_hint_202407.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

This PR aims to fix some typos in `LZFBenchmark`.

### Why are the changes needed?

Fix typos and avoid confusion.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47435 from wayneguow/lzf.

Authored-by: Wei Guo <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…k` selectable

### What changes were proposed in this pull request?
The PR aims to make the `input parameters` of `workflows/benchmark` selectable.

### Why are the changes needed?
- Before:
  <img width="311" alt="image" src="https://github.com/user-attachments/assets/da93ea8f-8791-4816-a5d9-f82c018fa819">

- After:
  https://github.com/panbingkun/spark/actions/workflows/benchmark.yml
  <img width="318" alt="image" src="https://github.com/user-attachments/assets/0b9b01a0-96f6-4630-98d9-7d2709aafcd0">

### Does this PR introduce _any_ user-facing change?
Yes. It is convenient for developers to run `workflows/benchmark`, transforming input values from free `text` only to `selectable values`.

### How was this patch tested?
Manually test.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47438 from panbingkun/improve_workflow_dispatch.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…h Dataframe read / write API

### What changes were proposed in this pull request?

PySpark ML: Replace RDD read / write API invocations with the Dataframe read / write API.

### Why are the changes needed?

Follow-up of apache#47341

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47411 from WeichenXu123/SPARK-48909-follow-up.

Authored-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
…baseException` to restore the exception handling

### What changes were proposed in this pull request?
Make `NoSuchNamespaceException` extend `NoSuchDatabaseException`

### Why are the changes needed?
1. apache#47276 made many SQL commands throw `NoSuchNamespaceException` instead of `NoSuchDatabaseException`. This is more than an end-user facing change; it is a breaking change which breaks the exception handling in third-party libraries in the ecosystem.

2. `NoSuchNamespaceException` and `NoSuchDatabaseException` actually share the same error class, `SCHEMA_NOT_FOUND`.
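
A simplified sketch (illustrative class definitions, not the actual Spark ones) of why the inheritance restores existing third-party handlers:

```scala
// Illustrative hierarchy only; the real classes carry error classes and message parameters.
class NoSuchDatabaseException(msg: String) extends Exception(msg)
class NoSuchNamespaceException(msg: String) extends NoSuchDatabaseException(msg)

// Code written against the old behavior keeps working:
try {
  throw new NoSuchNamespaceException("[SCHEMA_NOT_FOUND] The schema `db1` cannot be found.")
} catch {
  case e: NoSuchDatabaseException => println(s"handled: ${e.getMessage}")
}
```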

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47433 from zhengruifeng/make_nons_nodb.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…utors.py

### What changes were proposed in this pull request?

Support JIRA_ACCESS_TOKEN in translate-contributors.py

### Why are the changes needed?

Remove the plaintext password in the JIRA_PASSWORD environment variable to prevent password leakage.

### Does this PR introduce _any_ user-facing change?
no, infra only

### How was this patch tested?

Ran translate-contributors.py with 3.5.2 RC

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#47440 from yaooqinn/SPARK-48963.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
The PR aims to upgrade `zstd-jni` from `1.5.6-3` to `1.5.6-4`.

### Why are the changes needed?
1. v1.5.6-3 vs v1.5.6-4:
luben/zstd-jni@v1.5.6-3...v1.5.6-4

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47432 from panbingkun/SPARK-48958.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>