
State source value #4

Closed
wants to merge 169 commits into from

Commits on Jul 5, 2024

  1. 660448c
  2. Misc updates

    anishshri-db committed Jul 5, 2024
    0fc24fd
  3. Misc updates

    anishshri-db committed Jul 5, 2024
    416ebfa
  4. Fix test

    anishshri-db committed Jul 5, 2024
    a80c9b2
  5. Misc fix

    anishshri-db committed Jul 5, 2024
    2158e0f

Commits on Jul 22, 2024

  1. [SPARK-48177][BUILD][FOLLOWUP] Update parquet version in `sql-data-so…

    …urces-parquet.md` doc
    
    ### What changes were proposed in this pull request?
    
    This PR aims to update the parquet version in the `sql-data-sources-parquet.md` doc.
    
    ### Why are the changes needed?
    
    In order to keep it consistent with the version of parquet in the dependencies.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA and manually confirmed that the new links can be opened.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47242 from wayneguow/SPARK-48177.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    9a97ba1
  2. [SPARK-48720][SQL] Align the command `ALTER TABLE ... UNSET TBLPROPER…

    …TIES ...` in v1 and v2
    
    ### What changes were proposed in this pull request?
    The PR aims to:
    - align the command `ALTER TABLE ... UNSET TBLPROPERTIES ...` in v1 and v2
    (this means that in v1, regardless of whether `IF EXISTS` is specified, unsetting a `non-existent` property is now `ignored` and no longer `fails`).
    - update the description of `ALTER TABLE ... UNSET TBLPROPERTIES ...` in the doc `docs/sql-ref-syntax-ddl-alter-table.md`.
    - unify the v1 and v2 `ALTER TABLE ... UNSET TBLPROPERTIES ...` tests.
    - add the following `scenarios` for `ALTER TABLE ... SET TBLPROPERTIES ...` testing:
      A. `table to alter does not exist`
      B. `alter table set reserved properties`
    
    ### Why are the changes needed?
    - align the command `ALTER TABLE ... UNSET TBLPROPERTIES ...` in v1 and v2 to avoid confusing end-users.
    - improve test coverage.
    - align with other similar tests, e.g. `AlterTableSetTblProperties*`.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. In `v1`, regardless of whether `IF EXISTS` is specified, unsetting a `non-existent` property is now `ignored` and no longer `fails`.
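
    A minimal spark-shell sketch of the aligned behavior (table and property names here are made up, and a running `spark` session is assumed):

    ```scala
    // Hypothetical illustration of the aligned v1 behavior after this change.
    spark.sql("CREATE TABLE tbl (c1 INT) USING parquet TBLPROPERTIES ('k1' = 'v1')")
    spark.sql("ALTER TABLE tbl UNSET TBLPROPERTIES ('k1')")                      // removes k1
    spark.sql("ALTER TABLE tbl UNSET TBLPROPERTIES ('no_such_key')")             // now ignored, no longer fails
    spark.sql("ALTER TABLE tbl UNSET TBLPROPERTIES IF EXISTS ('no_such_key')")   // same result as above
    ```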
    
    ### How was this patch tested?
    Update some UT & Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47097 from panbingkun/alter_unset_table.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    bdaecdf
  3. [SPARK-48800][CONNECT][SS] Deflake ClientStreamingQuerySuite

    ### What changes were proposed in this pull request?
    
    The listener test in `ClientStreamingQuerySuite` is flaky.
    
    For client-side listeners, the terminated events might take a while before arriving at the client. This test is currently flaky; example: https://github.com/anishshri-db/spark/actions/runs/9785389228/job/27018350836
    
    This PR tries to deflake it by waiting for a longer time.
    
    ### Why are the changes needed?
    
    Deflake test
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Test only change
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47205 from WweiL/deflake-listener-client-scala.
    
    Authored-by: Wei Liu <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    WweiL authored and jingz-db committed Jul 22, 2024
    17a292d
  4. [SPARK-48802][SS][FOLLOWUP] FileStreamSource maxCachedFiles set to 0 …

    …causes batch with no files to be processed
    
    ### What changes were proposed in this pull request?
    
    This is a follow-up to a bug identified in apache#45362. When setting `maxCachedFiles` to 0 (to force a full relisting of files for each batch, see https://issues.apache.org/jira/browse/SPARK-44924), subsequent batches of files would be skipped due to a logic error that carried forward an empty array of `unreadFiles`, which was only being null-checked. This update adds checks to verify that `unreadFiles` is also non-empty, as a guard condition to prevent batches executing with no files, and ensures that `unreadFiles` is only set if a) there are files remaining in the listing and b) `maxCachedFiles` is greater than 0.
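
    A hedged usage sketch of the scenario in question (the path and format are made up, and it is assumed here that `maxCachedFiles` is passed as a file-source option alongside `maxFilesPerTrigger`, per SPARK-44924):

    ```scala
    // Hypothetical usage: force a full file listing on every micro-batch.
    val stream = spark.readStream
      .format("text")
      .option("maxFilesPerTrigger", 10)
      .option("maxCachedFiles", 0)   // with this fix, 0 relists files each batch instead of producing empty batches
      .load("/tmp/streaming-input")
    ```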
    
    ### Why are the changes needed?
    
    Setting the `maxCachedFiles` configuration to 0 would inadvertently cause every other batch to contain 0 files, which is an unexpected behavior for users.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Fixes the case where users may want to always perform a full listing of files each batch by setting `maxCachedFiles` to 0
    
    ### How was this patch tested?
    
    New test added to verify `maxCachedFiles` set to 0 would perform a file listing each batch
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47195 from ragnarok56/filestreamsource-maxcachedfiles-edgecase.
    
    Lead-authored-by: ragnarok56 <[email protected]>
    Co-authored-by: Kevin Nacios <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    ragnarok56 authored and jingz-db committed Jul 22, 2024
    ee8bf0c
  5. [SPARK-48810][CONNECT] Session stop() API should be idempotent and no…

    …t fail if the session is already closed by the server
    
    ### What changes were proposed in this pull request?
    
    Improve the error handling of the `stop()` API in the `SparkSession`
    class so that it does not throw if there is any error related to releasing a session or
    closing the underlying gRPC channel. Both are best effort.
    
    In the case of Pyspark, do not fail if the local Spark Connect service
    cannot be stopped.
    
    ### Why are the changes needed?
    
    In some cases, the Spark Connect Service will terminate the session, usually
    because the underlying cluster or driver has restarted.
    In these cases, calling stop() throws an error that is unactionable. However,
    stop() still needs to be called in order to reset the active session.
    
    Further, the stop() API should be idempotent.
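
    A rough Spark Connect (Scala client) sketch of the intended behavior; the connection string is made up and the failure scenario is simulated only in the comments:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
    // ... the server closes the session, e.g. after a cluster or driver restart ...
    spark.stop()   // best effort: release-session and channel-close errors are swallowed
    spark.stop()   // idempotent: calling it again does not fail
    ```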
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Attached unit tests.
    
    Confirmed that removing the code changes results in the
    tests failing (as expected).
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47215 from nija-at/session-stop.
    
    Authored-by: Niranjan Jayakar <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    nija-at authored and jingz-db committed Jul 22, 2024
    3c1c316
  6. [SPARK-48818][PYTHON] Simplify percentile functions

    ### What changes were proposed in this pull request?
     Simplify `percentile` functions
    
    ### Why are the changes needed?
    existing implementations are unnecessarily complicated
    
    ### Does this PR introduce _any_ user-facing change?
    No, minor refactoring
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47225 from zhengruifeng/func_refactor_1.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    bec1c16
  7. [SPARK-48825][DOCS] Unify the 'See Also' section formatting across Py…

    …Spark docstrings
    
    ### What changes were proposed in this pull request?
    
    This PR unifies the 'See Also' section formatting across PySpark docstrings and fixes some invalid references.
    
    ### Why are the changes needed?
    
    To improve PySpark documentation
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    doctest
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47240 from allisonwang-db/spark-48825-also-see-docs.
    
    Authored-by: allisonwang-db <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    allisonwang-db authored and jingz-db committed Jul 22, 2024
    9703899
  8. [SPARK-48343][SQL] Introduction of SQL Scripting interpreter

    ### What changes were proposed in this pull request?
    A previous [PR](apache#46665) introduced parser changes for SQL Scripting. This PR is a follow-up that introduces the interpreter for the SQL Scripting language and proposes the following changes:
    - `SqlScriptingExecutionNode` - introduces execution nodes for SQL scripting, used during interpretation phase:
      - `SingleStatementExec` - executable node for `SingleStatement` logical node; wraps logical plan of the single statement.
      - `CompoundNestedStatementIteratorExec` - implements base recursive iterator logic for all nesting statements.
      - `CompoundBodyExec` - concrete implementation of `CompoundNestedStatementIteratorExec` for `CompoundBody` logical node.
    - `SqlScriptingInterpreter` - introduces the interpreter for SQL scripts. The product of interpretation is an iterator over the statements that should be executed.
    
    Follow-up PRs will introduce further statements, support for exceptions thrown from parser/interpreter, exception handling in SQL, etc.
    More details can be found in [Jira item](https://issues.apache.org/jira/browse/SPARK-48343) for this task and its parent (where the design doc is uploaded as well).
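
    As a hedged illustration only (the sql() API is unchanged by this PR), this is the kind of compound script the interpreter walks, statement by statement; each SELECT maps to a `SingleStatementExec` and each BEGIN ... END block to a `CompoundBodyExec`:

    ```scala
    // Illustrative script text; nesting exercises the recursive iterator logic described above.
    val script =
      """BEGIN
        |  SELECT 1;
        |  BEGIN
        |    SELECT 2;
        |    SELECT 3;
        |  END;
        |END""".stripMargin
    ```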
    
    ### Why are the changes needed?
    The intent is to add support for SQL scripting (and stored procedures down the line). It gives users the ability to develop complex logic and ETL entirely in SQL.
    
    Until now, users had to write verbose SQL statements or combine SQL + Python to efficiently write the logic. This is an effort to bridge that gap and enable complex logic to be written entirely in SQL.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    This PR is the second in a series of PRs that will introduce changes to the sql() API to add support for SQL scripting, but for now, the API remains unchanged.
    In the future, the API will remain the same as well, but it will gain the new ability to execute SQL scripts.
    
    ### How was this patch tested?
    There are tests for the newly introduced changes:
    - `SqlScriptingExecutionNodeSuite` - unit tests for execution nodes.
    - `SqlScriptingInterpreterSuite` - tests for interpreter (with parser integration).
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47026 from davidm-db/sql_scripting_interpreter.
    
    Authored-by: David Milicevic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    davidm-db authored and jingz-db committed Jul 22, 2024
    357b483
  9. [SPARK-48776] Fix timestamp formatting for json, xml and csv

    ### What changes were proposed in this pull request?
    
    In this pull request I propose to change the default ISO pattern we use for formatting timestamps when writing to JSON, XML and/or CSV, as well as when to_(xml|json|csv) is used.

    Older timestamps sometimes have offsets that contain a seconds part as well. The current default formatting omits the seconds, hence producing wrong results.
    
    e.g.
    ```
    sql("SET spark.sql.session.timeZone=America/Los_Angeles")
    sql("SELECT to_json(struct(CAST('1800-01-01T00:00:00+00:00' AS TIMESTAMP) AS ts))").show(false)
    {"ts":"1799-12-31T16:07:02.000-07:52"}
    ```
    
    ### Why are the changes needed?
    
    This is a correctness issue.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, users will now see different results for older timestamps (correct ones).
    
    ### How was this patch tested?
    
    Tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47177 from milastdbx/dev/milast/fixJsonTimestampHandling.
    
    Authored-by: milastdbx <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    milastdbx authored and jingz-db committed Jul 22, 2024
    c733573
  10. [SPARK-48798][PYTHON] Introduce spark.profile.render for SparkSessi…

    …on-based profiling
    
    ### What changes were proposed in this pull request?
    
    Introduces `spark.profile.render` for SparkSession-based profiling.
    
    It uses [`flameprof`](https://github.com/baverman/flameprof/) for the default renderer.
    
    ```
    $ pip install flameprof
    ```
    
    run `pyspark` on Jupyter notebook:
    
    ```py
    from pyspark.sql.functions import pandas_udf
    
    spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
    
    df = spark.range(10)
    
    pandas_udf("long")
    def add1(x):
        return x + 1
    
    added = df.select(add1("id"))
    added.show()
    
    spark.profile.render(id=2)
    ```
    
    <img width="1103" alt="pyspark-udf-profile" src="https://github.com/apache/spark/assets/506656/795972e8-f7eb-4b89-89fc-3d8d18b86541">
    
    On CLI, it will return `svg` source string.
    
    ```py
    '<?xml version="1.0" standalone="no"?>\n<!DOCTYPE svg  ...
    ```
    
    Currently only `renderer="flameprof"` for `type="perf"` is supported as a builtin renderer.
    
    You can also pass an arbitrary renderer.
    
    ```py
    def render_perf(stats):
        ...
    spark.profile.render(id=2, type="perf", renderer=render_perf)
    
    def render_memory(codemap):
        ...
    spark.profile.render(id=2, type="memory", renderer=render_memory)
    ```
    
    ### Why are the changes needed?
    
    Better debuggability.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, `spark.profile.render` will be available.
    
    ### How was this patch tested?
    
    Added/updated the related tests, and manually.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47202 from ueshin/issues/SPARK-48798/render.
    
    Authored-by: Takuya Ueshin <[email protected]>
    Signed-off-by: Takuya Ueshin <[email protected]>
    ueshin authored and jingz-db committed Jul 22, 2024
    632825d
  11. [MINOR][PYTHON] Eliminating warnings for panda

    ### What changes were proposed in this pull request?
    The PR aims to eliminate warnings for pandas: `<string>:5: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.`
    
    ### Why are the changes needed?
    Only eliminates warnings for pandas.
    https://github.com/panbingkun/spark/actions/runs/9795675050/job/27048513673
    <img width="856" alt="image" src="https://github.com/apache/spark/assets/15246973/ea70e922-897e-450f-b150-3d38d7f20930">
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    - Pass GA.
    - Manually test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47222 from panbingkun/remove_pandas_warning.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    561bfc2
  12. [SPARK-48307][SQL][FOLLOWUP] Eliminate the use of mutable.ArrayBuffer

    ### What changes were proposed in this pull request?
    
    We can eliminate the use of mutable.ArrayBuffer by using `flatMap`.
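
    A generic sketch of the refactoring pattern (not the actual Spark code):

    ```scala
    import scala.collection.mutable

    // Accumulating into a mutable buffer ...
    val input = Seq(Some(1), None, Some(3))
    val buf = mutable.ArrayBuffer[Int]()
    input.foreach(opt => opt.foreach(v => buf += v))

    // ... can be expressed as a single flatMap with no mutable state.
    val result = input.flatMap(_.toSeq)   // Seq(1, 3)
    ```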
    
    ### Why are the changes needed?
    
    Code simplification and optimization.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing UT
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47185 from amaliujia/followup_cte.
    
    Lead-authored-by: Rui Wang <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    86bff6d
  13. [SPARK-48742][SS] Virtual Column Family for RocksDB

    ### What changes were proposed in this pull request?
    
    Introducing virtual column families to RocksDB. We attach a 2-byte ID prefix as a column family identifier to each key row that is put into RocksDB. The encoding and decoding of the virtual column family prefix happen at the `RocksDBKeyEncoder` layer, where we can pre-allocate the extra 2 bytes and avoid an additional memcpy.
    
    - Remove physical column family related code, as it becomes potentially dead code until some caller starts using it.
    - Remove `useColumnFamilies` from `StateStoreChangelogV2` API.
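
    A hedged sketch of the prefixing idea described above (names and byte layout are illustrative, not the actual `RocksDBKeyEncoder` code):

    ```scala
    // Prepend a 2-byte virtual column family id to each key, with a single array copy.
    def encodeWithColFamily(cfId: Short, key: Array[Byte]): Array[Byte] = {
      val out = new Array[Byte](2 + key.length)       // pre-allocate the extra 2 bytes
      out(0) = ((cfId >> 8) & 0xFF).toByte
      out(1) = (cfId & 0xFF).toByte
      System.arraycopy(key, 0, out, 2, key.length)
      out
    }

    // Recover the virtual column family id from an encoded key.
    def decodeColFamily(encoded: Array[Byte]): Short =
      (((encoded(0) & 0xFF) << 8) | (encoded(1) & 0xFF)).toShort
    ```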
    
    ### Why are the changes needed?
    
    Currently, within the scope of the arbitrary stateful API v2 (transformWithState) project, each state variable is stored inside one [physical column family](https://github.com/facebook/rocksdb/wiki/Column-Families) within the RocksDB state store instance. Column families are also used to implement secondary indexes for various features. Each physical column family has its own memtables, creates its own SST files, and handles compaction independently on those independent SST files.

    When the number of operations to RocksDB is relatively small and the number of column families is relatively large, the overhead of handling small SST files becomes high, especially since all of these have to be uploaded to the snapshot dir and referenced in the metadata file for the uploaded RocksDB snapshot. Using a prefix to manage different key spaces (virtual column families) reduces such overhead.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. If `useColumnFamilies` is set to true in `StateStore.init()`, virtual column families will be used.
    
    ### How was this patch tested?
    
    Unit tests in `RocksDBStateStoreSuite`, and integration tests in `TransformWithStateSuite`.
    Moved test suites in `RocksDBSuite` into `RocksDBStateStoreSuite` because some previous verification functions are now moved into `RocksDBStateProvider`
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47107 from jingz-db/virtual-col-family.
    
    Lead-authored-by: jingz-db <[email protected]>
    Co-authored-by: Jing Zhan <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    jingz-db and jingz-db committed Jul 22, 2024
    9ed19c0
  14. [SPARK-48801][BUILD][K8S] Upgrade kubernetes-client to 6.13.1

    ### What changes were proposed in this pull request?
    The PR aims to upgrade `kubernetes-client` from `6.13.0` to `6.13.1`.
    
    ### Why are the changes needed?
    - The full release notes: https://github.com/fabric8io/kubernetes-client/releases/tag/v6.13.1
    - The newest version fixes some bugs, e.g.:
      Fix fabric8io/kubernetes-client#6059: Swallow rejected execution from internal usage of the informer executor
      Fix fabric8io/kubernetes-client#6068: KubernetesMockServer provides incomplete Configuration while creating test Config for KubernetesClient
      Fix fabric8io/kubernetes-client#6085: model getters have same annotations as fields (breaks native)
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47206 from panbingkun/SPARK-48801.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    ec6fadb
  15. [SPARK-48837][ML] In CountVectorizer, only read binary parameter once…

    … per transform, not once per row
    
    ### What changes were proposed in this pull request?
    
    apache#11536 added a `binary` toggle parameter to `CountVectorizer`, but the parameter evaluation occurs inside of the vectorizer UDF itself: this causes expensive parameter reading to occur once-per-row instead of once-per-transform.
    
    This PR addresses this issue by updating the code to only read the parameter once, similar to what was already being done for the `minTf` parameter.
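
    A simplified, hedged illustration of the pattern (not the actual CountVectorizer code): read the parameter once per transform and let the UDF close over the plain value.

    ```scala
    import org.apache.spark.sql.functions.udf

    def readBinaryParam(): Boolean = true   // stand-in for the expensive Params lookup, e.g. $(binary)

    val isBinary = readBinaryParam()        // evaluated once, outside the UDF
    val binarize = udf { (counts: Seq[Double]) =>
      if (isBinary) counts.map(c => if (c > 0) 1.0 else 0.0) else counts
    }
    ```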
    
    ### Why are the changes needed?
    
    Address a performance issue.
    
    I spotted this issue when I saw the stack
    
    ```scala
    [...]
    at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:204)
    at scala.collection.IndexedSeqOptimized.exists(IndexedSeqOptimized.scala:49)
    at scala.collection.IndexedSeqOptimized.exists$(IndexedSeqOptimized.scala:49)
    at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:198)
    at org.apache.spark.ml.param.Params.hasParam(params.scala:701)
    at org.apache.spark.ml.param.Params.hasParam$(params.scala:700)
    at org.apache.spark.ml.PipelineStage.hasParam(Pipeline.scala:42)
    at org.apache.spark.ml.param.Params.shouldOwn(params.scala:856)
    at org.apache.spark.ml.param.Params.get(params.scala:739)
    at org.apache.spark.ml.param.Params.get$(params.scala:738)
    at org.apache.spark.ml.PipelineStage.get(Pipeline.scala:42)
    at org.apache.spark.ml.param.Params.getOrDefault(params.scala:759)
    at org.apache.spark.ml.param.Params.getOrDefault$(params.scala:757)
    at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
    at org.apache.spark.ml.param.Params.$(params.scala:766)
    at org.apache.spark.ml.param.Params.$$(params.scala:766)
    at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
    at org.apache.spark.ml.feature.CountVectorizerModel.$anonfun$transform$1(CountVectorizer.scala:326)
    at org.apache.spark.ml.feature.CountVectorizerModel$$Lambda$12153/1200761496.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Source)
    [...]
    ```
    
    while investigating an unrelated issue.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing unit tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47258 from JoshRosen/CountVectorizer-conf.
    
    Authored-by: Josh Rosen <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    JoshRosen authored and jingz-db committed Jul 22, 2024
    2f586c3
  16. [SPARK-48803][SQL] Throw internal error in Orc(De)serializer to align…

    … with ParquetWriteSupport
    
    ### What changes were proposed in this pull request?
    
    As a kind of follow-up to apache#44275, this PR aligns two similar code paths with different error messages into one.
    
    ```java
    24/07/03 16:29:01 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    org.apache.spark.SparkException: [INTERNAL_ERROR] Unsupported data type VarcharType(64). SQLSTATE: XX000
    	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
    	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
    ```
    
    ```java
    org.apache.spark.SparkUnsupportedOperationException: VarcharType(64) is not supported yet.
    	at org.apache.spark.sql.errors.QueryExecutionErrors$.dataTypeUnsupportedYetError(QueryExecutionErrors.scala:993)
    	at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.newConverter(OrcSerializer.scala:209)
    	at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.$anonfun$converters$2(OrcSerializer.scala:35)
    	at scala.collection.immutable.List.map(List.scala:247)
    ```
    
    ### Why are the changes needed?
    
    improvement
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, users shouldn't face such errors in regular cases.
    ### How was this patch tested?
    
    passing existing tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47208 from yaooqinn/SPARK-48803.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    07329e5
  17. [SPARK-48623][CORE] Structured logging migrations [Part 2]

    ### What changes were proposed in this pull request?
    This PR makes additional Scala logging migrations to comply with the scala style changes in apache#46947
    
    ### Why are the changes needed?
    This makes development and PR review of the structured logging migration easier.
    
    ### Does this PR introduce any user-facing change?
    No
    
    ### How was this patch tested?
    Tested by ensuring dev/scalastyle checks pass
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47256 from asl3/morestructuredloggingmigrations.
    
    Authored-by: Amanda Liu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    asl3 authored and jingz-db committed Jul 22, 2024
    7035755
  18. [SPARK-48804][SQL] Add classIsLoadable & OutputCommitter.isAssignable…

    …From check for output committer class configurations
    
    ### What changes were proposed in this pull request?
    
    This pull request proposes adding a check for the class values provided by users in `spark.sql.sources.outputCommitterClass` and `spark.sql.parquet.output.committer.class`, to make sure the given class is visible on the classpath and is a subclass of `org.apache.hadoop.mapreduce.OutputCommitter`.
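
    A hedged sketch of the validation idea (not the actual Spark code): fail fast when the configured committer class is missing from the classpath or has the wrong supertype.

    ```scala
    import org.apache.hadoop.mapreduce.OutputCommitter

    def validateCommitterClass(className: String): Unit = {
      val cls = Class.forName(className)   // throws ClassNotFoundException if not loadable
      require(classOf[OutputCommitter].isAssignableFrom(cls),
        s"$className is not a subclass of org.apache.hadoop.mapreduce.OutputCommitter")
    }
    ```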
    
    ### Why are the changes needed?
    
    Ensure that an invalid configuration results in immediate application or query failure rather than failing late during setupJob.
    ### Does this PR introduce _any_ user-facing change?
    
    no
    ### How was this patch tested?
    
    new tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47209 from yaooqinn/SPARK-48804.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    dc237b2
  19. [SPARK-48771][SQL] Speed up `LogicalPlanIntegrity.validateExprIdUniqu…

    …eness` for large query plans
    
    ### What changes were proposed in this pull request?
    
    This PR rewrites `LogicalPlanIntegrity.hasUniqueExprIdsForOutput` to only traverse the query plan once and avoids expensive Scala collections operations like `.flatten`, `.groupBy`, and `.distinct`.
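
    A generic illustration of the optimization pattern (not the Spark code itself): grouping to find duplicates builds several intermediate collections, while a single pass with a mutable set stops at the first duplicate.

    ```scala
    import scala.collection.mutable

    val ids = Seq(1L, 2L, 2L, 3L)

    // Multiple-pass approach with intermediate collections.
    val hasDuplicatesSlow = ids.groupBy(identity).exists(_._2.size > 1)

    // Single-pass approach with a mutable set.
    val seen = mutable.HashSet.empty[Long]
    val hasDuplicatesFast = ids.exists(id => !seen.add(id))
    ```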
    
    ### Why are the changes needed?
    
    Speeds up query compilation when plan validation is enabled.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Made sure existing UTs pass.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47170 from kelvinjian-db/SPARK-48771-speed-up.
    
    Authored-by: Kelvin Jiang <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    kelvinjian-db authored and jingz-db committed Jul 22, 2024
    154e350
  20. [SPARK-48816][SQL] Shorthand for interval converters in UnivocityParser

    ### What changes were proposed in this pull request?
    
    Directly call `IntervalUtils.castStringToDTInterval/castStringToYMInterval` instead of creating Cast expressions to evaluate.
    
    - Benchmarks indicated a 10% time-saving.
    - Bad record recording might not work if the cast handles the exceptions early
    
    ### Why are the changes needed?
    
    - perf improvement
    - Bugfix for bad record recording
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    passing existing tests and benchmark tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47227 from yaooqinn/SPARK-48816.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    f287c78
  21. [SPARK-48826][BUILD] Upgrade fasterxml.jackson to 2.17.2

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade `fasterxml.jackson` from 2.17.1 to 2.17.2.
    
    ### Why are the changes needed?
    
    There are some bug fixes about [Databind](https://github.com/FasterXML/jackson-databind):
    [apache#4561](FasterXML/jackson-databind#4561): Issues using jackson-databind 2.17.1 with Reactor (wrt DeserializerCache and ReentrantLock)
    [apache#4575](FasterXML/jackson-databind#4575): StdDelegatingSerializer does not consider a Converter that may return null for a non-null input
    [apache#4577](FasterXML/jackson-databind#4577): Cannot deserialize value of type java.math.BigDecimal from String "3." (not a valid representation)
    [apache#4595](FasterXML/jackson-databind#4595): No way to explicitly disable wrapping in custom annotation processor
    [apache#4607](FasterXML/jackson-databind#4607): MismatchedInput: No Object Id found for an instance of X to assign to property 'id'
    [apache#4610](FasterXML/jackson-databind#4610): DeserializationFeature.FAIL_ON_UNRESOLVED_OBJECT_IDS does not work when used with Polymorphic type handling
    
    The full release note of 2.17.2:
    https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.17.2
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47241 from wayneguow/upgrade_jackson.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    ca24fc0
  22. [SPARK-48840][INFRA] Remove unnecessary existence check for `./dev/fr…

    …ee_disk_space_container`
    
    ### What changes were proposed in this pull request?
    This PR removes the check for the existence of `./dev/free_disk_space_container` before execution, because `./dev/free_disk_space_container` has already been backported to branch-3.4 and branch-3.5 through apache#45624 and apache#43381, so there is no need to check its existence before execution.
    
    ### Why are the changes needed?
    Remove unnecessary existence check for `./dev/free_disk_space_container`.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Pass GitHub Actions
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47263 from LuciferYang/SPARK-48840.
    
    Authored-by: yangjie01 <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    LuciferYang authored and jingz-db committed Jul 22, 2024
    925673e
  23. [SPARK-48743][SQL][SS] MergingSessionIterator should better handle wh…

    …en getStruct returns null
    
    ### What changes were proposed in this pull request?
    
    The getStruct() method used in `MergingSessionsIterator.initialize` could return a null value. When that happens, the copy() called on it throws a NullPointerException.
    
    We see an exception thrown there:
    ```
    java.lang.NullPointerException: <Redacted Exception Message>
    	at org.apache.spark.sql.execution.aggregate.MergingSessionsIterator.initialize(MergingSessionsIterator.scala:121)
    	at org.apache.spark.sql.execution.aggregate.MergingSessionsIterator.<init>(MergingSessionsIterator.scala:130)
    	at org.apache.spark.sql.execution.aggregate.MergingSessionsExec.$anonfun$doExecute$1(MergingSessionsExec.scala:93)
    	at org.apache.spark.sql.execution.aggregate.MergingSessionsExec.$anonfun$doExecute$1$adapted(MergingSessionsExec.scala:72)
    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2(RDD.scala:920)
    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndexInternal$2$adapted(RDD.scala:920)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
    	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:409)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:406)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:373)
    	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:201)
    	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:189)
    	at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:154)
    	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:129)
    	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:148)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.scheduler.Task.run(Task.scala:101)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:984)
    	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:105)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:987)
    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:879)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:750)
    
    ```
    
    It is still not clear why that field could be null, but in general Spark should not throw NPEs. So this PR proposes to wrap it with SparkException.internalError with more details.
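
    A hedged sketch of the defensive pattern (not the actual MergingSessionsIterator code): surface a descriptive internal error instead of letting copy() throw a bare NullPointerException.

    ```scala
    import org.apache.spark.SparkException
    import org.apache.spark.sql.catalyst.InternalRow

    def copyGroupingKey(row: InternalRow, ordinal: Int, numFields: Int): InternalRow = {
      val struct = row.getStruct(ordinal, numFields)
      if (struct == null) {
        throw SparkException.internalError(
          s"Unexpected null struct at ordinal $ordinal while initializing the iterator")
      }
      struct.copy()
    }
    ```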
    
    ### Why are the changes needed?
    
    Improvements.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    This is a hard-to-repro issue. The change should not cause any harm.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47134 from WweiL/SPARK-48743-mergingSessionIterator-null-init.
    
    Authored-by: Wei Liu <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    WweiL authored and jingz-db committed Jul 22, 2024
    dfdda1c
  24. [MINOR][DOCS] Fix some typos in docs

    ### What changes were proposed in this pull request?
    The PR aims to fix some typos in the docs, including `docs/sql-ref-syntax-qry-star.md`, `docs/running-on-kubernetes.md` and `connector/profiler/README.md`.
    
    ### Why are the changes needed?
    https://spark.apache.org/docs/4.0.0-preview1/sql-ref-syntax-qry-star.html
    In some SQL examples in the doc `docs/sql-ref-syntax-qry-star.md`, the Unicode character 'SINGLE QUOTATION MARK' was used, which resulted in end-users being unable to execute them successfully after copy-paste, e.g.:
    <img width="660" alt="image" src="https://github.com/apache/spark/assets/15246973/055aa0a8-602e-4ea7-a065-c8e0353c6fb3">
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, end-users will get more user-friendly docs.
    
    ### How was this patch tested?
    Manually test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47261 from panbingkun/fix_typo_docs.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    28e4971
  25. [SPARK-48831][CONNECT] Make default column name of cast compatible …

    …with Spark Classic
    
    ### What changes were proposed in this pull request?
    
    I think there are two issues regarding the default column name of `cast`:
    1. It seems unclear when the name should be the input column name versus `CAST(...)`, e.g. in Spark Classic:
    ```
    scala> spark.range(1).select(col("id").cast("string"), lit(1).cast("string"), col("id").cast("long"), lit(1).cast("long")).printSchema
    warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
    root
     |-- id: string (nullable = false)
     |-- CAST(1 AS STRING): string (nullable = false)
     |-- id: long (nullable = false)
     |-- CAST(1 AS BIGINT): long (nullable = false)
    ```
    
    2. The column name is not consistent between Spark Connect and Spark Classic.
    
    This PR aims to resolve the second issue, that is, making the default column name of `cast` compatible with Spark Classic, by comparing with the classic implementation:
    https://github.com/apache/spark/blob/9cf6dc873ff34412df6256cdc7613eed40716570/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L1208-L1212
    
    ### Why are the changes needed?
    The default column name is not consistent with Spark Classic.
    
    ### Does this PR introduce _any_ user-facing change?
    yes,
    
    spark classic:
    ```
    In [2]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
    +-------------------------+-------------------+-------------------+-------------------+
    |CAST(X'313233' AS STRING)|CAST(123 AS STRING)|CAST(123 AS BIGINT)|CAST(123 AS DOUBLE)|
    +-------------------------+-------------------+-------------------+-------------------+
    |                      123|                123|                123|              123.0|
    +-------------------------+-------------------+-------------------+-------------------+
    ```
    
    spark connect (before):
    ```
    In [3]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
    +---------+---+---+-----+
    |X'313233'|123|123|  123|
    +---------+---+---+-----+
    |      123|123|123|123.0|
    +---------+---+---+-----+
    ```
    
    spark connect (after):
    ```
    In [2]: spark.range(1).select(sf.lit(b'123').cast("STRING"), sf.lit(123).cast("STRING"), sf.lit(123).cast("LONG"), sf.lit(123).cast("DOUBLE")).show()
    +-------------------------+-------------------+-------------------+-------------------+
    |CAST(X'313233' AS STRING)|CAST(123 AS STRING)|CAST(123 AS BIGINT)|CAST(123 AS DOUBLE)|
    +-------------------------+-------------------+-------------------+-------------------+
    |                      123|                123|                123|              123.0|
    +-------------------------+-------------------+-------------------+-------------------+
    ```
    
    ### How was this patch tested?
    added test
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47249 from zhengruifeng/py_fix_cast.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    5f51081
  26. [SPARK-48760][SQL][DOCS][FOLLOWUP] Add CLUSTER BY to doc `sql-ref-s…

    …yntax-ddl-alter-table.md`
    
    ### What changes were proposed in this pull request?
    This PR follows up apache#47156 and aims to:
    - add `CLUSTER BY` to doc `sql-ref-syntax-ddl-alter-table.md`
    - move parser tests from `o.a.s.s.c.p.DDLParserSuite` to `AlterTableClusterByParserSuite`
    - use `checkError` to check exception in `o.a.s.s.e.c.AlterTableClusterBySuiteBase`
    
    ### Why are the changes needed?
    - Enable the doc `sql-ref-syntax-ddl-alter-table.md` to cover new syntax `ALTER TABLE ... CLUSTER BY ...`.
    - Align with other similar tests, eg: AlterTableRename*
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. End-users can now look up the explanation of `CLUSTER BY` in the doc `sql-ref-syntax-ddl-alter-table.md`.
    
    ### How was this patch tested?
    Updated UT.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47254 from panbingkun/SPARK-48760_FOLLOWUP.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    2195af7
  27. [SPARK-46625] CTE with Identifier clause as reference

    ### What changes were proposed in this pull request?
    DECLARE agg = 'max';
    DECLARE col = 'c1';
    DECLARE tab = 'T';
    
    WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
          T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
    SELECT IDENTIFIER(agg)(IDENTIFIER(col)) FROM IDENTIFIER(tab);
    
    -- OR
    
    WITH S(c1, c2) AS (VALUES(1, 2), (2, 3)),
          T(c1, c2) AS (VALUES ('a', 'b'), ('c', 'd'))
    SELECT IDENTIFIER('max')(IDENTIFIER('c1')) FROM IDENTIFIER('T');
    
    Currently we don't support the IDENTIFIER clause as part of a CTE reference.
    
    ### Why are the changes needed?
    Adds support for the IDENTIFIER clause as part of a CTE reference, for both constant string expressions and session variables.
    
    ### Does this PR introduce _any_ user-facing change?
    It contains user-facing changes in the sense that an IDENTIFIER clause as a CTE reference will now be supported.
    
    ### How was this patch tested?
    Added tests as part of this PR.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47180 from nebojsa-db/SPARK-46625.
    
    Authored-by: Nebojsa Savic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    nebojsa-db authored and jingz-db committed Jul 22, 2024
    9bcff35
  28. [SPARK-48716] Add jobGroupId to SparkListenerSQLExecutionStart

    ### What changes were proposed in this pull request?
    Add jobGroupId to SparkListenerSQLExecutionStart
    
    ### Why are the changes needed?
    JobGroupId can be used to combine jobs within the same group. This is useful in listeners, as it makes job grouping easy to do.
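
    A hedged sketch of how a listener could consume the new field; the exact accessor name and type of the job group id on the event are assumptions based on this PR's title.

    ```scala
    import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
    import org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart

    class JobGroupingListener extends SparkListener {
      override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
        case e: SparkListenerSQLExecutionStart =>
          println(s"SQL execution ${e.executionId} started in job group ${e.jobGroupId}")
        case _ => // ignore other events
      }
    }
    ```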
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Unit Test
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47092 from gjxdxh/gjxdxh/SPARK-48716.
    
    Authored-by: Lingkai Kong <[email protected]>
    Signed-off-by: Josh Rosen <[email protected]>
    gjxdxh authored and jingz-db committed Jul 22, 2024
    1cb76c3
  29. [SPARK-44728][PYTHON][DOCS] Fix the incorrect naming and missing para…

    …ms in func docs in `builtin.py`
    
    ### What changes were proposed in this pull request?
    
    Fix the incorrect naming and missing params in func docs in `builtin.py`.
    
    ### Why are the changes needed?
    
    Some params' names in the PySpark docs are wrong, for example:
    ![image](https://github.com/apache/spark/assets/16032294/af0ca3c9-b085-4364-8cfc-814371f21b4b)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Passed GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47269 from wayneguow/py_docs.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    520e519
  30. [SPARK-48817][SQL] Eagerly execute union multi commands together

    ### What changes were proposed in this pull request?
    
    Eagerly execute union multi commands together.
    
    ### Why are the changes needed?
    Multi-insert is split into multiple SQL executions, resulting in no exchange reuse.
    
    Reproduce sql:
    
    ```
    create table wangzhen_t1(c1 int);
    create table wangzhen_t2(c1 int);
    create table wangzhen_t3(c1 int);
    insert into wangzhen_t1 values (1), (2), (3);
    
    from (select /*+ REPARTITION(3) */ c1 from wangzhen_t1)
    insert overwrite table wangzhen_t2 select c1
    insert overwrite table wangzhen_t3 select c1;
    ```
    
    In Spark 3.1, there is only one SQL execution and there is a reuse exchange.
    
    ![image](https://github.com/apache/spark/assets/17894939/5ff68392-aaa8-4e6b-8cac-1687880796b9)
    
    However, in Spark 3.5, it was split into multiple executions and there was no ReuseExchange.
    
    ![image](https://github.com/apache/spark/assets/17894939/afdb14b6-5007-4923-802d-535149974ecf)
    ![image](https://github.com/apache/spark/assets/17894939/0d60e8db-9da7-4906-8d07-2b622b55e6ab)
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, multi-inserts will be executed in one execution.
    
    ### How was this patch tested?
    
    added unit test
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47224 from wForget/SPARK-48817.
    
    Authored-by: wforget <[email protected]>
    Signed-off-by: youxiduo <[email protected]>
    wForget authored and jingz-db committed Jul 22, 2024
    bb75014
  31. [SPARK-48822][DOCS] Add examples section header to format_number do…

    …cstring
    
    ### What changes were proposed in this pull request?
    This PR adds an "Examples" section header to the `format_number` docstring.
    
    ### Why are the changes needed?
    To improve the documentation.
    
    ### Does this PR introduce any user-facing change?
    No changes in behavior are introduced.
    
    ### How was this patch tested?
    Existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47237 from thomhart31/docs-format_number.
    
    Lead-authored-by: thomas.hart <[email protected]>
    Co-authored-by: Thomas Hart <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    d458c46
  32. [SPARK-36680][SQL] Supports Dynamic Table Options for Spark SQL

    ### What changes were proposed in this pull request?
    In Spark SQL, add the 'WITH OPTIONS' syntax to support dynamic relation options.
    
    This is a continuation of apache#41683 based on cloud-fan's nice suggestion.
    That was itself a continuation of apache#34072.
    
    ### Why are the changes needed?
    
    This will allow Spark SQL to have equivalence to the DataFrameReader API. For example, it is possible today to specify options to data sources via the API as follows:
    
    ```
     spark.read.format("jdbc").option("fetchSize", 0).load()
    ```
    
    This PR allows an equivalent Spark SQL syntax to specify options:
    ```
    SELECT * FROM jdbcTable WITH OPTIONS(`fetchSize` = 0)
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Unit test in DataSourceV2SQLSuite
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#46707 from szehon-ho/spark-36680.
    
    Authored-by: Szehon Ho <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    szehon-ho authored and jingz-db committed Jul 22, 2024
    5330b88
  33. [SPARK-48772][SS][SQL] State Data Source Change Feed Reader Mode

    ### What changes were proposed in this pull request?

    This PR adds to the state data source the ability to show the evolution of state in Change Data Capture (CDC) format.

    An example usage:
    ```
    .format("statestore")
    .option("readChangeFeed", true)
    .option("changeStartBatchId", 5) #required
    .option("changeEndBatchId", 10)  #not required, default: latest batch Id available
    ```
    _Note that this mode does not support the option "joinSide"._

    ### Why are the changes needed?

    The current state reader can only return the entire state at a specific version. If an error occurs related to state, knowing how state changed across versions, to find out at which version it starts to go wrong, is important for debugging purposes.

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Adds a new test suite `StateDataSourceChangeDataReadSuite` that includes 1) testing input error 2) testing new API added 3) integration test.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.
    
    Closes apache#47188 from eason-yuchen-liu/readStateChange.
    
    Lead-authored-by: Yuchen Liu <[email protected]>
    Co-authored-by: Yuchen Liu <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    00a5972
  34. [SPARK-48807][SQL] Binary Support for CSV datasource

    ### What changes were proposed in this pull request?
    
    SPARK-42237 disabled binary output for CSV because binary values used `java.lang.Object.toString` for output. Now that we have meaningful binary string representation support in UnivocityGenerator, we can support it.
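
    A hedged illustration (output path and data are made up): with this change, a binary column can be written to CSV using its string representation instead of failing.

    ```scala
    val df = spark.sql("SELECT CAST('spark' AS BINARY) AS payload")
    df.write.mode("overwrite").csv("/tmp/binary-csv-output")
    ```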
    
    ### Why are the changes needed?
    
    Improve CSV support for Spark SQL types.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, but only in that writing binary CSV tables changes from failing to succeeding.
    
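    A minimal sketch of the newly allowed behavior (the data and output path are made up, and an active SparkSession `spark` is assumed):

    ```python
    # Writing a DataFrame with a BinaryType column to CSV; before this change the
    # CSV writer rejected binary columns.
    df = spark.createDataFrame([(bytearray(b"spark"),)], "b binary")
    df.write.mode("overwrite").csv("/tmp/binary_csv_demo")  # path is illustrative
    ```
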
    ### How was this patch tested?
    
    new tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47212 from yaooqinn/SPARK-48807.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    091b99d View commit details
    Browse the repository at this point in the history
  35. [SPARK-48848][PYTHON][DOCS] Set the upper bound version of `sphinxcon…

    …trib-*` in `dev/requirements.txt` with `sphinx==4.5.0`
    
    ### What changes were proposed in this pull request?
    
    This PR aims to set the upper bound versions of `sphinxcontrib-*` in `dev/requirements.txt` to be compatible with `sphinx==4.5.0`.
    
    ### Why are the changes needed?
    
    Currently, if Spark developers use the command `pip install --upgrade -r dev/requirements.txt` directly to install Python-related dependencies, the automatically installed `sphinxcontrib-*` versions don't match `sphinx==4.5.0`. Refer to the issue: sphinx-doc/sphinx#11890.
    
    Then, when they execute the `make html` command to build the PySpark docs, the following error appears:
    <img width="1211" alt="image" src="https://github.com/apache/spark/assets/16032294/719c4b1d-9b7d-4ba9-89c5-ec3c0dc4572f">
    
    This problem has been avoided through pinning `sphinxcontrib-*` in workflows of Spark GA:
    ![image](https://github.com/apache/spark/assets/16032294/bf4906f1-a76d-47bd-af42-f263537f371c)
    
    So we can do the same by setting the upper bound versions in `dev/requirements.txt`, which will be helpful for Spark developers when building the PySpark docs.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47270 from wayneguow/py_require.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    ceb9dc5 View commit details
    Browse the repository at this point in the history
  36. [SPARK-48854][DOCS] Add missing options in CSV documentation

    ### What changes were proposed in this pull request?
    
    This PR added documents for missing CSV options, including `delimiter` as an alternative to `sep`, `charset` as an alternative to `encoding`, `codec` as an alternative to `compression`, and `timeZone`, excluding `columnPruning` which falls back to an internal SQL config.
    
    ### Why are the changes needed?
    
    improvement for user guide
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    doc build
    
    ![image](https://github.com/apache/spark/assets/8326978/d8ff888b-cafa-44e6-ab74-7bf69702a267)
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47278 from yaooqinn/SPARK-48854.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f3804c0 View commit details
    Browse the repository at this point in the history
  37. [SPARK-48843] Prevent infinite loop with BindParameters

    ### What changes were proposed in this pull request?
    
    In order to resolve the named parameters on the subtree, BindParameters recurses into the subtrees and tries to match the pattern with the named parameters. If there is no named parameter at the current level, the rule tries to return the unchanged plan. However, instead of returning the current plan object, the rule always returns the captured root plan node, leading to infinite recursion.
    
    ### Why are the changes needed?
    
    This fixes an infinite recursion when named parameters are combined with a global limit.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Added unit tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47271 from nemanja-boric-databricks/fix-bind.
    
    Lead-authored-by: Nemanja Boric <[email protected]>
    Co-authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    06be01f View commit details
    Browse the repository at this point in the history
  38. [SPARK-48791][CORE] Fix perf regression caused by the accumulators re…

    …gistration overhead using CopyOnWriteArrayList
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to use the `ArrayBuffer` together with the read/write lock rather than `CopyOnWriteArrayList` for `TaskMetrics._externalAccums`.
    
    ### Why are the changes needed?
    
    Fix the perf regression caused by the accumulator registration overhead of `CopyOnWriteArrayList`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Manually tested.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47197 from Ngone51/SPARK-48791.
    
    Authored-by: Yi Wu <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    Ngone51 authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    1691bbd View commit details
    Browse the repository at this point in the history
  39. [SPARK-48857][SQL] Restrict charsets in CSVOptions

    ### What changes were proposed in this pull request?
    
    SPARK-46115 and SPARK-46220 started the work of building a consistent charset list for Spark; this PR brings it to CSV options.
    
    ### Why are the changes needed?
    
    To make the charset list consistent across different platforms/JDKs
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, a `legacyCharsets` fallback is provided.
    
    ### How was this patch tested?
    
    new tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47280 from yaooqinn/SPARK-48857.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    7092147 View commit details
    Browse the repository at this point in the history
  40. [SPARK-48855][K8S][TESTS] Make ExecutorPodsAllocatorSuite independe…

    …nt from default allocation batch size
    
    ### What changes were proposed in this pull request?
    
    This PR aims to make `ExecutorPodsAllocatorSuite` independent from default allocation batch size.
    
    ### Why are the changes needed?
    
    To make the test assumptions explicit.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47279 from dongjoon-hyun/SPARK-48855.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    11fee3b View commit details
    Browse the repository at this point in the history
  41. [MINOR][DOCS] Add example to countDistinct

    ### What changes were proposed in this pull request?
    This PR adds an example to the `countDistinct` docstring demonstrating that `count_distinct` and `countDistinct` provide the same functionality.
    
    ### Why are the changes needed?
    To improve the documentation.
    
    ### Does this PR introduce any user-facing change?
    No changes in behavior are introduced.
    
    ### How was this patch tested?
    Existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47235 from thomhart31/docs-countDistinct.
    
    Lead-authored-by: thomas.hart <[email protected]>
    Co-authored-by: Thomas Hart <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    5b84f9a View commit details
    Browse the repository at this point in the history
  42. [SPARK-48823][DOCS] Improve clarity in lag docstring

    ### What changes were proposed in this pull request?
    This PR edits grammar in `pyspark.sql.functions.lag` docstring.
    
    ### Why are the changes needed?
    To improve the documentation.
    
    ### Does this PR introduce any user-facing change?
    No changes in behavior are introduced.
    
    ### How was this patch tested?
    Existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47236 from thomhart31/docs-lag.
    
    Authored-by: thomas.hart <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    thomash-dbx authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    ecff815 View commit details
    Browse the repository at this point in the history
  43. [SPARK-48844][SQL] USE INVALID_EMPTY_LOCATION instead of UNSUPPORTED_…

    …DATASOURCE_FOR_DIRECT_QUERY when path is empty
    
    ### What changes were proposed in this pull request?
    
    When running SQL on valid datasource files directly, if the given path is an empty string, we currently report UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY, which claims the datasource is invalid. The reason is that the `hadoop.Path` class cannot be constructed with empty strings, and we wrap the resulting `IAE` (IllegalArgumentException) with UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY.
    
    In this PR, we check the path up front to avoid this ambiguous error message.
    
    ### Why are the changes needed?
    
    Trivial bugfix. This error rarely occurs in REPL environments, but it can still happen when the query is built with string interpolation.
    
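    A hypothetical reproduction of the string-interpolation case, assuming an active SparkSession `spark`:

    ```python
    # An empty path sneaks in via string interpolation; after this change the error
    # class is INVALID_EMPTY_LOCATION instead of UNSUPPORTED_DATASOURCE_FOR_DIRECT_QUERY.
    path = ""  # e.g. a variable that was accidentally left empty
    spark.sql(f"SELECT * FROM parquet.`{path}`")
    ```
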
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, different error class
    
    ### How was this patch tested?
    
    new tests
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47267 from yaooqinn/SPARK-48844.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    4d9e016 View commit details
    Browse the repository at this point in the history
  44. Revert "[SPARK-48823][DOCS] Improve clarity in lag docstring"

    This reverts commit 8ca1822.
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    dca78f1 View commit details
    Browse the repository at this point in the history
  45. [SPARK-48763][CONNECT][BUILD] Move connect server and common to built…

    …in module
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to move the connect server to builtin module.
    
    From:
    
    ```
    connector/connect/server
    connector/connect/common
    ```
    
    To:
    
    ```
    connect/server
    connect/common
    ```
    
    ### Why are the changes needed?
    
    So the end users do not have to specify `--packages` when they start the Spark Connect server. Spark Connect client remains as a separate module. This was also pointed out in apache#39928 (comment).
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, users don't have to specify `--packages` anymore.
    
    ### How was this patch tested?
    
    CI in this PR should verify them.
    Also manually tested several basic commands such as:
    
    - Maven build
    - SBT build
    - Running basic Scala client commands
       ```bash
       cd connector/connect
       bin/spark-connect
       bin/spark-connect-scala-client
       ```
    - Running basic PySpark client commands
    
       ```bash
       bin/pyspark --remote local
       ```
    - Connecting to the server launched by `./sbin/start-connect-server.sh`
    
       ```bash
       ./sbin/start-connect-server.sh
        bin/pyspark --remote "sc://localhost"
       ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47157 from HyukjinKwon/move-connect-server-builtin.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    7ac02e3 View commit details
    Browse the repository at this point in the history
  46. [SPARK-48860][TESTS] Update ui-test to use ws 8.18.0

    ### What changes were proposed in this pull request?
    
    This is a test dependency update to use `ws` 8.18.0.
    
    ### Why are the changes needed?
    
    Although the Apache Spark binary is not affected by this, this PR aims to resolve the alert below, which recommends `ws` versions 8.17.1+.
    
    - https://github.com/apache/spark/security/dependabot/95
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs with the new dependency.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47287 from dongjoon-hyun/SPARK-48860.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    6c16052 View commit details
    Browse the repository at this point in the history
  47. [SPARK-48862][PYTHON][CONNECT] Avoid calling _proto_to_string when …

    …INFO level is not enabled
    
    ### What changes were proposed in this pull request?
    
    Avoid calling `_proto_to_string` when INFO level is not enabled.
    
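    A minimal sketch of the guard pattern, with illustrative names rather than the actual PySpark internals:

    ```python
    import logging

    logger = logging.getLogger("pyspark.sql.connect")

    def proto_to_string_stub(plan) -> str:
        # stand-in for the costly text rendering of the proto plan
        return str(plan)

    def log_plan(plan) -> None:
        # only pay the rendering cost when the message will actually be emitted
        if logger.isEnabledFor(logging.INFO):
            logger.info("Executing plan:\n%s", proto_to_string_stub(plan))
    ```
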
    ### Why are the changes needed?
    
    We should avoid `_proto_to_string` as it takes a long time, and the result is not used if INFO level is not enabled.
    
    E.g.,
    
    ```py
    from functools import reduce
    
    df = createDataFrame()
    def project_schema(n=100):
        return reduce(lambda df, _: df.select(F.col("a"), F.col("b"), F.col("c"), F.col("d")), range(n), df).schema
    
    profile(project_schema)
    ```
    
    <img width="1104" alt="Screenshot 2024-07-10 at 17 24 18" src="https://github.com/apache/spark/assets/506656/66c2f50e-13b8-43f0-a46c-dcad4e7bfe89">
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    The existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47289 from ueshin/issues/SPARK-48862/logging.
    
    Authored-by: Takuya Ueshin <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    ueshin authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f5524e0 View commit details
    Browse the repository at this point in the history
  48. [SPARK-48459][CONNECT][PYTHON][FOLLOWUP] Ignore to_plan from with_origin

    ### What changes were proposed in this pull request?
    
    Ignores `connect.Column.to_plan` from `with_origin`.
    
    ### Why are the changes needed?
    
    Capturing the call site on `connect.Column.to_plan` takes a long time when creating proto plans if there are many `connect.Column` objects, even though the call sites on `connect.Column.to_plan` are not needed.
    
    E.g.,
    
    ```py
    from pyspark.sql import functions as F
    
    df = createDataFrame()
    def schema():
        return df.select(*([F.col("a"), F.col("b"), F.col("c"), F.col("d")] * 10)).schema
    
    profile(schema)
    ```
    
    <img width="1109" alt="Screenshot 2024-07-10 at 13 40 33" src="https://github.com/apache/spark/assets/506656/776978ce-bef9-47ef-b4a5-0d206683736d">
    
    The total function calls / duration for this is:
    
    - before
    
    ```
    28393570 function calls (28381720 primitive calls) in 3.450 seconds
    ```
    
    - after
    
    ```
    109970 function calls (98120 primitive calls) in 0.184 seconds
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    The existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47284 from ueshin/issues/SPARK-48459/query_context.
    
    Authored-by: Takuya Ueshin <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    ueshin authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    c88a8cd View commit details
    Browse the repository at this point in the history
  49. [SPARK-48763][FOLLOWUP] Make dev/lint-scala error message more accu…

    …rate
    
    ### What changes were proposed in this pull request?
    This PR follows up apache#47157 to make the `dev/lint-scala` error message more accurate.
    
    ### Why are the changes needed?
    After moving from `connector/connect/server` and `connector/connect/common` to `connect/server` and `connect/common`, the error message in `dev/lint-scala` should be updated accordingly.
    
    eg:
    <img width="709" alt="image" src="https://github.com/apache/spark/assets/15246973/d749e371-7621-4063-b512-279d0690d573">
    <img width="772" alt="image" src="https://github.com/apache/spark/assets/15246973/44b80571-bdb6-40cb-9571-8b34d009b5f8">
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    Manually test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47291 from panbingkun/SPARK-48763_FOLLOWUP.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    34707d8 View commit details
    Browse the repository at this point in the history
  50. [SPARK-48726][SS] Create the StateSchemaV3 file format, and write thi…

    …s out for the TransformWithStateExec operator
    
    ### What changes were proposed in this pull request?
    
    In this PR, we introduce the `StateSchemaV3` file that is used to keep track of a list of `ColumnFamilySchema` which we write from the `TransformWithState` operator. We collect the Column Family schemas from the driver, and write them out as a part of a planning rule.
    
    We will be introducing the OperatorStateMetadataV2 in the following PR: apache#47273
    This will integrate with the TransformWithState operator, and rely on the schema file.
    
    ### Why are the changes needed?
    
    These changes are needed to enable schema evolution for this operator in the future.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Added unit tests and ran existing unit tests
    ```
    [info] Run completed in 11 seconds, 673 milliseconds.
    [info] Total number of tests run: 4
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    [success] Total time: 43 s, completed Jun 26, 2024, 10:38:35 AM
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47104 from ericm-db/state-schema-tws.
    
    Lead-authored-by: Eric Marnadi <[email protected]>
    Co-authored-by: Eric Marnadi <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    7d9cab2 View commit details
    Browse the repository at this point in the history
  51. [SPARK-48529][SQL] Introduction of Labels in SQL Scripting

    ### What changes were proposed in this pull request?
    Previous [PR1](apache#46665) and [PR2](apache#46665) introduced parser and interpreter changes for SQL Scripting. This PR is a follow-up to introduce the concept of labels for the SQL Scripting language and proposes the following changes:
    
    - Changes the grammar to support labels at the start and end of compound statements (see the sketch below).
    - Updates visitor functions for compound nodes in the syntax tree in AstBuilder to check if labels are present and valid.
    
    More details can be found in [Jira item](https://issues.apache.org/jira/browse/SPARK-48529) for this task and its parent (where the design doc is uploaded as well).
    
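    A rough sketch of the labeled-compound syntax referenced above; the label style shown follows the common SQL/PSM convention (label before BEGIN, repeated after END), and the exact Spark grammar may differ:

    ```python
    # Illustrative only: a SQL script text with a labeled compound statement.
    # Executing scripts through sql() is future work per the description above.
    script = """
    lbl: BEGIN
      SELECT 1;
      SELECT 2;
    END lbl
    """
    print(script)
    ```
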
    ### Why are the changes needed?
    The intent is to add support for various SQL scripting concepts like loops, leave & iterate statements.
    
    ### Does this PR introduce any user-facing change?
    No.
    This PR is among first PRs in series of PRs that will introduce changes to sql() API to add support for SQL scripting, but for now, the API remains unchanged.
    In the future, the API will remain the same as well, but it will have new possibility to execute SQL scripts.
    
    ### How was this patch tested?
    There are tests for newly introduced parser changes:
    
    SqlScriptingParserSuite - unit tests for execution nodes.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47146 from miland-db/sql_batch_labels.
    
    Lead-authored-by: David Milicevic <[email protected]>
    Co-authored-by: Milan Dankovic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    66c71f0 View commit details
    Browse the repository at this point in the history
  52. [SPARK-48858][PYTHON] Remove deprecated setDaemon method call of `T…

    …hread` in `log_communication.py`
    
    ### What changes were proposed in this pull request?
    
    This PR aims to remove the deprecated `setDaemon` method call of `Thread` in `log_communication.py`. This is the last remaining usage.
    
    ### Why are the changes needed?
    
    Clean up deprecated APIs.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47282 from wayneguow/remove_py_dep.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    37db253 View commit details
    Browse the repository at this point in the history
  53. [SPARK-48280][SQL][FOLLOWUP] Improve collation testing surface area u…

    …sing expression walking
    
    ### What changes were proposed in this pull request?
    Followup: small correction.
    
    ### Why are the changes needed?
    UTF8_BINARY_LCASE no longer exists in Spark.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47216 from uros-db/fix-walker.
    
    Authored-by: Uros Bojanic <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    uros-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    48f5ae3 View commit details
    Browse the repository at this point in the history
  54. [SPARK-48851][SQL] Change the value of SCHEMA_NOT_FOUND from `names…

    …pace` to `catalog.namespace`
    
    ### What changes were proposed in this pull request?
    The pr aims to change the value of `SCHEMA_NOT_FOUND` from `namespace` to `catalog.namespace`.
    
    ### Why are the changes needed?
    As discussed in apache#47038 (comment), we should provide a friendlier and clearer error message.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Update existed UT & Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47276 from panbingkun/db_with_catalog.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b503837 View commit details
    Browse the repository at this point in the history
  55. [SPARK-48863][SQL] Fix ClassCastException when parsing JSON with "spa…

    …rk.sql.json.enablePartialResults" enabled
    
    ### What changes were proposed in this pull request?
    
    This PR fixes a bug in a corner case of JSON parsing when `spark.sql.json.enablePartialResults` is enabled.
    
    When running the following query with the config set to true:
    ```
    select from_json('{"a":"b","c":"d"}', 'array<struct<a:string, c:int>>')
    ```
    the code would fail with
    ```
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure:
    Lost task 0.0 in stage 4.0 (TID 4) (ip-10-110-51-101.us-west-2.compute.internal executor driver):
    java.lang.ClassCastException: class org.apache.spark.unsafe.types.UTF8String cannot be cast to class
    org.apache.spark.sql.catalyst.util.ArrayData (org.apache.spark.unsafe.types.UTF8String and
    org.apache.spark.sql.catalyst.util.ArrayData are in unnamed module of loader 'app')
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:53)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:53)
        at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:172)
        at org.apache.spark.sql.catalyst.expressions.JsonToStructs.$anonfun$converter$2(jsonExpressions.scala:831)
        at org.apache.spark.sql.catalyst.expressions.JsonToStructs.nullSafeEval(jsonExpressions.scala:893)
    ```
    
    The patch fixes the issue by re-throwing PartialArrayDataResultException if parsing fails in this special case.
    
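    A hypothetical PySpark reproduction of the scenario, assuming an active SparkSession `spark` (the config name comes from the description above):

    ```python
    # Before the fix this raised ClassCastException; after it, the value is
    # partially parsed (e.g. Array([b, null])) instead of failing the task.
    spark.conf.set("spark.sql.json.enablePartialResults", "true")
    spark.sql(
        """select from_json('{"a":"b","c":"d"}', 'array<struct<a:string, c:int>>')"""
    ).show(truncate=False)
    ```
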
    ### Why are the changes needed?
    
    Fixes the bug that would prevent users from reading objects as arrays as introduced in SPARK-19595. This is more of a special case but it works with the flag off so it would be good to fix it when the flag is on.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, but it is a bug fix so it would not have worked without this patch overall.
    The parsing output will be different due to the partial results improvement:
    
    Previously, we would get `null` (when partial results are disabled). With this patch and partial results enabled, this will return `Array([b, null])`. This is not specific to this patch but rather to the partial results feature in general.
    
    ### How was this patch tested?
    
    I added a unit test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47292 from sadikovi/SPARK-48863.
    
    Authored-by: Ivan Sadikov <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    sadikovi authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    6d4d5ba View commit details
    Browse the repository at this point in the history
  56. [SPARK-48866][SQL] Fix hints of valid charset in the error message of…

    … INVALID_PARAMETER_VALUE.CHARSET
    
    ### What changes were proposed in this pull request?
    
    This PR fixes the hints in the error message of INVALID_PARAMETER_VALUE.CHARSET. The current error message does not enumerate all valid charsets; e.g., UTF-32 is missing.
    
    This PR parameterizes it to fix this issue.
    
    ### Why are the changes needed?
    Bugfix; a hint with missing charsets is not helpful.
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, the error message changes.
    
    ### How was this patch tested?
    modified tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47295 from yaooqinn/SPARK-48866.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    278f173 View commit details
    Browse the repository at this point in the history
  57. [SPARK-48773] Document config "spark.default.parallelism" by config b…

    …uilder framework
    
    ### What changes were proposed in this pull request?
    
    Document the config `spark.default.parallelism`. This config is used by Spark but was not declared through the config builder framework. It is already documented on the Spark website: https://spark.apache.org/docs/latest/configuration.html.
    
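    For reference, a small illustration of where this config takes effect (the values are illustrative):

    ```python
    # "spark.default.parallelism" controls the default number of partitions for
    # RDD operations when no partition count is given explicitly.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[4]")
        .config("spark.default.parallelism", "8")
        .getOrCreate()
    )
    print(spark.sparkContext.defaultParallelism)  # 8 with the config above
    ```
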
    ### Why are the changes needed?
    
    Document Spark's config.
    
    ### Does this PR introduce _any_ user-facing change?
    
    NO.
    
    ### How was this patch tested?
    
    N/A
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    N/A
    
    Closes apache#47171 from amaliujia/document_spark_default_paramllel.
    
    Authored-by: Rui Wang <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    amaliujia authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    2f96a37 View commit details
    Browse the repository at this point in the history
  58. [SPARK-48793][SQL][TESTS][S] Unify v1 and v2 ALTER TABLE .. DROP|RE…

    …NAME` COLUMN ...` tests
    
    ### What changes were proposed in this pull request?
    The pr aims to:
    - Move parser tests from `o.a.s.s.c.p.DDLParserSuite` and `o.a.s.s.c.p.ErrorParserSuite` to `AlterTableRenameColumnParserSuite` & `AlterTableDropColumnParserSuite`
    - Add a test for DSv2 ALTER TABLE .. `DROP|RENAME` to `v2.AlterTableDropColumnSuite` & `v2.AlterTableRenameColumnSuite`
    
    (This PR includes the unification of two commands: `DROP COLUMN` & `RENAME COLUMN`)
    
    ### Why are the changes needed?
    - To improve test coverage.
    - Align with other similar tests, eg: AlterTableRename*
    
    ### Does this PR introduce _any_ user-facing change?
    No, only tests.
    
    ### How was this patch tested?
    - Add new UT
    - Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47199 from panbingkun/alter_table_drop_column.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    3877b99 View commit details
    Browse the repository at this point in the history
  59. [SPARK-46738][PYTHON] Reenable a group of doctests

    ### What changes were proposed in this pull request?
    The `cast` issue has been resolved in apache#47249, so we can re-enable a group of doctests.
    
    ### Why are the changes needed?
    test coverage
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47302 from zhengruifeng/enable_more_doctest.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b73bbf3 View commit details
    Browse the repository at this point in the history
  60. [MINOR][SQL][TESTS] Remove a duplicate test case in CSVExprUtilsSuite

    ### What changes were proposed in this pull request?
    
    This PR aims to remove a duplicate test case in `CSVExprUtilsSuite`.
    
    ### Why are the changes needed?
    
    Clean duplicate code.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47298 from wayneguow/csv_suite.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    03aa592 View commit details
    Browse the repository at this point in the history
  61. [SPARK-48775][SQL][STS] Replace SQLContext with SparkSession in STS

    ### What changes were proposed in this pull request?
    
    Remove the exposed `SQLContext` which was added in SPARK-46575, and migrate STS-internal usages of `SQLContext` to `SparkSession`.
    
    ### Why are the changes needed?
    
    `SQLContext` has not been recommended since Spark 2.0; the suggested replacement is `SparkSession`. We should avoid exposing the deprecated class in the Developer API in new versions.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No. It touches the Developer API added in SPARK-46575, but that has not been released yet.
    
    ### How was this patch tested?
    
    Pass GHA and `dev/mima` (no breaking changes involved).
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47176 from pan3793/SPARK-48775.
    
    Lead-authored-by: Cheng Pan <[email protected]>
    Co-authored-by: Cheng Pan <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    9b5c00b View commit details
    Browse the repository at this point in the history
  62. [SPARK-48623][CORE] Structured logging migrations [Part 3]

    ### What changes were proposed in this pull request?
    This PR makes additional Scala logging migrations to comply with the scala style changes in apache#46947
    
    ### Why are the changes needed?
    This makes development and PR review of the structured logging migration easier.
    
    ### Does this PR introduce any user-facing change?
    No
    
    ### How was this patch tested?
    Tested by ensuring dev/scalastyle checks pass
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47275 from asl3/formatstructuredlogmigrations.
    
    Lead-authored-by: Amanda Liu <[email protected]>
    Co-authored-by: Gengliang Wang <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    47e1d96 View commit details
    Browse the repository at this point in the history
  63. [SPARK-48850][DOCS][SS][SQL] Add documentation for new options added …

    …to State Data Source
    
    ### What changes were proposed in this pull request?
    
    In apache#46944 and apache#47188, we introduced some new options to the State Data Source. This PR aims to explain these new features in the documentation.
    
    ### Why are the changes needed?
    
    It is necessary to reflect the latest change in the documentation website.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    The API Doc website can be rendered correctly.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47274 from eason-yuchen-liu/snapshot-doc.
    
    Authored-by: Yuchen Liu <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    eason-yuchen-liu authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    16465db View commit details
    Browse the repository at this point in the history
  64. [SPARK-48717][FOLLOWUP][PYTHON][SS] Catch Cancelled Job Group wrapped…

    … by Py4JJavaError in StreamExecution
    
    ### What changes were proposed in this pull request?
    
    The previous commit apache@1581264 doesn't capture the situation when a job group is cancelled. This patches that situation.
    
    ### Why are the changes needed?
    
    Bug fix. Without this change, calling query.stop() would sometimes (when there is a Python foreachBatch function and this error is thrown) cause the query to appear as failed.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Added unit test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47307 from WweiL/SPARK-48717-job-cancel.
    
    Authored-by: Wei Liu <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    WweiL authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    c235bf5 View commit details
    Browse the repository at this point in the history
  65. [SPARK-48852][CONNECT] Fix string trim function in connect

    ### What changes were proposed in this pull request?
    
    Changed the order of arguments passed in the connect client's trim function call to match [`sql/core/src/main/scala/org/apache/spark/sql/functions.scala`](https://github.com/apache/spark/blob/f2dd0b3338a6937bbfbea6cd5fffb2bf9992a1f3/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L4322)
    
    ### Why are the changes needed?
    
    This change fixes a correctness bug in Spark Connect where a query to trim characters `s` from a column would instead be replaced by a substring of `s`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Updated golden files for [`/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala`](https://github.com/apache/spark/blob/f2dd0b3338a6937bbfbea6cd5fffb2bf9992a1f3/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala#L1815) and added an additional test to verify correctness.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47277 from biruktesf-db/fix-trim-connect.
    
    Authored-by: Biruk Tesfaye <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    biruktesf-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    8115e6e View commit details
    Browse the repository at this point in the history
  66. [SPARK-48872][PYTHON] Reduce the overhead of _capture_call_site

    ### What changes were proposed in this pull request?
    
    Reduces the overhead of `inspect.stack` in `_capture_call_site` by inlining `inspect.stack` using a generator instead of a list.
    Also, specify `context=0` for `inspect.getframeinfo` to avoid unnecessary field retrievals.
    
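    A minimal sketch of the idea (illustrative, not the actual PySpark code):

    ```python
    # Walk the frame chain lazily instead of materializing inspect.stack(), and
    # pass context=0 so getframeinfo() does not read source lines for each frame.
    import inspect

    def iter_call_sites(max_depth: int = 10):
        frame = inspect.currentframe().f_back
        depth = 0
        while frame is not None and depth < max_depth:
            info = inspect.getframeinfo(frame, context=0)
            yield f"{info.filename}:{info.lineno}"
            frame = frame.f_back
            depth += 1
    ```
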
    ### Why are the changes needed?
    
    `_capture_call_site` has unavoidable overhead when `Column` operations happen a lot, but it can be reduced.
    
    E.g.,
    
    ```py
    from functools import reduce
    
    def alias_schema():
        return df.select(reduce(lambda x, y: x.alias(f"col_a_{y}"), range(20), F.col("a"))).schema
    profile(alias_schema)
    ```
    
    <img width="1106" alt="Screenshot 2024-07-11 at 15 24 31" src="https://github.com/user-attachments/assets/1c677f56-86be-4e8f-9dd2-45c4c2c167f3">
    
    The function calls and duration are:
    
    - before
    
    ```
            18013240 function calls (18012760 primitive calls) in 2.327 seconds
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    ...
          200    0.001    0.000    2.231    0.011 /.../python/pyspark/errors/utils.py:164(_capture_call_site)
    
    ```
    
    - after
    
    ```
             1421240 function calls (1420760 primitive calls) in 0.265 seconds
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    ...
          200    0.001    0.000    0.182    0.001 /.../python/pyspark/errors/utils.py:165(_capture_call_site)
    
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    The existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47308 from ueshin/issues/SPARK-48872/inspect_stack.
    
    Authored-by: Takuya Ueshin <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    ueshin authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f890152 View commit details
    Browse the repository at this point in the history
  67. [SPARK-46743][SQL][FOLLOW UP] Count bug after ScalarSubquery is folded…

    … if it has an empty relation
    
    ### What changes were proposed in this pull request?
    
    In PR apache#45125, we handled the case where an Aggregate is folded into a Project, causing a count bug. We missed cases where:
    1. The entire ScalarSubquery's plan is regarded as empty relation, and is folded completely.
    2. There are operations above the Aggregate in the subquery (such as filter and project).
    
    ### Why are the changes needed?
    
    This PR fixes that by adding the case handling in ConstantFolding and OptimizeSubqueries.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. There was a correctness error which happened when the scalar subquery was count-bug-susceptible and empty, and thus folded by `ConstantFolding`.
    
    ### How was this patch tested?
    
    Added SQL query tests in `scalar-subquery-count-bug.sql`.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47290 from andylam-db/decorr-bugs.
    
    Authored-by: Andy Lam <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    andylam-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    5bdb3f6 View commit details
    Browse the repository at this point in the history
  68. [SPARK-48841][SQL] Include collationName to sql() of Collate

    ### What changes were proposed in this pull request?
    In the PR, I propose to fix the `sql()` method of the `Collate` expression, and append the `collationName` clause.
    
    ### Why are the changes needed?
    To distinguish column names when the `collationName` argument is used by `collate`. Before the changes, columns might conflict like the example below, and that could confuse users:
    ```
    sql("CREATE TEMP VIEW tbl as (SELECT collate('A', 'UTF8_BINARY'), collate('A', 'UTF8_LCASE'))")
    ```
    - Before:
    ```
    [COLUMN_ALREADY_EXISTS] The column `collate(a)` already exists. Choose another name or rename the existing column. SQLSTATE: 42711
    org.apache.spark.sql.AnalysisException: [COLUMN_ALREADY_EXISTS] The column `collate(a)` already exists. Choose another name or rename the existing column. SQLSTATE: 42711
    	at org.apache.spark.sql.errors.QueryCompilationErrors$.columnAlreadyExistsError(QueryCompilationErrors.scala:2595)
    	at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:115)
    	at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:97)
    ```
    
    - After:
    ```
    describe extended tbl;
    +-----------------------+-------------------------+-------+
    |col_name               |data_type                |comment|
    +-----------------------+-------------------------+-------+
    |collate(A, UTF8_BINARY)|string                   |NULL   |
    |collate(A, UTF8_LCASE) |string collate UTF8_LCASE|NULL   |
    +-----------------------+-------------------------+-------+
    
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    Should not.
    
    ### How was this patch tested?
    Update existed UT.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47265 from panbingkun/SPARK-48841.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    a856b47 View commit details
    Browse the repository at this point in the history
  69. [SPARK-46654][DOCS][FOLLOW-UP] Remove obsolete TODO item

    ### What changes were proposed in this pull request?
    Remove obsolete TODO item
    
    ### Why are the changes needed?
    The `Example 2` test has already been enabled.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47312 from zhengruifeng/simple_folloup.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    9316401 View commit details
    Browse the repository at this point in the history
  70. [SPARK-48760][SQL] Fix CatalogV2Util.applyClusterByChanges

    ### What changes were proposed in this pull request?
    
    apache#47156 introduced a bug in `CatalogV2Util.applyClusterByChanges`: it removes the existing `ClusterByTransform` first, regardless of whether there is a `ClusterBy` table change. This means any table change will remove the clustering columns from the table.
    
    This PR fixes the bug by removing the `ClusterByTransform` only when there is a `ClusterBy` table change.
    
    ### Why are the changes needed?
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    ### How was this patch tested?
    
    Amend existing test to catch this bug.
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47288 from zedtang/fix-apply-cluster-by-changes.
    
    Authored-by: Jiaheng Tang <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    zedtang authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    226758d View commit details
    Browse the repository at this point in the history
  71. [SPARK-48874][SQL][DOCKER][BUILD][TESTS] Upgrade MySQL docker image…

    … version to `9.0.0`
    
    ### What changes were proposed in this pull request?
    The PR aims to upgrade the `MySQL` docker image version from `8.4.0` to `9.0.0`.
    
    ### Why are the changes needed?
    After https://issues.apache.org/jira/browse/SPARK-48795, we have upgraded the `mysql jdbc driver` version to `9.0.0` for testing, so I propose that the corresponding `mysql server docker image` should also be upgraded to `9.0.0`
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47311 from panbingkun/mysql_image_9.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    1b63009 View commit details
    Browse the repository at this point in the history
  72. Configuration menu
    Copy the full SHA
    fd34224 View commit details
    Browse the repository at this point in the history
  73. Revert "Remove unused test jar (apache#47309)"

    This reverts commit b560e4e.
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    a4a20e0 View commit details
    Browse the repository at this point in the history
  74. [SPARK-48842][DOCS] Document non-determinism of max_by and min_by

    ### What changes were proposed in this pull request?
    Document non-determinism of max_by and min_by
    
    ### Why are the changes needed?
    I have been confused by this non-determinism twice; it looked like a correctness bug to me.
    So I think we need to document it.
    
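    A small illustration of the caveat being documented, assuming an active SparkSession `spark`:

    ```python
    # When several rows tie on the ordering column, max_by may return any of the
    # associated values, so the result is not deterministic across runs or plans.
    df = spark.createDataFrame([("a", 10), ("b", 10)], "name STRING, score INT")
    df.selectExpr("max_by(name, score)").show()  # may show either 'a' or 'b'
    ```
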
    ### Does this PR introduce _any_ user-facing change?
    doc change only
    
    ### How was this patch tested?
    ci
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47266 from zhengruifeng/py_doc_max_by.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    c57f87f View commit details
    Browse the repository at this point in the history
  75. [SPARK-48871] Fix INVALID_NON_DETERMINISTIC_EXPRESSIONS validation in…

    … CheckAnalysis
    
    ### What changes were proposed in this pull request?
    
    The PR adds a trait that logical plans can extend to implement a method deciding whether non-deterministic expressions are allowed for the operator, and checks this method in CheckAnalysis.
    
    ### Why are the changes needed?
    
    I encountered the `INVALID_NON_DETERMINISTIC_EXPRESSIONS` exception when attempting to use a non-deterministic UDF in my query. The non-deterministic expression can be safely allowed for my custom LogicalPlan, but it is rejected in the checkAnalysis phase. The CheckAnalysis rule is too strict, so reasonable use cases of non-deterministic expressions are also disallowed.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    The test case `"SPARK-48871: AllowsNonDeterministicExpression allow lists non-deterministic expressions"` is added.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47304 from zhipengmao-db/zhipengmao-db/SPARK-48871-check-analysis.
    
    Lead-authored-by: zhipeng.mao <[email protected]>
    Co-authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    e42ea5c View commit details
    Browse the repository at this point in the history
  76. [SPARK-48864][SQL][TESTS] Refactor HiveQuerySuite and fix bug

    ### What changes were proposed in this pull request?
    The PR aims to refactor `HiveQuerySuite` and fix bugs, including:
    - use `getWorkspaceFilePath` to enable `HiveQuerySuite` to run successfully in the IDE.
    - make the test `lookup hive UDF in another thread` independent, without relying on the previous UT `current_database with multiple sessions`.
    - enable two tests: `non-boolean conditions in a CaseWhen are illegal` and `Dynamic partition folder layout`.
    
    ### Why are the changes needed?
    - Run successfully in the `IDE`
      Before:
      <img width="1288" alt="image" src="https://github.com/apache/spark/assets/15246973/005fd49c-3edf-4e51-8223-097fd7a485bf">
    
      After:
      <img width="1276" alt="image" src="https://github.com/apache/spark/assets/15246973/caedec72-be0c-4bb5-bc06-26cceef8b4b8">
    
    - Make the UT `lookup hive UDF in another thread` independent
      When running only this test, it actually failed with the following error:
      <img width="1318" alt="image" src="https://github.com/apache/spark/assets/15246973/ef9c260f-8c0d-4821-8233-d4d7ae13802a">
    
      **Why?**
      Because the previous UT `current_database with multiple sessions` changed the current database and did not restore it after it finished running.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    - Manually test
    - Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47293 from panbingkun/refactor_HiveQuerySuite.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    6437433 View commit details
    Browse the repository at this point in the history
  77. [SPARK-48877][PYTHON][DOCS] Test the default column name of array fun…

    …ctions
    
    ### What changes were proposed in this pull request?
    Test the default column name of array functions
    
    ### Why are the changes needed?
    For test coverage: sometimes the default column name is a problem.
    
    ### Does this PR introduce _any_ user-facing change?
    doc changes
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47318 from zhengruifeng/py_avoid_alias_array_func.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    05f2614 View commit details
    Browse the repository at this point in the history
  78. [MINOR][TESTS] Remove unused test jar (udf_noA.jar)

    ### What changes were proposed in this pull request?
    
    This jar was added in apache#42069 but moved in apache#43735.
    
    ### Why are the changes needed?
    
    To clean up an unused jar.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing tests should cover this.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47315 from HyukjinKwon/minor-cleanup-jar-2.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Martin Grund <[email protected]>
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b5a78a4 View commit details
    Browse the repository at this point in the history
  79. [SPARK-48794][CONNECT] df.mergeInto support for Spark Connect (Scala …

    …and Python)
    
    ### What changes were proposed in this pull request?
    
    This PR introduces `df.mergeInto` support for Spark Connect Scala and Python clients.
    
    This work contains four components:
    
    1. New Protobuf messages: command `MergeIntoTableCommand` and expression `MergeAction`.
    2. Spark Connect planner change: translate proto messages into real `MergeIntoCommand`s.
    3. Connect Scala client: a `MergeIntoWriter` that allows users to build merges.
    4. Connect Python client: a `MergeIntoWriter` that allows users to build merges.
    
    Components 3 and 4 are independent of each other. They both depend on Component 1.
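
    A minimal sketch of the builder-style usage from the Scala client. The session `spark`, the target table name, and the builder method names (`whenMatched`, `whenNotMatched`, `merge`) are illustrative assumptions, not necessarily the exact final API:

    ```scala
    import org.apache.spark.sql.functions.col

    // Assumes an active SparkSession `spark` and an existing table `target` with an `id` column.
    val source = spark.range(10).toDF("id")
    source.mergeInto("target", col("target.id") === source.col("id"))
      .whenMatched().updateAll()      // update matched target rows with the source values
      .whenNotMatched().insertAll()   // insert source rows that have no match in the target
      .merge()                        // execute the merge
    ```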
    
    ### Why are the changes needed?
    
    We need to bring the functionality of Spark Connect on par with Classic.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, new DataFrame APIs are introduced.
    
    ### How was this patch tested?
    
    Added new tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#46960 from xupefei/merge-builder.
    
    Authored-by: Paddy Xu <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    xupefei authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    5337e52 View commit details
    Browse the repository at this point in the history
  80. [SPARK-48876][BUILD] Upgrade Guava used by the connect module to 33.2…

    ….1-jre
    
    ### What changes were proposed in this pull request?
    The PR aims to upgrade Guava used by the `connect` module to `33.2.1-jre`.
    
    ### Why are the changes needed?
    The new version brings some fixes and changes, as follows:
    - Changed InetAddress-String conversion methods to preserve the IPv6 scope ID if present. The scope ID can be necessary for IPv6-capable devices with multiple network interfaces.
    - Added HttpHeaders constants Ad-Auction-Allowed, Permissions-Policy-Report-Only, and Sec-GPC.
    - Fixed a potential NullPointerException in ImmutableMap.Builder on a rare code path.
    
    The full release notes:
    - https://github.com/google/guava/releases/tag/v33.2.0
    - https://github.com/google/guava/releases/tag/v33.2.1
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47296 from LuciferYang/connect-guava-33.2.1.
    
    Authored-by: yangjie01 <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    LuciferYang authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    5a03410 View commit details
    Browse the repository at this point in the history
  81. [SPARK-48878][PYTHON][DOCS] Add doctests for options in json functions

    ### What changes were proposed in this pull request?
    Add doctests for `options` in json functions
    
    ### Why are the changes needed?
    Test coverage: we never tested `options` in `from_json` and `to_json` before.

    Since there is a new underlying implementation in Spark Connect, we should test them explicitly.
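
    A small Scala-side sketch of the behaviour these doctests cover: passing reader options to `from_json` (the option name below is a standard JSON data source option; the DataFrame is illustrative):

    ```scala
    // Assumes an active SparkSession `spark`.
    import spark.implicits._
    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val schema = StructType(Seq(StructField("a", StringType)))
    val df = Seq("""{a: "hello"}""").toDF("json")

    // The unquoted field name fails to parse with the defaults; the option makes it parse.
    df.select(from_json(col("json"), schema, Map("allowUnquotedFieldNames" -> "true"))).show()
    ```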
    
    ### Does this PR introduce _any_ user-facing change?
    doc changes
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47319 from zhengruifeng/from_json_option.
    
    Lead-authored-by: Kent Yao <[email protected]>
    Co-authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    4927f63 View commit details
    Browse the repository at this point in the history
  82. [SPARK-48666][SQL] Do not push down filter if it contains PythonUDFs

    ### What changes were proposed in this pull request?
    
    This PR proposes to prevent pushing down filters that contain Python UDFs. It uses the same approach as apache#47033 (therefore the author is added as a co-author) but simplifies the change.
    
    Extracting filters to push down happens first
    
    https://github.com/apache/spark/blob/cbe6846c477bc8b6d94385ddd0097c4e97b05d41/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala#L46
    
    https://github.com/apache/spark/blob/cbe6846c477bc8b6d94385ddd0097c4e97b05d41/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L211
    
    https://github.com/apache/spark/blob/cbe6846c477bc8b6d94385ddd0097c4e97b05d41/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala#L51
    
    Before extracting Python UDFs
    
    https://github.com/apache/spark/blob/cbe6846c477bc8b6d94385ddd0097c4e97b05d41/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkOptimizer.scala#L80
    
    Here is the full stacktrace:
    
    ```
    [INTERNAL_ERROR] Cannot evaluate expression: pyUDF(cast(input[0, bigint, true] as string)) SQLSTATE: XX000
    org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: pyUDF(cast(input[0, bigint, true] as string)) SQLSTATE: XX000
    	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
    	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
    	at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotEvaluateExpressionError(QueryExecutionErrors.scala:65)
    	at org.apache.spark.sql.catalyst.expressions.FoldableUnevaluable.eval(Expression.scala:387)
    	at org.apache.spark.sql.catalyst.expressions.FoldableUnevaluable.eval$(Expression.scala:386)
    	at org.apache.spark.sql.catalyst.expressions.PythonUDF.eval(PythonUDF.scala:72)
    	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:563)
    	at org.apache.spark.sql.catalyst.expressions.IsNotNull.eval(nullExpressions.scala:403)
    	at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:53)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1(ExternalCatalogUtils.scala:189)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$1$adapted(ExternalCatalogUtils.scala:188)
    	at scala.collection.immutable.List.filter(List.scala:516)
    	at scala.collection.immutable.List.filter(List.scala:79)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.prunePartitionsByFilter(ExternalCatalogUtils.scala:188)
    	at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.listPartitionsByFilter(InMemoryCatalog.scala:604)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
    	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:1358)
    	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.listPartitionsByFilter(ExternalCatalogUtils.scala:168)
    	at org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:74)
    	at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:72)
    	at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:50)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470)
    	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:84)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:470)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:37)
    	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:330)
    	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:326)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:475)
    	at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren(TreeNode.scala:1251)
    	at org.apache.spark.sql.catalyst.trees.BinaryLike.mapChildren$(TreeNode.scala:1250)
    	at org.apache.spark.sql.catalyst.plans.logical.Join.mapChildren(basicLogicalOperators.scala:552)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:475)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:37)
    	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:330)
    	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:326)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
    	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
    	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:446)
    	at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:50)
    	at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:35)
    	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:226)
    	at scala.collection.LinearSeqOps.foldLeft(LinearSeq.scala:183)
    	at scala.collection.LinearSeqOps.foldLeft$(LinearSeq.scala:179)
    	at scala.collection.immutable.List.foldLeft(List.scala:79)
    	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:223)
    	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:215)
    	at scala.collection.immutable.List.foreach(List.scala:334)
    	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:215)
    	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:186)
    	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:89)
    	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:186)
    	at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:167)
    	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:138)
    	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:234)
    	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:608)
    	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:234)
    	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)
    	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:233)
    	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:163)
    	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:159)
    	at org.apache.spark.sql.execution.python.PythonUDFSuite.$anonfun$new$19(PythonUDFSuite.scala:136)
    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
    	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)
    	at org.apache.spark.sql.test.SQLTestUtilsBase.withTable(SQLTestUtils.scala:307)
    	at org.apache.spark.sql.test.SQLTestUtilsBase.withTable$(SQLTestUtils.scala:305)
    	at org.apache.spark.sql.execution.python.PythonUDFSuite.withTable(PythonUDFSuite.scala:25)
    	at org.apache.spark.sql.execution.python.PythonUDFSuite.$anonfun$new$18(PythonUDFSuite.scala:130)
    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
    	at org.scalatest.enablers.Timed$$anon$1.timeoutAfter(Timed.scala:127)
    	at org.scalatest.concurrent.TimeLimits$.failAfterImpl(TimeLimits.scala:282)
    	at org.scalatest.concurrent.TimeLimits.failAfter(TimeLimits.scala:231)
    	at org.scalatest.concurrent.TimeLimits.failAfter$(TimeLimits.scala:230)
    	at org.apache.spark.SparkFunSuite.failAfter(SparkFunSuite.scala:69)
    	at org.apache.spark.SparkFunSuite.$anonfun$test$2(SparkFunSuite.scala:155)
    	at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
    	at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
    	at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    	at org.scalatest.Transformer.apply(Transformer.scala:22)
    	at org.scalatest.Transformer.apply(Transformer.scala:20)
    	at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
    	at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:227)
    	at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
    	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
    	at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
    	at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
    	at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
    	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:69)
    	at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
    	at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
    	at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:69)
    	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
    	at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
    	at scala.collection.immutable.List.foreach(List.scala:334)
    	at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
    	at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
    	at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
    	at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
    	at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
    	at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
    	at org.scalatest.Suite.run(Suite.scala:1114)
    	at org.scalatest.Suite.run$(Suite.scala:1096)
    	at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
    	at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
    	at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
    	at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
    	at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
    	at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:69)
    	at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
    	at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
    	at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
    	at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:69)
    	at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:47)
    	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1321)
    	at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1315)
    	at scala.collection.immutable.List.foreach(List.scala:334)
    	at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1315)
    	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:992)
    	at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:970)
    	at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1481)
    	at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:970)
    	at org.scalatest.tools.Runner$.run(Runner.scala:798)
    	at org.scalatest.tools.Runner.run(Runner.scala)
    	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2or3(ScalaTestRunner.java:43)
    	at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:26)
    ```
    
    ### Why are the changes needed?
    
    In order for end users to use Python UDFs against partitioned columns.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, this fixes a bug: this PR allows using Python UDFs against partitioned columns.
    
    ### How was this patch tested?
    
    Unittest added.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47033
    
    Closes apache#47313 from HyukjinKwon/SPARK-48666.
    
    Lead-authored-by: Hyukjin Kwon <[email protected]>
    Co-authored-by: Wei Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    32e5f39 View commit details
    Browse the repository at this point in the history
  83. [SPARK-48845][SQL] GenericUDF catch exceptions from children

    ### What changes were proposed in this pull request?
    This PR tries to fix behaviour issues with GenericUDF that exist since 3.5.0. The problem arose from DeferredObject currently passing a value instead of a function, which prevents users from catching exceptions in GenericUDF, resulting in semantic differences.
    
    Here is an example case we encountered. Originally, the semantics were that udf_exception would throw an exception, while udf_catch_exception could catch the exception and return a null value. However, currently, any exception encountered by udf_exception will cause the program to fail.
    ```
    select udf_catch_exception(udf_exception(col1)) from table
    ```
    
    ### Why are the changes needed?
    Before Spark 3.5, we made GenericUDF's DeferredObject lazy and evaluated the children inside `function.evaluate(deferredObjects)`.
    Now, we run the children's code first; if an exception is thrown, we defer it into GenericUDF's DeferredObject.
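
    A standalone sketch of that idea (no Hive classes; `DeferredObject`, `defer`, and `catchException` here are simplified stand-ins for illustration):

    ```scala
    import scala.util.Try

    // Simplified stand-in for Hive's GenericUDF.DeferredObject.
    trait DeferredObject { def get(): Any }

    // Evaluate the child eagerly, but defer any exception into the DeferredObject so the
    // outer UDF can still catch it when it calls get(), as it could before Spark 3.5.
    def defer(child: => Any): DeferredObject = {
      val result = Try(child)          // run the child now, capturing success or failure
      new DeferredObject {
        def get(): Any = result.get    // rethrow the child's exception only on access
      }
    }

    // Mimics udf_catch_exception: turns a failure from its input into a null value.
    def catchException(input: DeferredObject): Any =
      Try(input.get()).getOrElse(null)

    println(catchException(defer(throw new RuntimeException("boom"))))  // prints null
    ```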
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Newly added UT.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47268 from jackylee-ch/generic_udf_catch_exception_from_child_func.
    
    Lead-authored-by: jackylee-ch <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b95d6bc View commit details
    Browse the repository at this point in the history
  84. [SPARK-47307][SQL] Add a config to optionally chunk base64 strings

    Follow-up of apache#45408.
    
    ### What changes were proposed in this pull request?
    [[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings
    
    ### Why are the changes needed?
    In apache#35110, it was incorrectly asserted that:
    
    > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt
    
    This is not true as the previous code called:
    
    ```java
    public static byte[] encodeBase64(byte[] binaryData)
    ```
    
    Which states:
    
    > Encodes binary data using the base64 algorithm but does not chunk the output.
    
    However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark-encoded base64 strings cannot be decoded by decoders that do not implement RFC 2045. The docs state RFC 4648.
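
    A small JDK illustration of the difference the new config toggles, using the standard `java.util.Base64` encoders (not Spark's internal code path):

    ```scala
    import java.util.Base64

    val data = Array.fill[Byte](100)(1)
    val rfc4648 = Base64.getEncoder.encodeToString(data)      // single line, no chunking
    val rfc2045 = Base64.getMimeEncoder.encodeToString(data)  // MIME: chunked into 76-char lines

    println(rfc4648.contains("\r\n"))  // false
    println(rfc2045.contains("\r\n"))  // true: RFC 2045 output contains CRLF line breaks
    ```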
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Existing test suite.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47303 from wForget/SPARK-47307.
    
    Lead-authored-by: Ted Jenks <[email protected]>
    Co-authored-by: wforget <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Co-authored-by: Ted Chester Jenks <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    3 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    bec6844 View commit details
    Browse the repository at this point in the history
  85. [SPARK-48510][2/2] Support UDAF toColumn API in Spark Connect

    ### What changes were proposed in this pull request?
    
    This PR follows apache#46245 to add support for the `udaf.toColumn` API in Spark Connect.
    
    Here we introduce a new Protobuf message, `proto.TypedAggregateExpression`, that includes a serialized UDF packet. On the server, we unpack it into an `Aggregator` object and generate a real `TypedAggregateExpression` instance with the encoder information passed along with the UDF.
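
    For reference, the public API that this change makes work over Connect is the existing `Aggregator.toColumn`; a minimal usage sketch (assumes an active SparkSession `spark`):

    ```scala
    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    object SumLong extends Aggregator[Long, Long, Long] {
      def zero: Long = 0L
      def reduce(buf: Long, in: Long): Long = buf + in
      def merge(b1: Long, b2: Long): Long = b1 + b2
      def finish(buf: Long): Long = buf
      def bufferEncoder: Encoder[Long] = Encoders.scalaLong
      def outputEncoder: Encoder[Long] = Encoders.scalaLong
    }

    import spark.implicits._
    val ds = spark.range(1, 5).as[Long]   // 1, 2, 3, 4
    ds.select(SumLong.toColumn).show()    // sums to 10
    ```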
    
    ### Why are the changes needed?
    
    Because the `toColumn` API was not supported in the previous PR.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, from now on users can create a typed UDAF using the `udaf.toColumn` API.
    
    ### How was this patch tested?
    
    New tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Nope.
    
    Closes apache#46849 from xupefei/connect-udaf-tocolumn.
    
    Authored-by: Paddy Xu <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    xupefei authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    bc3133c View commit details
    Browse the repository at this point in the history
  86. [SPARK-48883][ML][R] Replace RDD read / write API invocation with Dat…

    …aframe read / write API
    
    ### What changes were proposed in this pull request?
    
    Replace RDD read / write API invocation with Dataframe read / write API
    
    ### Why are the changes needed?
    
    In the Databricks runtime, the RDD read / write API has issues with certain storage types that require the account key, but the DataFrame read / write API works.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47328 from WeichenXu123/ml-df-writer-save-2.
    
    Authored-by: Weichen Xu <[email protected]>
    Signed-off-by: Weichen Xu <[email protected]>
    WeichenXu123 authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    8704625 View commit details
    Browse the repository at this point in the history
  87. [SPARK-48440][SQL] Fix StringTranslate behaviour for non-UTF8_BINARY …

    …collations
    
    ### What changes were proposed in this pull request?
    String searching in UTF8_LCASE now works at the character level, rather than at the byte level. For example: `translate("İ", "i")` now returns `"İ"`, because there exists no **single character** in `"İ"` whose lowercased version equals `"i"`. Note, however, that there _is_ a byte subsequence of `"İ"` whose lowercased UTF-8 form equals `"i"` (so the new behaviour is different from the old behaviour).
    
    Also, translation for ICU collations works by repeatedly translating the longest possible substring that matches a key in the dictionary (under the specified collation), starting from the left side of the input string, until the entire string is translated.
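
    An illustrative SQL snippet of the character-level behaviour (it assumes the `collate` expression and the `UTF8_LCASE` collation name from this line of work; results depend on the collation implementation):

    ```scala
    // Assumes an active SparkSession `spark`.
    // No single character of 'İ' lowercases to 'i', so nothing is replaced and 'İ' is returned.
    spark.sql("SELECT translate(collate('İ', 'UTF8_LCASE'), 'i', 'x')").show()
    ```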
    
    ### Why are the changes needed?
    Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_BINARY_LCASE (see example above).
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, behaviour of `translate` expression is changed for edge cases with one-to-many case mapping.
    
    ### How was this patch tested?
    New unit tests in `CollationStringExpressionsSuite`.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#46761 from uros-db/alter-translate.
    
    Authored-by: Uros Bojanic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    uros-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    d055894 View commit details
    Browse the repository at this point in the history
  88. [SPARK-47911][SQL][FOLLOWUP] Rename UTF8 to UTF-8 in spark.sql.binary…

    …OutputStyle
    
    ### What changes were proposed in this pull request?
    
    This is a follow-up for SPARK-47911 that renames UTF8 to UTF-8 in `spark.sql.binaryOutputStyle`, so that the name is consistent with `org.apache.spark.sql.catalyst.util.CharsetProvider.VALID_CHARSETS` and `java.nio.charset.StandardCharsets.UTF_8`.
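
    A hypothetical usage sketch after the rename (the conf is part of the unreleased 4.0 line; the value below simply uses the new charset-style spelling):

    ```scala
    // Assumes an active SparkSession `spark`.
    spark.conf.set("spark.sql.binaryOutputStyle", "UTF-8")
    spark.sql("SELECT CAST('abc' AS BINARY)").show()
    ```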
    
    ### Why are the changes needed?
    
    reduce cognitive cost for users
    
    ### Does this PR introduce _any_ user-facing change?
    no, unreleased feature
    
    ### How was this patch tested?
    existing tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47322 from yaooqinn/SPARK-47911-FF.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    16b616c View commit details
    Browse the repository at this point in the history
  89. [SPARK-48887][K8S] Enable `spark.kubernetes.executor.checkAllContaine…

    …rs` by default
    
    ### What changes were proposed in this pull request?
    
    This PR aims to enable `spark.kubernetes.executor.checkAllContainers` by default from Apache Spark 4.0.0.
    
    ### Why are the changes needed?
    
    Since Apache Spark 3.1.0, `spark.kubernetes.executor.checkAllContainers` has been supported and is useful because the [sidecar pattern](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/) is used in many cases. It also prevents user mistakes, where sidecar failures would otherwise be forgotten or ignored, by always reporting sidecar failures via the executor status.
    - apache#29924
    
    ### Does this PR introduce _any_ user-facing change?
    
    - This configuration is a no-op when there is no other container.
    - This will report user containers' errors correctly when other user-provided containers exist.
    
    ### How was this patch tested?
    
    Both `true` and `false` are covered by our CI test coverage since Apache Spark 3.1.0.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47337 from dongjoon-hyun/SPARK-48887.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    1b8d135 View commit details
    Browse the repository at this point in the history
  90. [SPARK-48495][SQL][DOCS] Describe shredding scheme for Variant

    ### What changes were proposed in this pull request?
    
    For the Variant data type, we plan to add support for columnar storage formats (e.g. Parquet) to write the data shredded across multiple physical columns, and read only the data required for a given query. This PR merges a document describing the approach we plan to take. We can continue to update it as the implementation progresses.
    
    ### Why are the changes needed?
    
    When implemented, this can allow much better performance when reading from columnar storage. More detail is given in the document.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    It is internal documentation; no testing should be needed.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#46831 from cashmand/SPARK-45891.
    
    Authored-by: cashmand <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    cashmand authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b5e4aec View commit details
    Browse the repository at this point in the history
  91. Revert "[SPARK-48883][ML][R] Replace RDD read / write API invocation …

    …with Dataframe read / write API"
    
    This reverts commit 0fa5787.
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    c404244 View commit details
    Browse the repository at this point in the history
  92. [SPARK-48895][R][INFRA] Use R 4.4.1 in windows R GitHub Action job

    ### What changes were proposed in this pull request?
    
    This PR aims to use R 4.4.1 in `windows` R GitHub Action job.
    
    ### Why are the changes needed?
    
    R 4.4.1 is the latest release, released on 2024-06-14.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47346 from dongjoon-hyun/SPARK-48895.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    2549087 View commit details
    Browse the repository at this point in the history
  93. [SPARK-48714][SPARK-48794][FOLLOW-UP][PYTHON][DOCS] Add mergeInto t…

    …o API reference
    
    ### What changes were proposed in this pull request?
    Add `mergeInto` to API reference
    
    ### Why are the changes needed?
    This feature was missing in the doc.
    
    ### Does this PR introduce _any_ user-facing change?
    yes, doc change
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47329 from zhengruifeng/py_doc_merge_into.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    6c41d75 View commit details
    Browse the repository at this point in the history
  94. [SPARK-48613][SQL] SPJ: Support auto-shuffle one side + less join key…

    …s than partition keys
    
    ### What changes were proposed in this pull request?
    
    This is the final planned SPJ scenario: auto-shuffle one side + fewer join keys than partition keys. Background:
    
    - Auto-shuffle works by creating a ShuffleExchange for the non-partitioned side, with a clone of the partitioned side's KeyGroupedPartitioning.
    - "Fewer join keys than partition keys" works by 'projecting' all partition values by join keys (i.e., keeping only partition columns that are join columns).  It makes a target KeyGroupedShuffleSpec with 'projected' partition values, and then pushes this down to BatchScanExec.  The BatchScanExec then 'groups' its projected partition values (except in the skew case, but that's a different story).
    
    This combination is hard because the SPJ planning calls are spread across several places in this scenario.  Given two sides, a non-partitioned side and a partitioned side, where the join keys are only a subset of the partition keys:
    
    1.  EnsureRequirements creates the target KeyGroupedShuffleSpec from the join's required distribution (i.e., using only the join keys, not all partition keys).
    2.  EnsureRequirements copies this to the non-partitioned side's KeyGroupedPartitioning (for the auto-shuffle case).
    3.  BatchScanExec groups the partitions (for the partitioned side), including by join keys (if they differ from partition keys).
    
    Take the example partition columns (id, name), and partition values: (1, "bob"), (2, "alice"), (2, "sam").
    Projection leaves us with (1, 2, 2), and the final grouped partition values are (1, 2) (see the standalone sketch after the fix description below).

    The problem is that the two sides of the join do not match at all times.  After steps 1 and 2, the partitioned side has the 'projected' partition values (1, 2, 2), and the non-partitioned side creates a matching KeyGroupedPartitioning (1, 2, 2) for the ShuffleExchange.  But in step 3, the BatchScanExec for the partitioned side groups the partitions to become (1, 2), while the non-partitioned side does not group and still retains (1, 2, 2) partitions.  This leads to the following assert error from the join:
    
    ```
    requirement failed: PartitioningCollection requires all of its partitionings have the same numPartitions.
    java.lang.IllegalArgumentException: requirement failed: PartitioningCollection requires all of its partitionings have the same numPartitions.
    	at scala.Predef$.require(Predef.scala:337)
    	at org.apache.spark.sql.catalyst.plans.physical.PartitioningCollection.<init>(partitioning.scala:550)
    	at org.apache.spark.sql.execution.joins.ShuffledJoin.outputPartitioning(ShuffledJoin.scala:49)
    	at org.apache.spark.sql.execution.joins.ShuffledJoin.outputPartitioning$(ShuffledJoin.scala:47)
    	at org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputPartitioning(SortMergeJoinExec.scala:39)
    	at org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$ensureDistributionAndOrdering$1(EnsureRequirements.scala:66)
    	at scala.collection.immutable.Vector1.map(Vector.scala:2140)
    	at scala.collection.immutable.Vector1.map(Vector.scala:385)
    	at org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$ensureDistributionAndOrdering(EnsureRequirements.scala:65)
    	at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:657)
    	at org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$1.applyOrElse(EnsureRequirements.scala:632)
    ```
    
    The fix is to do the de-duplication in the first pass:

    1. Push the join keys down to the BatchScanExec so that it returns a de-duped outputPartitioning (partitioned side).
    2. Create the non-partitioned side's KeyGroupedPartitioning with de-duped partition keys (non-partitioned side).
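
    The standalone sketch below (plain Scala collections, not Spark code) replays the example above: project the partition values down to the join keys, then de-duplicate so both sides agree on the number of partitions:

    ```scala
    val partitionCols = Seq("id", "name")
    val joinKeys      = Seq("id")
    val partitionVals = Seq(Seq[Any](1, "bob"), Seq[Any](2, "alice"), Seq[Any](2, "sam"))

    val keyIdx    = joinKeys.map(k => partitionCols.indexOf(k))        // Seq(0)
    val projected = partitionVals.map(row => keyIdx.map(i => row(i)))  // Seq(Seq(1), Seq(2), Seq(2))
    val grouped   = projected.distinct                                 // Seq(Seq(1), Seq(2))

    println(s"projected = $projected, grouped = $grouped")
    ```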
    
    ### Why are the changes needed?

    This is the last planned scenario for SPJ that is not yet supported.

    ### How was this patch tested?
    Updated an existing unit test in KeyGroupedPartitionSuite.

    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47064 from szehon-ho/spj_less_join_key_auto_shuffle.
    
    Authored-by: Szehon Ho <[email protected]>
    Signed-off-by: Chao Sun <[email protected]>
    szehon-ho authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b80ee03 View commit details
    Browse the repository at this point in the history
  95. [SPARK-48834][SQL] Disable variant input/output to python scalar UDFs…

    …, UDTFs, UDAFs during query compilation
    
    ### What changes were proposed in this pull request?
    
    Throws an exception if a variant is the input/output type to/from python UDF, UDAF, UDTF
    
    ### Why are the changes needed?
    
    Currently, variant input/output types to scalar UDFs will fail during execution or return a `net.razorvine.pickle.objects.ClassDictConstructor` to the user Python code. For a better UX, we should fail during query compilation instead, and block returning `ClassDictConstructor` to user code, as we one day want to actually return `VariantVal`s to the user code.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes - attempting to use variants in Python UDFs will now throw an exception rather than returning a `ClassDictConstructor` as before. However, we want to make this change now because we one day want to be able to return `VariantVal`s to the user code and do not want users relying on the current behavior.
    
    ### How was this patch tested?
    
    added UTs
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47253 from richardc-db/variant_scalar_udfs.
    
    Authored-by: Richard Chen <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    richardc-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    ca38071 View commit details
    Browse the repository at this point in the history
  96. [SPARK-48888][SS] Remove snapshot creation based on changelog ops size

    ### What changes were proposed in this pull request?
    Remove snapshot creation based on changelog ops size
    
    ### Why are the changes needed?
    The current mechanism to create a snapshot is based on the number of batches or the number of ops in the changelog. However, the latter is not configurable and might not correspond to large snapshot sizes in all cases, leading to variance in e2e latency. Hence, we remove this condition for now.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Augmented unit tests
    
    ```
    ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.RocksDBSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-8 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), ForkJoinPool.common...
    [info] Run completed in 5 minutes, 7 seconds.
    [info] Total number of tests run: 176
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 176, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    [success] Total time: 332 s (05:32), completed Jul 12, 2024, 2:46:44 PM
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47338 from anishshri-db/task/SPARK-48888.
    
    Authored-by: Anish Shrigondekar <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    anishshri-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    01886a0 View commit details
    Browse the repository at this point in the history
  97. [SPARK-48880][CORE] Avoid throw NullPointerException if driver plugin…

    … fails to initialize
    
    ### What changes were proposed in this pull request?
    
    This PR skips clearing the memoryStore if the memoryManager is null. This could happen if the driver plugin fails to initialize, since we initialize the MemoryManager after the DriverPlugin.
    
    ### Why are the changes needed?
    
    Before this change, it would throw:
    ```
    {"class":"java.lang.NullPointerException","msg":"Cannot invoke \"org.apache.spark.memory.MemoryManager.maxOnHeapStorageMemory()\" because \"this.memoryManager\" is null","stacktrace":[{"class":"org.apache.spark.storage.memory.MemoryStore","method":"maxMemory","file":"MemoryStore.scala","line":110},
    {"class":"org.apache.spark.storage.memory.MemoryStore","method":"<init>","file":"MemoryStore.scala","line":113},
    {"class":"org.apache.spark.storage.BlockManager","method":"memoryStore$lzycompute","file":"BlockManager.scala","line":234},
    {"class":"org.apache.spark.storage.BlockManager","method":"memoryStore","file":"BlockManager.scala","line":233},
    {"class":"org.apache.spark.storage.BlockManager","method":"stop","file":"BlockManager.scala","line":2167},
    {"class":"org.apache.spark.SparkEnv","method":"stop","file":"SparkEnv.scala","line":118},
    {"class":"org.apache.spark.SparkContext","method":"$anonfun$stop$25","file":"SparkContext.scala","line":2369},
    {"class":"org.apache.spark.util.Utils$","method":"tryLogNonFatalError","file":"Utils.scala","line":1299},
    {"class":"org.apache.spark.SparkContext","method":"stop","file":"SparkContext.scala","line":2369}
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    manually test
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47321 from ulysses-you/minor.
    
    Authored-by: ulysses-you <[email protected]>
    Signed-off-by: youxiduo <[email protected]>
    ulysses-you authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f52a87b View commit details
    Browse the repository at this point in the history
  98. [SPARK-48894][TESTS] Upgrade docker-java to 3.4.0

    ### What changes were proposed in this pull request?
    
    This PR aims to upgrade `docker-java` to 3.4.0.
    
    ### Why are the changes needed?
    
    There are some improvements, such as:
    
    - Enhancements
    
    Enable protocol configuration of SSLContext (docker-java/docker-java#2337)
    
    - Bug Fixes
    
    Consider already existing images as successful pulls (docker-java/docker-java#2335)
    
    Full release notes:
    https://github.com/docker-java/docker-java/releases/tag/3.4.0
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47344 from wayneguow/SPARK-48894.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f34ec9f View commit details
    Browse the repository at this point in the history
  99. [SPARK-48463][ML] Make StringIndexer supporting nested input columns

    ### What changes were proposed in this pull request?
    
    Make StringIndexer support nested input columns.
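
    A minimal usage sketch; the DataFrame layout and the nested column path are illustrative assumptions:

    ```scala
    import org.apache.spark.ml.feature.StringIndexer

    val indexer = new StringIndexer()
      .setInputCol("address.city")   // nested field path, now supported as an input column
      .setOutputCol("cityIndex")
    // indexer.fit(df).transform(df) // assumes df has a struct column `address` with a `city` field
    ```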
    
    ### Why are the changes needed?
    
    User demand.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes.
    
    ### How was this patch tested?
    
    Unit tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Closes apache#47283 from WeichenXu123/SPARK-48463.
    
    Lead-authored-by: Weichen Xu <[email protected]>
    Co-authored-by: WeichenXu <[email protected]>
    Signed-off-by: Weichen Xu <[email protected]>
    WeichenXu123 authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    e438890 View commit details
    Browse the repository at this point in the history
  100. [SPARK-48441][SQL] Fix StringTrim behaviour for non-UTF8_BINARY colla…

    …tions
    
    ### What changes were proposed in this pull request?
    String searching in UTF8_LCASE now works at the character level, rather than at the byte level. For example: `ltrim("İ", "i")` now returns `"İ"`, because there exist **no characters** in `"İ"`, starting from the left, whose lowercased versions equal `"i"`. Note, however, that there is a byte subsequence of `"İ"` such that the lowercased version of that UTF-8 byte sequence equals `"i"` (so the new behaviour is different from the old behaviour).

    Also, trimming for ICU collations works by repeatedly trimming the longest possible substring that matches a character in the trim string, starting from the left side of the input string, until trimming is done.
    
    ### Why are the changes needed?
    Fix functions that give unusable results due to one-to-many case mapping when performing string search under UTF8_LCASE (see example above).
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, behaviour of `trim*` expressions is changed for collated strings for edge cases with one-to-many case mapping.
    
    ### How was this patch tested?
    New unit tests in `CollationSupportSuite` and new e2e sql tests in `CollationStringExpressionsSuite`.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#46762 from uros-db/alter-trim.
    
    Authored-by: Uros Bojanic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    uros-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    424a9d8 View commit details
    Browse the repository at this point in the history
  101. [MINOR][SQL][TESTS] Fix compilation warning `adaptation of an empty a…

    …rgument list by inserting () is deprecated`
    
    ### What changes were proposed in this pull request?
    The PR aims to fix the compilation warning: `adaptation of an empty argument list by inserting () is deprecated`.
    
    ### Why are the changes needed?
    Fix compilation warning.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Manually check.
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47350 from panbingkun/ParquetCommitterSuite_deprecated.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    c812281 View commit details
    Browse the repository at this point in the history
  102. [SPARK-48899][K8S] Fix ENV key value format in K8s Dockerfiles

    ### What changes were proposed in this pull request?
    
    This PR aims to fix `ENV` key value format in K8s Dockerfiles.
    
    ### Why are the changes needed?
    
    To follow the Docker guidelines and fix the following legacy format.
    - https://docs.docker.com/reference/build-checks/legacy-key-value-format/
    ```
    - LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass the CIs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47357 from dongjoon-hyun/SPARK-48899.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    1f1a2f6 View commit details
    Browse the repository at this point in the history
  103. [SPARK-48886][SS] Add version info to changelog v2 to allow for easie…

    …r evolution
    
    ### What changes were proposed in this pull request?
    Add version info to changelog v2 to allow for easier evolution
    
    ### Why are the changes needed?
    Currently the changelog file format does not add the version info. With format v2, we propose to add this to the changelog file itself to make future evolution easier.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Augmented unit tests
    ```
    ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.RocksDBSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-8 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), ForkJoinPool.common...
    [info] Run completed in 4 minutes, 23 seconds.
    [info] Total number of tests run: 176
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 176, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47336 from anishshri-db/task/SPARK-48886.
    
    Authored-by: Anish Shrigondekar <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    anishshri-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    9847ceb View commit details
    Browse the repository at this point in the history
  104. [SPARK-48350][SQL] Introduction of Custom Exceptions for Sql Scripting

    ### What changes were proposed in this pull request?
    Previous PRs introduced basic changes for SQL Scripting. This PR is a follow-up that introduces custom exceptions that can arise while using the SQL Scripting language.
    
    ### Why are the changes needed?
    The intent is to add precise errors for various SQL scripting concepts.
    
    ### Does this PR introduce any user-facing change?
    Users will now see specific SQL Scripting language errors.
    
    ### How was this patch tested?
    There are tests for newly introduced parser changes:
    
    SqlScriptingParserSuite - unit tests for execution nodes.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47147 from miland-db/sql_batch_custom_errors.
    
    Lead-authored-by: Milan Dankovic <[email protected]>
    Co-authored-by: David Milicevic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    1a365b5 View commit details
    Browse the repository at this point in the history
  105. [SPARK-45155][CONNECT] Add API Docs for Spark Connect JVM/Scala Client

    This PR is based on apache#42911.
    
    ### What changes were proposed in this pull request?
    
    - Enables Scala and Java Unidoc generation for the `connectClient` project.
    - Generates docs and moves them to the `docs/api/connect` folder.
    
    Some methods' documentation in the connect directory had to be modified to remove references to avoid javadoc generation failures. **References API docs in the main index page and the global floating header will be added in a later PR.**
    
    ### Why are the changes needed?
    
    Increasing scope of documentation for the Spark Connect JVM/Scala Client project.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Nope.
    
    ### How was this patch tested?
    
    Manual test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47332 from xupefei/connnect-doc-web.
    
    Authored-by: Paddy Xu <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    xupefei authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    9a42009 View commit details
    Browse the repository at this point in the history
  106. [SPARK-47172][DOCS][FOLLOWUP] Fix spark.network.crypto.cipher since ve…

    …rsion field on security page
    
    ### What changes were proposed in this pull request?
    
    Given that SPARK-47172 was an improvement but got merged into 3.4/3.5, we need to fix the since version to eliminate misunderstandings.
    
    ### Why are the changes needed?
    
    doc fix
    
    ### Does this PR introduce _any_ user-facing change?
    
    no
    
    ### How was this patch tested?
    
    doc build
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47353 from yaooqinn/SPARK-47172.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    edeefaa View commit details
    Browse the repository at this point in the history
  107. [SPARK-48885][SQL] Make some subclasses of RuntimeReplaceable overrid…

    …e replacement to lazy val
    
    ### What changes were proposed in this pull request?
    
    This PR makes 8 subclasses of RuntimeReplaceable override `replacement` as a lazy val, to align with the other 60+ members and to avoid re-creating replacement expressions on every access.
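
    A standalone illustration (not Spark code) of why `lazy val` is preferred over `def` here:

    ```scala
    class WithDef  { def replacement: AnyRef = new Object }       // builds a new instance per call
    class WithLazy { lazy val replacement: AnyRef = new Object }  // builds once, then caches it

    val d = new WithDef
    val l = new WithLazy
    println(d.replacement eq d.replacement)  // false: recreated on every access
    println(l.replacement eq l.replacement)  // true: computed once and reused
    ```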
    
    ```scala
    Value read  (51 usages found)
                spark-catalyst_2.13  (50 usages found)
                    AnyValue.scala  (1 usage found)
                        54 override lazy val replacement: Expression = First(child, ignoreNulls)
                    arithmetic.scala  (1 usage found)
                        127 override lazy val replacement: Expression = child
                    bitmapExpressions.scala  (3 usages found)
                        52 override lazy val replacement: Expression = StaticInvoke(
                        85 override lazy val replacement: Expression = StaticInvoke(
                        134 override lazy val replacement: Expression = StaticInvoke(
                    boolAggregates.scala  (2 usages found)
                        39 override lazy val replacement: Expression = Min(child)
                        61 override lazy val replacement: Expression = Max(child)
                    collationExpressions.scala  (1 usage found)
                        123 override def replacement: Expression = {
                    collectionOperations.scala  (5 usages found)
                        168 override lazy val replacement: Expression = Size(child, legacySizeOfNull = false)
                        231 override lazy val replacement: Expression = ArrayContains(MapKeys(left), right)
                        1596 override lazy val replacement: Expression = new ArrayInsert(left, Literal(1), right)
                        1631 override lazy val replacement: Expression = new ArrayInsert(left, Literal(-1), right)
                        5203 override lazy val replacement: Expression = ArrayFilter(child, lambda)
                    CountIf.scala  (1 usage found)
                        42 override lazy val replacement: Expression = Count(new NullIf(child, Literal.FalseLiteral))
                    datetimeExpressions.scala  (2 usages found)
                        2070 override lazy val replacement: Expression = format.map { f =>
                        2145 override lazy val replacement: Expression = format.map { f =>
                    linearRegression.scala  (5 usages found)
                        45 override lazy val replacement: Expression = Count(Seq(left, right))
                        79 override lazy val replacement: Expression =
                        114 override lazy val replacement: Expression =
                        176 override lazy val replacement: Expression =
                        232 override lazy val replacement: Expression =
                    misc.scala  (3 usages found)
                        294 override lazy val replacement: Expression = StaticInvoke(
                        397 override lazy val replacement: Expression = StaticInvoke(
                        475 override lazy val replacement: Expression = StaticInvoke(
                    percentiles.scala  (2 usages found)
                        346 override def replacement: Expression = percentile
                        365 override def replacement: Expression = percentile
                    regexpExpressions.scala  (3 usages found)
                        262 override lazy val replacement: Expression = Like(Lower(left), Lower(right), escapeChar)
                        1034 override lazy val replacement: Expression =
                        1072 override lazy val replacement: Expression =
                    stringExpressions.scala  (14 usages found)
                        561 override lazy val replacement =
                        723 override lazy val replacement: Expression = Invoke(input, "isValid", BooleanType)
                        770 override lazy val replacement: Expression = Invoke(input, "makeValid", input.dataType)
                        810 override lazy val replacement: Expression = StaticInvoke(
                        859 override lazy val replacement: Expression = StaticInvoke(
                        1854 override lazy val replacement: Expression = StaticInvoke(
                        2246 override lazy val replacement: Expression = If(
                        2284 override lazy val replacement: Expression = Substring(str, Literal(1), len)
                        2713 override def replacement: Expression = StaticInvoke(
                        2940 override def replacement: Expression = StaticInvoke(
                        3004 override val replacement: Expression = StaticInvoke(
                        3075 override lazy val replacement: Expression = if (fmt == null) {
                        3473 override lazy val replacement: Expression =
                        3533 override lazy val replacement: Expression = StaticInvoke(
                    toFromAvroSqlFunctions.scala  (2 usages found)
                        96 override def replacement: Expression = {
                        168 override def replacement: Expression = {
                    urlExpressions.scala  (2 usages found)
                        55 override def replacement: Expression =
                        92 override def replacement: Expression =
                    variantExpressions.scala  (3 usages found)
                        58 override lazy val replacement: Expression = StaticInvoke(
                        100 override lazy val replacement: Expression = StaticInvoke(
                        635 override lazy val replacement: Expression = StaticInvoke(
                spark-examples_2.13  (1 usage found)
                    AgeExample.scala  (1 usage found)
                        27 override lazy val replacement: Expression = SubtractDates(CurrentDate(), birthday)
    ```
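    
    For context, here is a minimal, self-contained sketch of why a `lazy val` helps here; it uses hypothetical stand-in types rather than Spark's actual `RuntimeReplaceable`/`Expression` classes.
    
    ```scala
    // Simplified sketch (hypothetical types, not Spark's real trait):
    // a `def` rebuilds the replacement on every access, a `lazy val` builds it once and caches it.
    object LazyReplacementSketch {
      trait Replaceable {
        def replacement: String // stand-in for Expression
      }
    
      class WithDef extends Replaceable {
        // re-evaluated (and re-allocated) on every call
        override def replacement: String = { println("building replacement"); "expr" }
      }
    
      class WithLazyVal extends Replaceable {
        // evaluated once, cached afterwards
        override lazy val replacement: String = { println("building replacement"); "expr" }
      }
    
      def main(args: Array[String]): Unit = {
        val d = new WithDef
        d.replacement; d.replacement // prints twice
        val l = new WithLazyVal
        l.replacement; l.replacement // prints once
      }
    }
    ```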
    
    ### Why are the changes needed?
    
    Improve RuntimeReplaceable implementations
    
    ### Does this PR introduce _any_ user-facing change?
    NO
    
    ### How was this patch tested?
    existing tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47333 from yaooqinn/SPARK-48885.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b751eed View commit details
    Browse the repository at this point in the history
  108. [SPARK-48884][PYTHON] Remove unused helper function `PythonSQLUtils.m…

    …akeInterval`
    
    ### What changes were proposed in this pull request?
    Remove unused helper function `PythonSQLUtils.makeInterval`
    
    ### Why are the changes needed?
    As a followup cleanup of apache@bd14d64
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    NO
    
    Closes apache#47330 from zhengruifeng/py_sql_utils_cleanup.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    0afb112 View commit details
    Browse the repository at this point in the history
  109. [SPARK-45190][SPARK-48897][PYTHON][CONNECT] Make from_xml support S…

    …tructType schema
    
    ### What changes were proposed in this pull request?
    Make `from_xml` support StructType schema
    
    ### Why are the changes needed?
    StructType schema was supported in Spark Classic, but not in Spark Connect
    
    to address apache#43680 (comment)
    
    ### Does this PR introduce _any_ user-facing change?
    
    before:
    ```
    from pyspark.sql.types import StructType, LongType
    import pyspark.sql.functions as sf
    data = [(1, '''<p><a>1</a></p>''')]
    df = spark.createDataFrame(data, ("key", "value"))
    
    schema = StructType().add("a", LongType())
    df.select(sf.from_xml(df.value, schema)).show()
    
    ---------------------------------------------------------------------------
    AnalysisException                         Traceback (most recent call last)
    Cell In[1], line 7
    ...
    AnalysisException: [PARSE_SYNTAX_ERROR] Syntax error at or near '{'. SQLSTATE: 42601
    
    JVM stacktrace:
    org.apache.spark.sql.AnalysisException
    	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(parsers.scala:278)
    	at org.apache.spark.sql.catalyst.parser.AbstractParser.parse(parsers.scala:98)
    	at org.apache.spark.sql.catalyst.parser.AbstractParser.parseDataType(parsers.scala:40)
    	at org.apache.spark.sql.types.DataType$.$anonfun$fromDDL$1(DataType.scala:126)
    	at org.apache.spark.sql.types.DataType$.parseTypeWithFallback(DataType.scala:145)
    	at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:127)
    ```
    
    after:
    ```
    +---------------+
    |from_xml(value)|
    +---------------+
    |            {1}|
    +---------------+
    
    ```
    
    ### How was this patch tested?
    added doctest and enabled unit tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes apache#47355 from zhengruifeng/from_xml_struct.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    7122c0b View commit details
    Browse the repository at this point in the history
  110. [SPARK-48846][PYTHON][DOCS][FOLLOWUP] Add a missing param doc in pyth…

    …on api `partitioning` functions docs
    
    ### What changes were proposed in this pull request?
    
    Add a missing param in func docs of `partitioning.py`.
    
    ### Why are the changes needed?
    
    - Make python api docs better.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA and docs check.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47345 from wayneguow/py_f_docs.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    ac58556 View commit details
    Browse the repository at this point in the history
  111. [SPARK-48902][BUILD] Upgrade commons-codec to 1.17.1

    ### What changes were proposed in this pull request?
    The pr aims to upgrade `commons-codec` from `1.17.0` to `1.17.1`.
    
    ### Why are the changes needed?
    The full release notes: https://commons.apache.org/proper/commons-codec/changes-report.html#a1.17.1
    This version fixes some bugs from the previous version, e.g.:
    - Md5Crypt now throws IllegalArgumentException on an invalid prefix
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47362 from panbingkun/SPARK-48902.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    09a66c7 View commit details
    Browse the repository at this point in the history
  112. [SPARK-48873][SQL] Use UnsafeRow in JSON parser

    ### What changes were proposed in this pull request?
    
    It uses `UnsafeRow` to represent struct results in the JSON parser. It saves memory compared to the current `GenericInternalRow`. The change is guarded by a flag and disabled by default.
    
    The benchmark shows that enabling the flag brings ~10% slowdown. This is basically expected because converting to `UnsafeRow` requires some work. The purpose of the PR is to provide an alternative to save memory.
    
    I did the following experiment. It generates a big `.gz` JSON file containing a single large array. Each array element is a struct with 50 string fields and will be parsed into a row by the JSON reader.
    
    ```
    s = b'{"field00":null,"field01":"field01_<v>","field02":"field02_<v>","field03":"field03_<v>","field04":"field04_<v>","field05":"field05_<v>","field06":"field06_<v>","field07":"field07_<v>","field08":"field08_<v>","field09":"field09_<v>","field10":null,"field11":"field11_<v>","field12":"field12_<v>","field13":"field13_<v>","field14":"field14_<v>","field15":"field15_<v>","field16":"field16_<v>","field17":"field17_<v>","field18":"field18_<v>","field19":"field19_<v>","field20":null,"field21":"field21_<v>","field22":"field22_<v>","field23":"field23_<v>","field24":"field24_<v>","field25":"field25_<v>","field26":"field26_<v>","field27":"field27_<v>","field28":"field28_<v>","field29":"field29_<v>","field30":null,"field31":"field31_<v>","field32":"field32_<v>","field33":"field33_<v>","field34":"field34_<v>","field35":"field35_<v>","field36":"field36_<v>","field37":"field37_<v>","field38":"field38_<v>","field39":"field39_<v>","field40":null,"field41":"field41_<v>","field42":"field42_<v>","field43":"field43_<v>","field44":"field44_<v>","field45":"field45_<v>","field46":"field46_<v>","field47":"field47_<v>","field48":"field48_<v>","field49":"field49_<v>"}'
    
    import gzip
    
    def write(n):
      with gzip.open(f'json{n}.gz', 'w') as f:
        f.write(b'[')
        for i in range(n):
            if i != 0:
                f.write(b',')
            f.write(s.replace(b'<v>', str(i).encode('ascii')))
        f.write(b']')
    
    write(100000)
    ```
    
    Then it processes the file in Spark shell with the following command:
    
    ```
    ./bin/spark-shell --conf spark.driver.memory=1g --conf spark.executor.memory=1g  --master "local[1]"
    
    > val schema = "field00 string, field01 string, field02 string, field03 string, field04 string, field05 string, field06 string, field07 string, field08 string, field09 string, field10 string, field11 string, field12 string, field13 string, field14 string, field15 string, field16 string, field17 string, field18 string, field19 string, field20 string, field21 string, field22 string, field23 string, field24 string, field25 string, field26 string, field27 string, field28 string, field29 string, field30 string, field31 string, field32 string, field33 string, field34 string, field35 string, field36 string, field37 string, field38 string, field39 string, field40 string, field41 string, field42 string, field43 string, field44 string, field45 string, field46 string, field47 string, field48 string, field49 string"
    > spark.conf.set("spark.sql.json.useUnsafeRow", "false")
    > spark.read.schema(schema).option("multiline", "true").json("json100000.gz").selectExpr("sum(hash(struct(*)))").collect()
    ```
    
    When the flag is off (the current behavior), the query can process 2.5e5 rows but fails to process 3e5 rows. When the flag is on, the query can process 8e5 rows but fails to process 9e5 rows. We can say this change reduces the memory consumption to about 1/3.
    
    ### Why are the changes needed?
    
    It reduces the memory requirement of JSON-related queries.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    A new JSON unit test with the config flag on.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47310 from chenhao-db/json_unsafe_row.
    
    Authored-by: Chenhao Li <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    chenhao-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    8510760 View commit details
    Browse the repository at this point in the history
  113. [SPARK-48896][ML][MLLIB] Avoid repartition when writing out the metadata

    ### What changes were proposed in this pull request?
    
    This PR proposes to remove `repartition(1)` when writing metadata in ML/MLlib. It already writes one file.
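    
    As an illustration, here is a minimal, self-contained sketch of the idea (the output path and metadata content are illustrative, and this is not the actual ML writer code): a single-row local Dataset already yields a single partition, so `repartition(1)` adds nothing but a shuffle.
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    object MetadataWriteSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("metadata-sketch").getOrCreate()
        import spark.implicits._
    
        val metadataJson = """{"class":"org.apache.spark.ml.feature.Tokenizer","sparkVersion":"4.0.0"}"""
        // Before: Seq(metadataJson).toDF("value").repartition(1).write.text(path)
        // After: the single-row local relation is already a single partition.
        val df = Seq(metadataJson).toDF("value")
        println(df.rdd.getNumPartitions) // 1
        df.write.mode("overwrite").text("/tmp/metadata-sketch")
    
        spark.stop()
      }
    }
    ```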
    
    ### Why are the changes needed?
    
    In order to remove unnecessary shuffle, see also apache#47341
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests should verify them.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47347 from HyukjinKwon/SPARK-48896.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    bb5267c View commit details
    Browse the repository at this point in the history
  114. [SPARK-48909][ML][MLLIB] Uses SparkSession over SparkContext when wri…

    …ting metadata
    
    ### What changes were proposed in this pull request?
    
    This PR proposes to use SparkSession over SparkContext when writing metadata
    
    ### Why are the changes needed?
    
    See apache#47347 (comment)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests should cover it.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47366 from HyukjinKwon/SPARK-48909.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    a402f51 View commit details
    Browse the repository at this point in the history
  115. [SPARK-48892][ML] Avoid per-row param read in Tokenizer

    ### What changes were proposed in this pull request?
    Inspired by apache#47258, I checked other ML implementations and found that we can also optimize `Tokenizer` in the same way.
    
    ### Why are the changes needed?
    The function `createTransformFunc` builds the UDF used by `UnaryTransformer.transform`:
    https://github.com/apache/spark/blob/d679dabdd1b5ad04b8c7deb1c06ce886a154a928/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L118
    
    The existing implementation reads the params for each row.
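    
    Below is a simplified, hypothetical sketch of the optimization (not the actual `Tokenizer` code): the per-row function closes over param values that are read once up front.
    
    ```scala
    // Hypothetical stand-in types; the point is where the param lookup happens.
    object ParamReadSketch {
      // stand-in for an ML Param lookup, which has non-trivial per-call cost
      final class Params(private val pattern: String) {
        def getPattern: String = pattern
      }
    
      // Before: the lambda re-reads the param for every input row.
      def transformFuncBefore(p: Params): String => Array[String] =
        (row: String) => row.split(p.getPattern)
    
      // After: the param is read once and the lambda closes over the plain value.
      def transformFuncAfter(p: Params): String => Array[String] = {
        val pattern = p.getPattern
        (row: String) => row.split(pattern)
      }
    
      def main(args: Array[String]): Unit = {
        val f = transformFuncAfter(new Params("-"))
        println(f("a-b-c").mkString(",")) // prints: a,b,c
      }
    }
    ```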
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    CI and manual tests:
    
    create test dataset
    ```
    spark.range(1000000).select(uuid().as("uuid")).write.mode("overwrite").parquet("/tmp/regex_tokenizer.parquet")
    ```
    
    duration
    ```
    val df = spark.read.parquet("/tmp/regex_tokenizer.parquet")
    import org.apache.spark.ml.feature._
    val tokenizer = new RegexTokenizer().setPattern("-").setInputCol("uuid")
    Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()) // warm up
    val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic
    ```
    
    result (before this PR)
    ```
    scala> val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic
    val tic: Long = 1720613235068
    val res5: Long = 50397
    ```
    
    result (after this PR)
    ```
    scala> val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic
    val tic: Long = 1720612871256
    val res5: Long = 43748
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47342 from zhengruifeng/opt_tokenizer.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    62c217a View commit details
    Browse the repository at this point in the history
  116. [SPARK-48883][ML][R] Replace RDD read / write API invocation with Dat…

    …aframe read / write API
    
    ### What changes were proposed in this pull request?
    
    This PR is a retry of apache#47328, which replaces the RDD API with the Dataset API for writing SparkR metadata; this PR additionally removes `repartition(1)`. We don't actually need it when the input is a single row, since that creates only a single partition:
    
    https://github.com/apache/spark/blob/e5e751b98f9ef5b8640079c07a9a342ef471d75d/sql/core/src/main/scala/org/apache/spark/sql/execution/LocalTableScanExec.scala#L49-L57
    
    ### Why are the changes needed?
    
    In order to leverage the Catalyst optimizer and SQL engine. For example, we now use UTF-8 encoding instead of plain JDK serialization/deserialization for strings. We have made similar changes in the past, e.g., apache#29063, apache#15813, apache#17255 and SPARK-19918.
    
    Also, we remove `repartition(1)` to avoid an unnecessary shuffle.
    
    With `repartition(1)`:
    
    ```
    == Physical Plan ==
    AdaptiveSparkPlan isFinalPlan=false
    +- Exchange SinglePartition, REPARTITION_BY_NUM, [plan_id=6]
       +- LocalTableScan [_1#0]
    ```
    
    Without `repartition(1)`:
    
    ```
    == Physical Plan ==
    LocalTableScan [_1#2]
    ```
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    CI in this PR should verify the change
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47341 from HyukjinKwon/SPARK-48883-followup.
    
    Authored-by: Hyukjin Kwon <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    HyukjinKwon authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    c9c0ab2 View commit details
    Browse the repository at this point in the history
  117. [SPARK-48903][SS] Set the RocksDB last snapshot version correctly on …

    …remote load
    
    ### What changes were proposed in this pull request?
    Set the RocksDB last snapshot version correctly on remote load
    
    ### Why are the changes needed?
    Avoid creating a full snapshot on the first batch after every restart, and also reset a snapshot that is likely no longer valid.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added unit tests
    ```
    ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.RocksDBSuite, threads: ForkJoinPool.commonPool-worker-6 (daemon=true), ForkJoinPool.commonPool-worker-4 (daemon=true), ForkJoinPool.commonPool-worker-7 (daemon=true), ForkJoinPool.commonPool-worker-5 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-8 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true), ForkJoinPool.common...
    [info] Run completed in 4 minutes, 40 seconds.
    [info] Total number of tests run: 176
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 176, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47363 from anishshri-db/task/SPARK-48903.
    
    Authored-by: Anish Shrigondekar <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    anishshri-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b9254d0 View commit details
    Browse the repository at this point in the history
  118. [SPARK-48510][CONNECT][FOLLOW-UP] Fix for UDAF toColumn API when ru…

    …nning tests in Maven
    
    ### What changes were proposed in this pull request?
    
    This PR fixes an issue where the TypeTag lookup during `udaf.toColumn` failed in the Maven test environment with the following error:
    
    >   java.lang.IllegalArgumentException: Type tag defined in [JavaMirror with jdk.internal.loader.ClassLoaders$AppClassLoader1dbd16a6 of type class jdk.internal.loader.ClassLoaders$AppClassLoader with classpath [<unknown>] and parent being jdk.internal.loader.ClassLoaders$PlatformClassLoader6bd61f98 of type class jdk.internal.loader.ClassLoaders$PlatformClassLoader with classpath [<unknown>] and parent being primordial classloader with boot classpath [<unknown>]] cannot be migrated to another mirror [JavaMirror <ins>with java.net.URLClassLoader5a4041cc of type class java.net.URLClassLoader with classpath [file:/\<redacted\>/spark/connector/connect/client/jvm/target/scala-2.13/classes/,file:/\<redacted\>/spark/connector/connect/client/jvm/target/scala-2.13/test-classes/]</ins> and parent being jdk.internal.loader.ClassLoaders$AppClassLoader1dbd16a6 of type class jdk.internal.loader.ClassLoaders$AppClassLoader with classpath [<unknown>] and parent being jdk.internal.loader.ClassLoaders$PlatformClassLoader6bd61f98 of type class jdk.internal.loader.ClassLoaders$PlatformClassLoader with classpath [<unknown>] and parent being primordial classloader with boot classpath [<unknown>]].
    
    The problem is caused by Maven adding a `URLClassLoader` on top of the original `AppClassLoader` (see the underlined text in the above error message).
    
    This PR changes the mirror-matching logic from `eq` to `hasCommonAncestors`.
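    
    For illustration, here is a hypothetical sketch of the idea behind matching mirrors by common classloader ancestry (this is not the actual Spark code, and the helper names are made up):
    
    ```scala
    // Accept two mirrors if their classloaders' parent chains share any loader,
    // instead of requiring the loaders to be the very same instance (`eq`).
    object ClassLoaderAncestrySketch {
      private def chain(cl: ClassLoader): List[ClassLoader] =
        Iterator.iterate(cl)(_.getParent).takeWhile(_ != null).toList
    
      def hasCommonAncestor(a: ClassLoader, b: ClassLoader): Boolean = {
        val ancestorsOfA = chain(a).toSet
        chain(b).exists(ancestorsOfA.contains)
      }
    
      def main(args: Array[String]): Unit = {
        val app = getClass.getClassLoader
        val child = new java.net.URLClassLoader(Array.empty[java.net.URL], app)
        println(hasCommonAncestor(child, app)) // true: both chains contain `app`
      }
    }
    ```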
    
    ### Why are the changes needed?
    
    The previous logic fails in the Maven test environment.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47368 from xupefei/udaf-tocolumn-fixup.
    
    Authored-by: Paddy Xu <[email protected]>
    Signed-off-by: Haejoon Lee <[email protected]>
    xupefei authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    66eaa7f View commit details
    Browse the repository at this point in the history
  119. [SPARK-47307][DOCS][FOLLOWUP] Add a migration guide for the behavior …

    …change of base64 function
    
    ### What changes were proposed in this pull request?
    
    Follow up to apache#47303
    
    Add a migration guide for the behavior change of `base64` function
    
    ### Why are the changes needed?
    Users need a migration guide entry describing the behavior change of the `base64` function introduced in apache#47303.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    doc change
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47371 from wForget/SPARK-47307_doc.
    
    Authored-by: wforget <[email protected]>
    Signed-off-by: allisonwang-db <[email protected]>
    wForget authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    e9c1474 View commit details
    Browse the repository at this point in the history
  120. [SPARK-48889][SS] testStream to unload state stores before finishing

    ### What changes were proposed in this pull request?
    At the end of each testStream() call, unload all state stores from the executor.
    
    ### Why are the changes needed?
    Currently, after a test, we don't unload state stores or disable the maintenance task. So after a test, the maintenance task can run and fail because the checkpoint directory has already been deleted. This might cause an issue and fail the next test.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Verified by existing tests passing.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47339 from siying/SPARK-48889.
    
    Authored-by: Siying Dong <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    siying authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    cc9c851 View commit details
    Browse the repository at this point in the history
  121. [SPARK-48865][SQL] Add try_url_decode function

    ### What changes were proposed in this pull request?
    
    Add a `try_url_decode` function that performs the same operation as `url_decode`, but returns a NULL value instead of raising an error if the decoding cannot be performed.
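    
    A minimal usage sketch in spark-shell, assuming a build that includes this change and an active SparkSession `spark`:
    
    ```scala
    spark.sql("SELECT url_decode('https%3A%2F%2Fspark.apache.org')").show(truncate = false)
    // decodes to: https://spark.apache.org
    
    spark.sql("SELECT try_url_decode('test%1')").show(truncate = false)
    // 'test%1' is not valid percent-encoding, so the result is NULL instead of an error
    ```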
    
    ### Why are the changes needed?
    
    In Hive, we usually do URL decoding like `reflect('java.net.URLDecoder', 'decode', 'test%1')`, which returns a `NULL` value instead of raising an error if the decoding cannot be performed.
    
    Although Spark provides a `try_reflect` function for this, as commented in apache#34023 (comment), the `reflect` function may prevent partition pruning from taking effect. So I propose to add a new `try_url_decode` function.
    
    ### Does this PR introduce _any_ user-facing change?
    
    add a new function
    
    ### How was this patch tested?
    
    added tests and did manual testing
    
    spark-sql:
    ![image](https://github.com/apache/spark/assets/17894939/0ffd3aa2-98f7-4af4-b478-67002b8b0d4b)
    
    pyspark:
    ![image](https://github.com/apache/spark/assets/17894939/d2c1926b-f9a0-422c-abc9-5f224d822811)
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47294 from wForget/try_url_decode.
    
    Lead-authored-by: wforget <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f99e678 View commit details
    Browse the repository at this point in the history
  122. [SPARK-48923][SQL][TESTS] Fix the incorrect logic of `CollationFactor…

    …ySuite`
    
    ### What changes were proposed in this pull request?
    The pr aims to fix the incorrect logic of `CollationFactorySuite`.
    
    ### Why are the changes needed?
    This only fixes the incorrect test logic in `CollationFactorySuite`.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Update existed UT.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47382 from panbingkun/fix_CollationFactorySuite.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    8ad09d4 View commit details
    Browse the repository at this point in the history
  123. [SPARK-48907][SQL] Fix the value explicitTypes in `COLLATION_MISMAT…

    …CH.EXPLICIT`
    
    ### What changes were proposed in this pull request?
    The pr aims to
    - fix the value `explicitTypes` in `COLLATION_MISMATCH.EXPLICIT`.
    - use `checkError` to check exception in `CollationSQLExpressionsSuite` and `CollationStringExpressionsSuite`.
    
    ### Why are the changes needed?
    This only fixes a bug, e.g.:
    ```
    SELECT concat_ws(' ', collate('Spark', 'UTF8_LCASE'), collate('SQL', 'UNICODE'))
    ```
    
    - Before:
      ```
      [COLLATION_MISMATCH.EXPLICIT] Could not determine which collation to use for string functions and operators. Error occurred due to the mismatch between explicit collations: `string collate UTF8_LCASE`.`string collate UNICODE`. Decide on a single explicit collation and remove others. SQLSTATE: 42P21
      ```
      <img width="747" alt="image" src="https://github.com/user-attachments/assets/4e026cb5-2875-4370-9bb9-878f0b607f41">
    
    - After:
      ```
      [COLLATION_MISMATCH.EXPLICIT] Could not determine which collation to use for string functions and operators. Error occurred due to the mismatch between explicit collations: [`string collate UTF8_LCASE`, `string collate UNICODE`]. Decide on a single explicit collation and remove others. SQLSTATE: 42P21
      ```
      <img width="738" alt="image" src="https://github.com/user-attachments/assets/86f489a2-9f2d-4f59-bdb1-95c051a93ee8">
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Updated existed UT.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47365 from panbingkun/SPARK-48907.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    aa5d0e0 View commit details
    Browse the repository at this point in the history
  124. [SPARK-48927][CORE] Show the number of cached RDDs in StoragePage

    ### What changes were proposed in this pull request?
    
    This PR aims to show the number of cached RDDs in `StoragePage` like the other `Jobs` page or `Stages` page.
    
    ### Why are the changes needed?
    
    To improve the UX by providing additional summary information in a consistent way.
    
    **BEFORE**
    
    ![Screenshot 2024-07-17 at 09 46 44](https://github.com/user-attachments/assets/3e57bf91-e97d-404d-aeda-159ab9cb65e3)
    
    **AFTER**
    
    ![Screenshot 2024-07-17 at 09 46 01](https://github.com/user-attachments/assets/d416ea16-8255-48d8-ade4-624dcac8f46e)
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Manual review.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47390 from dongjoon-hyun/SPARK-48927.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    47de346 View commit details
    Browse the repository at this point in the history
  125. [SPARK-48930][CORE] Redact awsAccessKeyId by including accesskey

    …pattern
    
    ### What changes were proposed in this pull request?
    
    This PR aims to redact `awsAccessKeyId` by including `accesskey` pattern.
    
    - **Apache Spark 4.0.0-preview1**
    There is no point in redacting `fs.s3a.access.key` because the same value is exposed via `fs.s3.awsAccessKeyId`, as shown below. We need to redact them all.
    
    ```
    $ AWS_ACCESS_KEY_ID=A AWS_SECRET_ACCESS_KEY=B bin/spark-shell
    ```
    
    ![Screenshot 2024-07-17 at 12 45 44](https://github.com/user-attachments/assets/e3040c5d-3eb9-4944-a6d6-5179b7647426)
    
    ### Why are the changes needed?
    
    Since Apache Spark 1.1.0, `AWS_ACCESS_KEY_ID` has been propagated as shown below. However, Apache Spark does not redact all of these configurations consistently.
    - apache#450
    
    https://github.com/apache/spark/blob/5d16c3134c442a5546251fd7c42b1da9fdf3969e/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L481-L486
    
    ### Does this PR introduce _any_ user-facing change?
    
    Users may see more redactions on configurations whose name contains `accesskey` case-insensitively. However, those configurations are highly likely to be related to the credentials.
    
    ### How was this patch tested?
    
    Pass the CIs with the newly added test cases.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47392 from dongjoon-hyun/SPARK-48930.
    
    Authored-by: Dongjoon Hyun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    dongjoon-hyun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    3f95864 View commit details
    Browse the repository at this point in the history
  126. [SPARK-48924][PS] Add a pandas-like make_interval helper function

    ### What changes were proposed in this pull request?
    Add a pandas-like `make_interval` helper function
    
    ### Why are the changes needed?
    factor it out as a helper function to be reusable
    
    ### Does this PR introduce _any_ user-facing change?
    No, internal change only
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47385 from zhengruifeng/ps_simplify_make_interval.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    425ada9 View commit details
    Browse the repository at this point in the history
  127. [SPARK-48510][CONNECT][FOLLOW-UP-MK2] Fix for UDAF toColumn API whe…

    …n running tests in Maven
    
    ### What changes were proposed in this pull request?
    
    This PR follows apache#47368 as another attempt to fix the broken tests. The previous attempt failed due to an NPE, caused by `Iterator.iterate` generating an **infinite** flow of values.
    
    I can't reproduce the previous issue locally, so my fix is purely based on the error message: https://github.com/apache/spark/actions/runs/9974746135/job/27562881993.
    
    ### Why are the changes needed?
    
    Because previous one failed.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Locally.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47387 from xupefei/udaf-tocolumn-fixup-mk2.
    
    Authored-by: Paddy Xu <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    xupefei authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    45aca76 View commit details
    Browse the repository at this point in the history
  128. [SPARK-48926][SQL][TESTS] Use checkError method to optimize excepti…

    …on check logic related to `UNRESOLVED_COLUMN` error classes
    
    ### What changes were proposed in this pull request?
    
    This PR aims to use the `checkError` method to optimize the exception-checking logic related to the `UNRESOLVED_COLUMN` error classes.
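    
    For illustration, a hedged sketch of what such a test can look like (the suite name, table, and expected parameter values are made up for this example; only the `checkError` pattern is the point):
    
    ```scala
    import org.apache.spark.sql.{AnalysisException, QueryTest}
    import org.apache.spark.sql.test.SharedSparkSession
    
    class UnresolvedColumnCheckSketch extends QueryTest with SharedSparkSession {
      test("unresolved column is reported via the UNRESOLVED_COLUMN error class") {
        withTable("t") {
          sql("CREATE TABLE t(a INT, b INT) USING parquet")
          // assert on the error class and its parameters instead of matching raw messages
          checkError(
            exception = intercept[AnalysisException] { sql("SELECT c FROM t") },
            errorClass = "UNRESOLVED_COLUMN.WITH_SUGGESTION",
            parameters = Map("objectName" -> "`c`", "proposal" -> "`a`, `b`"))
        }
      }
    }
    ```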
    
    ### Why are the changes needed?
    
    Unify the error class checks by using the `checkError` method.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass related test cases.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47389 from wayneguow/op_un_col.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    d7605f5 View commit details
    Browse the repository at this point in the history
  129. [SPARK-48932][BUILD] Upgrade commons-lang3 to 3.15.0

    ### What changes were proposed in this pull request?
    The pr aims to upgrade `commons-lang3` from `3.14.0` to `3.15.0`
    
    ### Why are the changes needed?
    - v3.14.0 VS v3.15.0
      apache/commons-lang@rel/commons-lang-3.14.0...rel/commons-lang-3.15.0
    
    - The new version brings some bug fixes, e.g.:
      apache/commons-lang#1140
      apache/commons-lang#1151
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47396 from panbingkun/SPARK-48932.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b0e535b View commit details
    Browse the repository at this point in the history
  130. [SPARK-48915][SQL][TESTS] Add some uncovered predicates(!=, <=, >, >=…

    …) in test cases of `GeneratedSubquerySuite`
    
    ### What changes were proposed in this pull request?
    
    This PR aims to add some predicates (!=, <=, >, >=) that are not covered in the test cases of `GeneratedSubquerySuite`.
    
    ### Why are the changes needed?
    
    Better coverage of current subquery tests in `GeneratedSubquerySuite`.
    For more information about subqueries in `postgresql`, refer to:
    https://www.postgresql.org/docs/current/functions-subquery.html#FUNCTIONS-SUBQUERY
    https://www.postgresql.org/docs/current/functions-comparisons.html#ROW-WISE-COMPARISON
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA and Manual testing with `GeneratedSubquerySuite`.
    ![image](https://github.com/user-attachments/assets/4b265def-a7a9-405e-94ce-e9902efb79fa)
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47386 from wayneguow/SPARK-48915.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    605b2f6 View commit details
    Browse the repository at this point in the history
  131. [SPARK-48900] Add reason field for cancelJobGroup and `cancelJobs…

    …WithTag`
    
    ### What changes were proposed in this pull request?
    
    This PR introduces the optional `reason` field for `cancelJobGroup` and `cancelJobsWithTag` in `SparkContext.scala`, while keeping the old APIs without the `reason`, similar to how `cancelJob` is implemented currently.
    
    ### Why are the changes needed?
    
    Today it is difficult to determine why a job, stage, or job group was canceled. We should leverage existing Spark functionality to provide a reason string explaining the cancellation cause, and should add new APIs to let us provide this reason when canceling job groups.
    
    **Details:**
    
    Since [SPARK-19549](https://issues.apache.org/jira/browse/SPARK-19549) "Allow providing reasons for stage/job cancelling" (Spark 2.2.0), Spark’s `cancelJob` and `cancelStage` methods accept an optional `reason: String` that is added to logging output and user-facing error messages when jobs or stages are canceled. In our internal calls to these methods, we should always supply a reason. For example, we should set an appropriate reason when the “kill” links are clicked in the Spark UI (see [code](https://github.com/apache/spark/blob/b14c1f036f8f394ad1903998128c05d04dd584a9/core/src/main/scala/org/apache/spark/ui/jobs/JobsTab.scala#L54C1-L55)).
    Other APIs currently lack a reason field. For example, `cancelJobGroup` and `cancelJobsWithTag` don’t provide any way to specify a reason, so we only see generic logs like “asked to cancel job group <group name>”. We should add the ability to pass in a group cancellation reason and thread it through into the scheduler’s logging and job failure reasons.
    
    This feature can be implemented in two PRs:
    
    1. Modify the current SparkContext and its downstream APIs to add the reason string, such as cancelJobGroup and cancelJobsWithTag
    
    2. Add reasons for all internal calls to these methods.
    
    **Note: This is the first of the two PRs to implement this new feature**
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it modifies the SparkContext API, allowing users to add an optional `reason: String` to `cancelJobsWithTag` and `cancelJobGroup`, while the old methods without the `reason` are also kept. This creates a more uniform interface where the user can supply an optional reason for all job/stage cancellation calls.
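    
    A short usage sketch of the new overloads, assuming a running `SparkContext` `sc` and the signatures added by this PR (the group name, tag, and reason strings are illustrative):
    
    ```scala
    sc.setJobGroup("nightly-etl", "nightly ETL jobs")
    // ... trigger jobs in this group ...
    sc.cancelJobGroup("nightly-etl", "superseded by an ad-hoc backfill") // new: with a reason
    sc.cancelJobGroup("nightly-etl")                                     // old API still works
    
    sc.addJobTag("adhoc-report")
    // ... trigger jobs carrying this tag ...
    sc.cancelJobsWithTag("adhoc-report", "requested by user")            // new: with a reason
    ```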
    
    ### How was this patch tested?
    
    New tests are added to `JobCancellationSuite` to test the reason fields for these calls.
    
    For the API changes in R and PySpark, tests are added to these files:
    - R/pkg/tests/fulltests/test_context.R
    - python/pyspark/tests/test_pin_thread.py
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
     No
    
    Closes apache#47361 from mingkangli-db/reason_job_cancellation.
    
    Authored-by: Mingkang Li <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    mingkangli-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    cb76d36 View commit details
    Browse the repository at this point in the history
  132. [SPARK-48623][CORE] Migrate FileAppender logs to structured logging

    ### What changes were proposed in this pull request?
    This PR migrates `src/main/scala/org/apache/spark/util/logging/FileAppender.scala` to comply with the scala style changes in apache#46947
    
    ### Why are the changes needed?
    This makes development and PR review of the structured logging migration easier.
    
    ### Does this PR introduce any user-facing change?
    No
    
    ### How was this patch tested?
    Tested by ensuring dev/scalastyle checks pass
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47394 from asl3/asl3/migratenewfiles.
    
    Authored-by: Amanda Liu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    asl3 authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    9f5f42a View commit details
    Browse the repository at this point in the history
  133. [SPARK-48752][PYTHON][CONNECT][DOCS] Introduce pyspark.logger for i…

    …mproved structured logging for PySpark
    
    ### What changes were proposed in this pull request?
    
    This PR introduces the `pyspark.logger` module to facilitate structured client-side logging for PySpark users.
    
    This module includes a `PySparkLogger` class that provides several methods for logging messages at different levels in a structured JSON format:
    - `PySparkLogger.info`
    - `PySparkLogger.warning`
    - `PySparkLogger.error`
    
    The logger can be easily configured to write logs to either the console or a specified file.
    
    ## DataFrame error log improvement
    
    This PR also improves the DataFrame API error logs by leveraging this new logging framework:
    
    ### **Before**
    
    We introduced structured logging in apache#45729, but the PySpark log is still hard to locate in the current structured logs, because it is hidden and mixed within a bunch of complex JVM stacktraces, and it's also not very Python-friendly:
    
    ```json
    {
      "ts": "2024-06-28T10:53:48.528Z",
      "level": "ERROR",
      "msg": "Exception in task 7.0 in stage 0.0 (TID 7)",
      "context": {
        "task_name": "task 7.0 in stage 0.0 (TID 7)"
      },
      "exception": {
        "class": "org.apache.spark.SparkArithmeticException",
        "msg": "[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"__truediv__\" was called from\n/.../spark/python/test_error_context.py:17\n",
        "stacktrace": [
          {
            "class": "org.apache.spark.sql.errors.QueryExecutionErrors$",
            "method": "divideByZeroError",
            "file": "QueryExecutionErrors.scala",
            "line": 203
          },
          {
            "class": "org.apache.spark.sql.errors.QueryExecutionErrors",
            "method": "divideByZeroError",
            "file": "QueryExecutionErrors.scala",
            "line": -1
          },
          {
            "class": "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1",
            "method": "project_doConsume_0$",
            "file": null,
            "line": -1
          },
          {
            "class": "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1",
            "method": "processNext",
            "file": null,
            "line": -1
          },
          {
            "class": "org.apache.spark.sql.execution.BufferedRowIterator",
            "method": "hasNext",
            "file": "BufferedRowIterator.java",
            "line": 43
          },
          {
            "class": "org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1",
            "method": "hasNext",
            "file": "WholeStageCodegenEvaluatorFactory.scala",
            "line": 50
          },
          {
            "class": "org.apache.spark.sql.execution.SparkPlan",
            "method": "$anonfun$getByteArrayRdd$1",
            "file": "SparkPlan.scala",
            "line": 388
          },
          {
            "class": "org.apache.spark.rdd.RDD",
            "method": "$anonfun$mapPartitionsInternal$2",
            "file": "RDD.scala",
            "line": 896
          },
          {
            "class": "org.apache.spark.rdd.RDD",
            "method": "$anonfun$mapPartitionsInternal$2$adapted",
            "file": "RDD.scala",
            "line": 896
          },
          {
            "class": "org.apache.spark.rdd.MapPartitionsRDD",
            "method": "compute",
            "file": "MapPartitionsRDD.scala",
            "line": 52
          },
          {
            "class": "org.apache.spark.rdd.RDD",
            "method": "computeOrReadCheckpoint",
            "file": "RDD.scala",
            "line": 369
          },
          {
            "class": "org.apache.spark.rdd.RDD",
            "method": "iterator",
            "file": "RDD.scala",
            "line": 333
          },
          {
            "class": "org.apache.spark.scheduler.ResultTask",
            "method": "runTask",
            "file": "ResultTask.scala",
            "line": 93
          },
          {
            "class": "org.apache.spark.TaskContext",
            "method": "runTaskWithListeners",
            "file": "TaskContext.scala",
            "line": 171
          },
          {
            "class": "org.apache.spark.scheduler.Task",
            "method": "run",
            "file": "Task.scala",
            "line": 146
          },
          {
            "class": "org.apache.spark.executor.Executor$TaskRunner",
            "method": "$anonfun$run$5",
            "file": "Executor.scala",
            "line": 644
          },
          {
            "class": "org.apache.spark.util.SparkErrorUtils",
            "method": "tryWithSafeFinally",
            "file": "SparkErrorUtils.scala",
            "line": 64
          },
          {
            "class": "org.apache.spark.util.SparkErrorUtils",
            "method": "tryWithSafeFinally$",
            "file": "SparkErrorUtils.scala",
            "line": 61
          },
          {
            "class": "org.apache.spark.util.Utils$",
            "method": "tryWithSafeFinally",
            "file": "Utils.scala",
            "line": 99
          },
          {
            "class": "org.apache.spark.executor.Executor$TaskRunner",
            "method": "run",
            "file": "Executor.scala",
            "line": 647
          },
          {
            "class": "java.util.concurrent.ThreadPoolExecutor",
            "method": "runWorker",
            "file": "ThreadPoolExecutor.java",
            "line": 1136
          },
          {
            "class": "java.util.concurrent.ThreadPoolExecutor$Worker",
            "method": "run",
            "file": "ThreadPoolExecutor.java",
            "line": 635
          },
          {
            "class": "java.lang.Thread",
            "method": "run",
            "file": "Thread.java",
            "line": 840
          }
        ]
      },
      "logger": "Executor"
    }
    
    ```
    
    ### **After**
    
    Now we can get an improved, simplified, and Python-friendly error log for DataFrame errors:
    
    ```json
    {
      "ts": "2024-06-28 19:53:48,563",
      "level": "ERROR",
      "logger": "DataFrameQueryContextLogger",
      "msg": "[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"__truediv__\" was called from\n/.../spark/python/test_error_context.py:17\n",
      "context": {
        "file": "/.../spark/python/test_error_context.py",
        "line_no": "17",
        "fragment": "__truediv__"
        "error_class": "DIVIDE_BY_ZERO"
      },
      "exception": {
        "class": "Py4JJavaError",
        "msg": "An error occurred while calling o52.showString.\n: org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012\n== DataFrame ==\n\"__truediv__\" was called from\n/Users/haejoon.lee/Desktop/git_repos/spark/python/test_error_context.py:22\n\n\tat org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:203)\n\tat org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)\n\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)\n\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)\n\tat org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)\n\tat org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)\n\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896)\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:896)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:369)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:333)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)\n\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:146)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n\tat org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1007)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2479)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2498)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2523)\n\tat org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1052)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n\tat org.apache.spark.rdd.RDD.withScope(RDD.scala:412)\n\tat org.apache.spark.rdd.RDD.collect(RDD.scala:1051)\n\tat org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:448)\n\tat org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4449)\n\tat org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3393)\n\tat org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4439)\n\tat 
org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:599)\n\tat org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4437)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:154)\n\tat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:263)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:118)\n\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:74)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:218)\n\tat org.apache.spark.sql.Dataset.withAction(Dataset.scala:4437)\n\tat org.apache.spark.sql.Dataset.head(Dataset.scala:3393)\n\tat org.apache.spark.sql.Dataset.take(Dataset.scala:3626)\n\tat org.apache.spark.sql.Dataset.getRows(Dataset.scala:294)\n\tat org.apache.spark.sql.Dataset.showString(Dataset.scala:330)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:568)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)\n\tat py4j.ClientServerConnection.run(ClientServerConnection.java:106)\n\tat java.base/java.lang.Thread.run(Thread.java:840)\n",
        "stacktrace": ["Traceback (most recent call last):", "  File \"/Users/haejoon.lee/Desktop/git_repos/spark/python/pyspark/errors/exceptions/captured.py\", line 272, in deco", "    return f(*a, **kw)", "  File \"/Users/haejoon.lee/anaconda3/envs/pyspark-dev-env/lib/python3.9/site-packages/py4j/protocol.py\", line 326, in get_return_value", "    raise Py4JJavaError(", "py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString.", ": org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 22012", "== DataFrame ==", "\"__truediv__\" was called from", "/Users/haejoon.lee/Desktop/git_repos/spark/python/test_error_context.py:22", "", "\tat org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:203)", "\tat org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)", "\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)", "\tat org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)", "\tat org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)", "\tat org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)", "\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)", "\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896)", "\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:896)", "\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)", "\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:369)", "\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:333)", "\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)", "\tat org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)", "\tat org.apache.spark.scheduler.Task.run(Task.scala:146)", "\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)", "\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)", "\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)", "\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)", "\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)", "\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)", "\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)", "\tat java.base/java.lang.Thread.run(Thread.java:840)", "\tat org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1007)", "\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)", "\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2479)", "\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2498)", "\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2523)", "\tat org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1052)", "\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)", "\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)", "\tat 
org.apache.spark.rdd.RDD.withScope(RDD.scala:412)", "\tat org.apache.spark.rdd.RDD.collect(RDD.scala:1051)", "\tat org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:448)", "\tat org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4449)", "\tat org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3393)", "\tat org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4439)", "\tat org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:599)", "\tat org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4437)", "\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:154)", "\tat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:263)", "\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:118)", "\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)", "\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:74)", "\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:218)", "\tat org.apache.spark.sql.Dataset.withAction(Dataset.scala:4437)", "\tat org.apache.spark.sql.Dataset.head(Dataset.scala:3393)", "\tat org.apache.spark.sql.Dataset.take(Dataset.scala:3626)", "\tat org.apache.spark.sql.Dataset.getRows(Dataset.scala:294)", "\tat org.apache.spark.sql.Dataset.showString(Dataset.scala:330)", "\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)", "\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)", "\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)", "\tat java.base/java.lang.reflect.Method.invoke(Method.java:568)", "\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)", "\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)", "\tat py4j.Gateway.invoke(Gateway.java:282)", "\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)", "\tat py4j.commands.CallCommand.execute(CallCommand.java:79)", "\tat py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)", "\tat py4j.ClientServerConnection.run(ClientServerConnection.java:106)", "\tat java.base/java.lang.Thread.run(Thread.java:840)"]
      }
    }
    ```
    
    ### Why are the changes needed?
    
    **Before**
    
    Currently we don't have a PySpark-dedicated logging module, so we have to manually set up and customize the Python logging module, for example:
    
    ```python
    logger = logging.getLogger("TestLogger")
    user = "test_user"
    action = "test_action"
    logger.info(f"User {user} takes an {action}")
    ```
    
    This logs the information as a plain, unstructured string:
    
    ```
    INFO:TestLogger:User test_user takes an test_action
    ```
    
    This is not very actionable, and it is hard to analyze since it is not well-structured.
    
    Alternatively, we can use Log4j from the JVM, which results in excessively detailed logs as shown in the example above, and that approach cannot even be applied to Spark Connect.
    
    **After**
    
    We can simply import and use `PySparkLogger` with minimal setup:
    
    ```python
    from pyspark.logger import PySparkLogger
    logger = PySparkLogger.getLogger("TestLogger")
    user = "test_user"
    action = "test_action"
    logger.info(f"User {user} takes an {action}", user=user, action=action)
    ```
    
    This logs the information in the following JSON format:
    
    ```json
    {
      "ts": "2024-06-28 19:44:19,030",
      "level": "WARNING",
      "logger": "TestLogger",
      "msg": "User test_user takes an test_action",
      "context": {
        "user": "test_user",
        "action": "test_action"
      }
    }
    ```
    
    **NOTE:** we can pass as many keyword arguments as we want to each logging method. These keyword arguments, such as `user` and `action` in the example, are included within the `"context"` field of the JSON log. This structure makes it easy to track and analyze the logs, for example as sketched below.
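
    Since each record is a single JSON object, the structured log can be loaded back into Spark and queried. A minimal, hedged sketch follows (the log path is a hypothetical JSON-lines file; the field names come from the example above):

    ```python
    # Hedged sketch: load the structured log back and filter on the "context" fields.
    logs = spark.read.json("/tmp/pyspark_app.log")  # path is an assumption
    logs.filter(logs.context.user == "test_user") \
        .select("ts", "level", "msg", "context.action") \
        .show(truncate=False)
    ```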
    
    ### Does this PR introduce _any_ user-facing change?
    
    No API changes, but the PySpark client-side logging is improved.
    
    Also added user-facing documentation "Logging in PySpark":
    
    <img width="1395" alt="Screenshot 2024-07-16 at 5 40 41 PM" src="https://github.com/user-attachments/assets/c77236aa-1c6f-4b5b-ad14-26ccdc474f59">
    
    Also added API reference:
    
    <img width="1417" alt="Screenshot 2024-07-16 at 5 40 58 PM" src="https://github.com/user-attachments/assets/6bb3fb23-6847-4086-8f4b-bcf9f4242724">
    
    ### How was this patch tested?
    
    Added UTs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47145 from itholic/pyspark_logger.
    
    Authored-by: Haejoon Lee <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    itholic authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    729ee6a View commit details
    Browse the repository at this point in the history
  134. [MINOR][SQL][TESTS] Enable test case testOrcAPI in `JavaDataFrameRe…

    …aderWriterSuite`
    
    ### What changes were proposed in this pull request?
    This PR enables the test case `testOrcAPI` in `JavaDataFrameReaderWriterSuite`. Since this test no longer depends on Hive classes, it can be run like the other test cases in this suite.
    
    ### Why are the changes needed?
    Enable test case `testOrcAPI` in `JavaDataFrameReaderWriterSuite`
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Pass GitHub Actions
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47400 from LuciferYang/minor-testOrcAPI.
    
    Authored-by: yangjie01 <[email protected]>
    Signed-off-by: yangjie01 <[email protected]>
    LuciferYang authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    82a6708 View commit details
    Browse the repository at this point in the history
  135. [SPARK-36680][SQL][FOLLOWUP] Files with options should be put into re…

    …solveDataSource function
    
    ### What changes were proposed in this pull request?
    
    When reading CSV, JSON and other files, pass the options parameter to the rule's `resolveDataSource` method so that the options take effect.
    
    This is a bug fix for [apache#46707](apache#46707) szehon-ho
    
    ### Why are the changes needed?
    
    For the following SQL, the options passed in do not take effect. This is because the rule's `resolveDataSource` method does not pass the options parameter when constructing the data source.
    
    ```sql
    SELECT * FROM csv.`/test/data.csv` WITH (`header` = true, 'delimiter' = '|')
    ```
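
    For comparison, a minimal hedged sketch (not part of this PR) of the same options passed through the DataFrame reader; the path is the one from the SQL above:

    ```python
    # Reader-side equivalent of the WITH (`header` = true, 'delimiter' = '|') clause.
    spark.read.options(header=True, delimiter="|").csv("/test/data.csv").show()
    ```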
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Unit test in SQLQuerySuite
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47370 from logze/hint-options.
    
    Authored-by: lizongze <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    logze authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b42e9b6 View commit details
    Browse the repository at this point in the history
  136. [SPARK-48388][SQL] Fix SET statement behavior for SQL Scripts

    ### What changes were proposed in this pull request?
    The `SET` statement is used to set config values, and it has a poorly designed grammar rule `#setConfiguration` that matches everything after `SET` - `SET .*?`. This conflicts with the usage of `SET` for setting session variables, and we needed to introduce the `SET (VAR | VARIABLE)` grammar rule to distinguish between setting config values and session variables - [SET VAR pull request](apache#40474).
    
    However, this is not SQL-standard behavior, so for SQL scripting ([JIRA](https://issues.apache.org/jira/browse/SPARK-48338)) we are opting to disable `SET` for configs and use it only for session variables. This enables us to use plain `SET` for assigning values to session variables. Config values can still be set from SQL scripts using `EXECUTE IMMEDIATE`, as sketched below.
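
    A minimal, hedged sketch of the resulting behavior inside a SQL script (illustrative only, not taken from the PR; it assumes SQL scripting is enabled in this build and that `counter` is declared as a session variable):

    ```python
    # Inside a script, SET assigns to session variables; configs go through EXECUTE IMMEDIATE.
    spark.sql("""
    BEGIN
      DECLARE counter INT DEFAULT 0;
      SET counter = counter + 1;                             -- session variable
      EXECUTE IMMEDIATE 'SET spark.sql.ansi.enabled = true'; -- config value
    END
    """)
    ```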
    
    This change simply reorders grammar rules to achieve the above behavior, and alters only the visitor functions where the name of a rule had to be changed or a completely new rule was added.
    
    ### Why are the changes needed?
    These changes are meant to resolve the issues of the poorly designed `SET` statement for the case of SQL scripts.
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    This PR is in a series of PRs that will introduce changes to sql() API to add support for SQL scripting, but for now, the API remains unchanged.
    In the future, the API will remain the same as well, but it will have new possibility to execute SQL scripts.
    
    ### How was this patch tested?
    Already existing tests should cover the changes.
    New tests for SQL scripts were added to:
    - `SqlScriptingParserSuite`
    - `SqlScriptingInterpreterSuite`
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Closes apache#47272 from davidm-db/sql_scripting_set_statement.
    
    Authored-by: David Milicevic <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    davidm-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    7fe5a1a View commit details
    Browse the repository at this point in the history
  137. [SPARK-48890][CORE][SS] Add Structured Streaming related fields to lo…

    …g4j ThreadContext
    
    ### What changes were proposed in this pull request?
    
    There is some special information needed for structured streaming queries. Specifically, each query has a query_id and run_id. Also, if using MicroBatchExecution (the default), there is a batch_id.
    
    A (query_id, run_id, batch_id) tuple identifies the microbatch a streaming query runs. Adding these fields to the ThreadContext helps especially when there are multiple queries running.
    
    ### Why are the changes needed?
    
    Logging improvement
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Run a streaming query through spark-submit, here are sample logs (search for query_id, run_id, or batch_id):
    
    ```
    {"ts":"2024-07-15T19:56:01.577Z","level":"INFO","msg":"Starting new streaming query.","context":{"query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4"},"logger":"MicroBatchExecution"}
    {"ts":"2024-07-15T19:56:01.579Z","level":"INFO","msg":"Stream started from {}","context":{"query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4","streaming_offsets_start":"{}"},"logger":"MicroBatchExecution"}
    {"ts":"2024-07-15T19:56:01.602Z","level":"INFO","msg":"Writing atomically to file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/0 using temp file file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/.0.566e3ae0-a15e-438c-82c1-26cc109746b3.tmp","context":{"final_path":"file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/0","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4","temp_path":"file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/.0.566e3ae0-a15e-438c-82c1-26cc109746b3.tmp"},"logger":"CheckpointFileManager"}
    {"ts":"2024-07-15T19:56:01.675Z","level":"INFO","msg":"Renamed temp file file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/.0.566e3ae0-a15e-438c-82c1-26cc109746b3.tmp to file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/0","context":{"final_path":"file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/0","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4","temp_path":"file:/private/var/folders/9k/pbxb4_690wv4smwhwbzwmqkw0000gp/T/temporary-037d26ae-0d6f-4771-9de3-d028730520e0/offsets/.0.566e3ae0-a15e-438c-82c1-26cc109746b3.tmp"},"logger":"CheckpointFileManager"}
    {"ts":"2024-07-15T19:56:01.676Z","level":"INFO","msg":"Committed offsets for batch 0. Metadata OffsetSeqMetadata(0,1721073361582,HashMap(spark.sql.streaming.stateStore.providerClass -> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider, spark.sql.streaming.stateStore.rocksdb.formatVersion -> 5, spark.sql.streaming.statefulOperator.useStrictDistribution -> true, spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion -> 2, spark.sql.streaming.multipleWatermarkPolicy -> min, spark.sql.streaming.aggregation.stateFormatVersion -> 2, spark.sql.shuffle.partitions -> 200, spark.sql.streaming.join.stateFormatVersion -> 2, spark.sql.streaming.stateStore.compression.codec -> lz4))","context":{"batch_id":"0","offset_sequence_metadata":"OffsetSeqMetadata(0,1721073361582,HashMap(spark.sql.streaming.stateStore.providerClass -> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider, spark.sql.streaming.stateStore.rocksdb.formatVersion -> 5, spark.sql.streaming.statefulOperator.useStrictDistribution -> true, spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion -> 2, spark.sql.streaming.multipleWatermarkPolicy -> min, spark.sql.streaming.aggregation.stateFormatVersion -> 2, spark.sql.shuffle.partitions -> 200, spark.sql.streaming.join.stateFormatVersion -> 2, spark.sql.streaming.stateStore.compression.codec -> lz4))","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4"},"logger":"MicroBatchExecution"}
    {"ts":"2024-07-15T19:56:02.074Z","level":"INFO","msg":"Code generated in 97.122375 ms","context":{"batch_id":"0","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4","total_time":"97.122375"},"logger":"CodeGenerator"}
    {"ts":"2024-07-15T19:56:02.125Z","level":"INFO","msg":"Start processing data source write support: MicroBatchWrite[epoch: 0, writer: org.apache.spark.sql.execution.datasources.noop.NoopStreamingWrite$20ba1e29]. The input RDD has 1} partitions.","context":{"batch_id":"0","batch_write":"MicroBatchWrite[epoch: 0, writer: org.apache.spark.sql.execution.datasources.noop.NoopStreamingWrite$20ba1e29]","count":"1","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4"},"logger":"WriteToDataSourceV2Exec"}
    {"ts":"2024-07-15T19:56:02.129Z","level":"INFO","msg":"Starting job: start at NativeMethodAccessorImpl.java:0","context":{"batch_id":"0","call_site_short_form":"start at NativeMethodAccessorImpl.java:0","query_id":"094ebe4a-30a3-4541-90af-ca238e4e6697","run_id":"67b161c5-83e5-430a-a905-04815a0002f4"},"logger":"SparkContext"}
    {"ts":"2024-07-15T19:56:02.135Z","level":"INFO","msg":"Got job 0 (start at NativeMethodAccessorImpl.java:0) with 1 output partitions","context":{"call_site_short_form":"start at NativeMethodAccessorImpl.java:0","job_id":"0","num_partitions":"1"},"logger":"DAGScheduler"}
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47340 from WweiL/structured-logging-streaming-id-aware.
    
    Authored-by: Wei Liu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    WweiL authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    5ec284f View commit details
    Browse the repository at this point in the history
  138. [SPARK-48921][SQL] ScalaUDF encoders in subquery should be resolved f…

    …or MergeInto
    
    ### What changes were proposed in this pull request?
    
    We got a customer issue where a `MergeInto` query on an Iceberg table worked earlier but fails after upgrading to Spark 3.4.
    
    The error looks like
    
    ```
    Caused by: org.apache.spark.SparkRuntimeException: Error while decoding: org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to nullable on unresolved object
    upcast(getcolumnbyordinal(0, StringType), StringType, - root class: java.lang.String).toString.
    ```
    
    The source table of `MergeInto` uses `ScalaUDF`. The error happens when Spark invokes the deserializer of input encoder of the `ScalaUDF` and the deserializer is not resolved yet.
    
    The encoders of ScalaUDF are resolved by the rule `ResolveEncodersInUDF` which will be applied at the end of analysis phase.
    
    During rewriting `MergeInto` to `ReplaceData` query, Spark creates an `Exists` subquery and `ScalaUDF` is part of the plan of the subquery. Note that the `ScalaUDF` is already resolved by the analyzer.
    
    Then, the `ResolveSubquery` rule, which resolves subqueries, only resolves the subquery plan if it is not resolved yet. Because the subquery containing the `ScalaUDF` is already resolved, the rule skips it, so `ResolveEncodersInUDF` won't be applied on it. As a result, the analyzed `ReplaceData` query contains a `ScalaUDF` with unresolved encoders, which causes the error.
    
    This patch modifies `ResolveSubquery` so it resolves the subquery plan if it is not fully analyzed, making sure the subquery plan is fully analyzed.
    
    This patch moves `ResolveEncodersInUDF` rule before rewriting `MergeInto` to make sure the `ScalaUDF` in the subquery plan is fully analyzed.
    
    ### Why are the changes needed?
    
    Fixing production query error.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, fixing user-facing issue.
    
    ### How was this patch tested?
    
    Manually tested with a `MergeInto` query and added a unit test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47380 from viirya/fix_subquery_resolve.
    
    Lead-authored-by: Liang-Chi Hsieh <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    bba6cea View commit details
    Browse the repository at this point in the history
  139. [SPARK-48934][SS] Python datetime types converted incorrectly for set…

    …ting timeout in applyInPandasWithState
    
    ### What changes were proposed in this pull request?
    Fix the way applyInPandasWithState's setTimeoutTimestamp() handles a datetime argument.
    
    ### Why are the changes needed?
    In applyInPandasWithState(), when state.setTimeoutTimestamp() is passed a value of datetime.datetime type, it doesn't function as expected. This change fixes it.
    It also fixes another bug where VALUE_NOT_POSITIVE is reported when the converted value is 0.
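
    A minimal sketch of the affected API, assuming a hypothetical streaming DataFrame `sdf` keyed by `id` with a watermark defined (required for `EventTimeTimeout`):

    ```python
    import datetime

    import pandas as pd
    from pyspark.sql.streaming.state import GroupStateTimeout

    def func(key, pdf_iter, state):
        # Passing a datetime.datetime here is the case fixed by this change.
        state.setTimeoutTimestamp(datetime.datetime(2024, 7, 22, 0, 0))
        total = sum(len(pdf) for pdf in pdf_iter)
        yield pd.DataFrame({"id": [key[0]], "count": [total]})

    out = sdf.groupBy("id").applyInPandasWithState(
        func,
        outputStructType="id string, count long",
        stateStructType="count long",
        outputMode="update",
        timeoutConf=GroupStateTimeout.EventTimeTimeout,
    )
    ```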
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Added unit test coverage for this scenario.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47398 from siying/state_set_timeout.
    
    Lead-authored-by: Siying Dong <[email protected]>
    Co-authored-by: Siying Dong <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    106c8d4 View commit details
    Browse the repository at this point in the history
  140. [SPARK-48495][DOCS][FOLLOW-UP] Fix Table Markdown in Shredding.md

    Minor change that shouldn't require a JIRA: fix the unbalanced row in the example table of Shredding.md.
    
    Closes apache#47407 from RussellSpitzer/patch-1.
    
    Authored-by: Russell Spitzer <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    RussellSpitzer authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    fceb916 View commit details
    Browse the repository at this point in the history
  141. [SPARK-48915][SQL][TESTS][FOLLOWUP] Add some uncovered predicates(!=,…

    … <, <=, >, >=) for correlation in `GeneratedSubquerySuite`
    
    ### What changes were proposed in this pull request?
    
    In PR apache#47386, we improved coverage of the predicate types of scalar subqueries in the WHERE clause.
    As a follow-up, this PR aims to add some uncovered predicates (!=, <, <=, >, >=) for correlation in `GeneratedSubquerySuite`.
    
    ### Why are the changes needed?
    
    Better coverage of current subquery tests with correlation in `GeneratedSubquerySuite`.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47399 from wayneguow/SPARK-48915_follow_up.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    16b076e View commit details
    Browse the repository at this point in the history
  142. Revert "[SPARK-47307][DOCS][FOLLOWUP] Add a migration guide for the b…

    …ehavior change of base64 function"
    
    This reverts commit b2e0a4d.
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    9460db8 View commit details
    Browse the repository at this point in the history
  143. [SPARK-48933][BUILD] Upgrade protobuf-java to 3.25.3

    ### What changes were proposed in this pull request?
    The pr aims to upgrade `protobuf-java` from `3.25.1` to `3.25.3`.
    
    ### Why are the changes needed?
    - v3.25.1 vs v3.25.3:
      protocolbuffers/protobuf@v3.25.1...v3.25.3
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47397 from panbingkun/SPARK-48933.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    aa36786 View commit details
    Browse the repository at this point in the history
  144. [SPARK-48940][BUILD] Upgrade Arrow to 17.0.0

    ### What changes were proposed in this pull request?
    The pr aims to upgrade `arrow` from `16.1.0` to `17.0.0`.
    
    ### Why are the changes needed?
    The full release notes: https://arrow.apache.org/release/17.0.0.html
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47409 from panbingkun/SPARK-48940.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    e73b7f9 View commit details
    Browse the repository at this point in the history
  145. [SPARK-48498][SQL][FOLLOWUP] do padding for char-char comparison

    ### What changes were proposed in this pull request?
    
    This is a followup of apache#46832 to handle a missing case: char-char comparison. We should pad both sides if `READ_SIDE_CHAR_PADDING` is not enabled.
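
    A hedged illustration of the scenario (table names are hypothetical; `READ_SIDE_CHAR_PADDING` is assumed to correspond to the `spark.sql.readSideCharPadding` config):

    ```python
    spark.conf.set("spark.sql.readSideCharPadding", "false")
    spark.sql("CREATE TABLE t1 (c CHAR(5)) USING parquet")
    spark.sql("CREATE TABLE t2 (c CHAR(8)) USING parquet")
    # With this follow-up, both sides of the CHAR-CHAR comparison are padded before comparing.
    spark.sql("SELECT * FROM t1 JOIN t2 ON t1.c = t2.c").show()
    ```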
    
    ### Why are the changes needed?
    
    Bug fix for the case where people disable read-side char padding.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No, because it's a follow-up and the original PR is not released yet.
    
    ### How was this patch tested?
    
    new tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47412 from cloud-fan/char.
    
    Authored-by: Wenchen Fan <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    cloud-fan authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    16b32ed View commit details
    Browse the repository at this point in the history
  146. [SPARK-47307][SQL][FOLLOWUP] Promote spark.sql.legacy.chunkBase64Stri…

    …ng.enabled from a legacy/internal config to a regular/public one
    
    ### What changes were proposed in this pull request?
    
    + Promote spark.sql.legacy.chunkBase64String.enabled from a legacy/internal config to a regular/public one.
    + Add test cases for unbase64
    
    ### Why are the changes needed?
    
    Keep the same behavior as before. More details: apache#47303 (comment)
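
    A hedged sketch of the restored behavior (the config name is taken from this PR; the output shape is an assumption based on the MIME-style chunking discussed in the linked comment):

    ```python
    # With chunking enabled, base64 output longer than 76 characters is split across lines.
    spark.conf.set("spark.sql.legacy.chunkBase64String.enabled", "true")
    spark.sql("SELECT base64(repeat('a', 100)) AS encoded").show(truncate=False)
    ```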
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, it reverts the behavior change introduced in apache#47303.
    
    ### How was this patch tested?
    
    existing unit test
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47410 from wForget/SPARK-47307_followup.
    
    Lead-authored-by: wforget <[email protected]>
    Co-authored-by: Kent Yao <[email protected]>
    Signed-off-by: Kent Yao <[email protected]>
    2 people authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    cfe2526 View commit details
    Browse the repository at this point in the history
  147. [SPARK-48946][SQL] NPE in redact method when session is null

    ### What changes were proposed in this pull request?
    
    If we call the DataSourceV2ScanExecBase redact method from a thread that doesn't have a session in its thread local, we get an NPE. Getting stringRedactionPattern from conf prevents this problem, as conf checks whether the session is null. We already use this approach in the DataSourceScanExec trait.
    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L93-L95
    
    ### Why are the changes needed?
    
    To prevent NPE when session is null.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing UTs.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47419 from mikoszilard/SPARK-48946.
    
    Authored-by: Szilard Miko <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    mikoszilard authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    e7a1105 View commit details
    Browse the repository at this point in the history
  148. [SPARK-48836] Integrate SQL schema with state schema/metadata

    ### What changes were proposed in this pull request?
    
    This PR makes the TWS operator write the "real" schema of the state variables, initialized on the executors, into the `StateSchemaV3` file written by the driver. We'll integrate the SQL schema of the state variables with this [StateSchemaV3 implementation PR](apache#47104).
    
    ### Why are the changes needed?
    
    When reloading the state after query restart, we'll need the schema/encoder of the state variables before restart.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Integration tests in `TransformWithStateSuite` and `TransformWithStateTTLSuite` that tests with all state variable types to have correct schema. Existing integration tests in `TransformWith*State(TTL)Suite` for verifying SQL serialization is correct.
    Existing unit test suites & newly added unit suites in `ValueStateSuite`, `ListStateSuite`, `MapStateSuite`, `TimerSuite` for non-primitive types.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47257 from jingz-db/metadata-schema-compatible.
    
    Authored-by: jingz-db <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    36a3f96 View commit details
    Browse the repository at this point in the history
  149. [SPARK-48945][PYTHON] Simplify regex functions with lit

    ### What changes were proposed in this pull request?
    Simplify a group of functions with `lit`.
    
    ### Why are the changes needed?
    Code cleanup; these branchings are not necessary.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47417 from zhengruifeng/py_func_simplity_lit.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    887c645 View commit details
    Browse the repository at this point in the history
  150. [SPARK-48944][CONNECT] Unify the JSON-format schema handling in Conne…

    …ct Server
    
    ### What changes were proposed in this pull request?
    Simplify the JSON-format schema handling in Connect Server, by introducing a helper function `extractDataTypeFromJSON`
    
    ### Why are the changes needed?
    to unify the schema handling
    
    ### Does this PR introduce _any_ user-facing change?
    No, minor refactoring
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47415 from zhengruifeng/simplfy_from_json.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    4d691c9 View commit details
    Browse the repository at this point in the history
  151. [SPARK-48938][PYTHON] Improve error messages when registering Python …

    …UDTFs
    
    ### What changes were proposed in this pull request?
    
    This PR improves the error messages when registering Python UDTFs.
    Before this PR:
    ```python
    class TestUDTF:
       ...
    
    spark.udtf.register("test_udtf", TestUDTF)
    ```
    This fails with
    ```
    AttributeError: type object "TestUDTF" has no attribute "evalType"
    ```
    After this PR:
    ```python
    spark.udtf.register("test_udtf", TestUDTF)
    ```
    Now we have a nicer error:
    ```
    [CANNOT_REGISTER_UDTF] Cannot register the UDTF 'test_udtf': expected a 'UserDefinedTableFunction'. Please make sure the UDTF is correctly defined as a class, and then either wrap it in the `udtf()` function or annotate it with `udtf(...)`.`
    ```
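
    A minimal sketch of the intended usage (the class body here is illustrative): wrap the class with `udtf()` before registering it.

    ```python
    from pyspark.sql.functions import udtf

    @udtf(returnType="word: string")
    class TestUDTF:
        def eval(self, text: str):
            for word in text.split(" "):
                yield (word,)

    spark.udtf.register("test_udtf", TestUDTF)
    spark.sql("SELECT * FROM test_udtf('hello world')").show()
    ```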
    
    ### Why are the changes needed?
    
    To improve usability.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No
    
    ### How was this patch tested?
    
    Existing and new unit tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47408 from allisonwang-db/spark-48938-udtf-register-err-msg.
    
    Authored-by: allisonwang-db <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    allisonwang-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    adba41e View commit details
    Browse the repository at this point in the history
  152. [SPARK-48592][INFRA] Add structured logging style script and GitHub w…

    …orkflow
    
    ### What changes were proposed in this pull request?
    This PR adds a check for Scala logging messages that use logInfo, logWarning, or logError and contain variables without the MDC wrapper.
    
    Example error output:
    ```
    [error] spark/mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:225:4
    [error] Logging message should use Structured Logging Framework style, such as log"...${MDC(TASK_ID, taskId)..."
                    Refer to the guidelines in the file `internal/Logging.scala`.
    ```
    
    ### Why are the changes needed?
    This makes development and PR review of the structured logging migration easier.
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Manual test, verified it will throw errors on invalid logging messages.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47239 from asl3/structuredlogstylescript.
    
    Authored-by: Amanda Liu <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    asl3 authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    4d7d1d9 View commit details
    Browse the repository at this point in the history
  153. [SPARK-48954] try_mod() replaces try_remainder()

    ### What changes were proposed in this pull request?
    
    For consistency, try_remainder() gets renamed to try_mod().
    This change is Spark 4.0.0 only, so no config is needed.
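
    A minimal usage sketch (the expected output is an assumption based on the try_* semantics: a zero divisor returns NULL instead of failing):

    ```python
    # try_mod(dividend, divisor): NULL on a zero divisor, the remainder otherwise.
    spark.sql("SELECT try_mod(7, 3) AS a, try_mod(7, 0) AS b").show()
    # +---+----+
    # |  a|   b|
    # +---+----+
    # |  1|NULL|
    # +---+----+
    ```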
    
    ### Why are the changes needed?
    
    To keep consistent naming.
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes, replaces try_remainder() with try_mod()
    
    ### How was this patch tested?
    
    Existing try_remainder() tests
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No
    
    Closes apache#47427 from srielau/SPARK-48954-try-mod.
    
    Authored-by: Serge Rielau <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
    srielau authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    a243921 View commit details
    Browse the repository at this point in the history
  154. [SPARK-48891][SS] Refactor StateSchemaCompatibilityChecker to unify a…

    …ll state schema formats
    
    ### What changes were proposed in this pull request?
    Refactor StateSchemaCompatibilityChecker to unify all state schema formats
    
    ### Why are the changes needed?
    Needed to integrate future changes around the state data source reader and schema evolution, and to consolidate these changes:
    
    - Consolidates all state schema reader/writers in one place
    - Consolidates all validation logic through the same API
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added unit tests
    
    ```
    12:38:45.481 WARN org.apache.spark.sql.execution.streaming.state.StateSchemaCompatibilityCheckerSuite:
    
    ===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.execution.streaming.state.StateSchemaCompatibilityCheckerSuite, threads: rpc-boss-3-1 (daemon=true), ForkJoinPool.commonPool-worker-3 (daemon=true), ForkJoinPool.commonPool-worker-2 (daemon=true), shuffle-boss-6-1 (daemon=true), ForkJoinPool.commonPool-worker-1 (daemon=true) =====
    [info] Run completed in 12 seconds, 565 milliseconds.
    [info] Total number of tests run: 30
    [info] Suites: completed 1, aborted 0
    [info] Tests: succeeded 30, failed 0, canceled 0, ignored 0, pending 0
    [info] All tests passed.
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47359 from anishshri-db/task/SPARK-48891.
    
    Authored-by: Anish Shrigondekar <[email protected]>
    Signed-off-by: Jungtaek Lim <[email protected]>
    anishshri-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    b8ce429 View commit details
    Browse the repository at this point in the history
  155. [SPARK-48849][SS]Create OperatorStateMetadataV2 for the TransformWith…

    …StateExec operator
    ericm-db authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    1130d69 View commit details
    Browse the repository at this point in the history
  156. [SPARK-48955][SQL] ArrayCompact's datatype should be `containsNull …

    …= false`
    
    ### What changes were proposed in this pull request?
    `ArrayCompact`'s datatype should be `containsNull = false`
    
    ### Why are the changes needed?
    `ArrayCompact` - Removes null values from the array
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Added test
    
    before:
    ```
    scala> val df = spark.range(1).select(lit(Array(1,2,3)).alias("a"))
    val df: org.apache.spark.sql.DataFrame = [a: array<int>]
    
    scala> df.printSchema
    warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
    root
     |-- a: array (nullable = false)
     |    |-- element: integer (containsNull = true)
    
    scala> df.select(array_compact(col("a"))).printSchema
    warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
    root
     |-- array_compact(a): array (nullable = false)
     |    |-- element: integer (containsNull = true)
    ```
    
    after
    ```
    scala> df.select(array_compact(col("a"))).printSchema
    warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
    root
     |-- array_compact(a): array (nullable = false)
     |    |-- element: integer (containsNull = false)
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47430 from zhengruifeng/sql_array_compact_data_type.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    8a13602 View commit details
    Browse the repository at this point in the history
  157. [MINOR][PYTHON] Fix type hint for from_utc_timestamp and `to_utc_ti…

    …mestamp`
    
    ### What changes were proposed in this pull request?
    Fix type hint for `from_utc_timestamp` and `to_utc_timestamp`
    
    ### Why are the changes needed?
    The str type input should be treated as a literal string instead of a column name, as sketched below.
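
    A short sketch of the clarified behavior: the second argument is a literal timezone string, while the first string argument names a column.

    ```python
    from pyspark.sql import functions as F

    df = spark.createDataFrame([("1997-02-28 10:30:00",)], ["t"])
    # "t" is resolved as a column name; "America/Los_Angeles" is a literal timezone string.
    df.select(F.from_utc_timestamp("t", "America/Los_Angeles")).show()
    ```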
    
    ### Does this PR introduce _any_ user-facing change?
    doc change
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47429 from zhengruifeng/py_fix_hint_202407.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    039148f View commit details
    Browse the repository at this point in the history
  158. [MINOR][DOCS] Fix some typos in LZFBenchmark

    ### What changes were proposed in this pull request?
    
    This PR aims to fix some typos in `LZFBenchmark`.
    
    ### Why are the changes needed?
    
    Fix typos and avoid confusion.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47435 from wayneguow/lzf.
    
    Authored-by: Wei Guo <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    wayneguow authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    d3b85eb View commit details
    Browse the repository at this point in the history
  159. [SPARK-48962][INFRA] Make the input parameters of `workflows/benchmar…

    …k` selectable
    
    ### What changes were proposed in this pull request?
    The pr aims to make the `input parameters` of `workflows/benchmark` selectable.
    
    ### Why are the changes needed?
    - Before:
      <img width="311" alt="image" src="https://github.com/user-attachments/assets/da93ea8f-8791-4816-a5d9-f82c018fa819">
    
    - After:
      https://github.com/panbingkun/spark/actions/workflows/benchmark.yml
      <img width="318" alt="image" src="https://github.com/user-attachments/assets/0b9b01a0-96f6-4630-98d9-7d2709aafcd0">
    
    ### Does this PR introduce _any_ user-facing change?
    Yes. It is convenient for developers running `workflows/benchmark`: the input values change from free `text` to `selectable values`.
    
    ### How was this patch tested?
    Manually test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47438 from panbingkun/improve_workflow_dispatch.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Hyukjin Kwon <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    0fd23bf View commit details
    Browse the repository at this point in the history
  160. [SPARK-48941][PYTHON][ML] Replace RDD read / write API invocation wit…

    …h Dataframe read / write API
    
    ### What changes were proposed in this pull request?
    
    PySpark ML: replace RDD read/write API invocations with the DataFrame read/write API.
    
    ### Why are the changes needed?
    
    Follow-up of apache#47341
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Unit test.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    No.
    
    Closes apache#47411 from WeichenXu123/SPARK-48909-follow-up.
    
    Authored-by: Weichen Xu <[email protected]>
    Signed-off-by: Weichen Xu <[email protected]>
    WeichenXu123 authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    20064b8 View commit details
    Browse the repository at this point in the history
  161. [SPARK-48959][SQL] Make NoSuchNamespaceException extend `NoSuchData…

    …baseException` to restore the exception handling
    
    ### What changes were proposed in this pull request?
    Make `NoSuchNamespaceException` extend `NoSuchDatabaseException`.
    
    ### Why are the changes needed?
    1, apache#47276 made many SQL commands throw `NoSuchNamespaceException` instead of `NoSuchDatabaseException`. This is more than an end-user-facing change; it is a breaking change that breaks exception handling in third-party libraries in the ecosystem.
    
    2, `NoSuchNamespaceException` and `NoSuchDatabaseException` actually share the same error class `SCHEMA_NOT_FOUND`
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    CI
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No
    
    Closes apache#47433 from zhengruifeng/make_nons_nodb.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
    zhengruifeng authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    df3885d View commit details
    Browse the repository at this point in the history
  162. [SPARK-48963][INFRA] Support JIRA_ACCESS_TOKEN in translate-contrib…

    …utors.py
    
    ### What changes were proposed in this pull request?
    
    Support JIRA_ACCESS_TOKEN in translate-contributors.py
    
    ### Why are the changes needed?
    
    Remove the plaintext password in the JIRA_PASSWORD environment variable to prevent password leakage.

    ### Does this PR introduce _any_ user-facing change?
    No, infra only.
    
    ### How was this patch tested?
    
    Ran translate-contributors.py with 3.5.2 RC
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    no
    
    Closes apache#47440 from yaooqinn/SPARK-48963.
    
    Authored-by: Kent Yao <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    yaooqinn authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    4b60813 View commit details
    Browse the repository at this point in the history
  163. [SPARK-48958][BUILD] Upgrade zstd-jni to 1.5.6-4

    ### What changes were proposed in this pull request?
    The pr aims to upgrade `zstd-jni` from `1.5.6-3` to `1.5.6-4`.
    
    ### Why are the changes needed?
    - v1.5.6-3 vs v1.5.6-4:
    luben/zstd-jni@v1.5.6-3...v1.5.6-4
    
    ### Does this PR introduce _any_ user-facing change?
    No.
    
    ### How was this patch tested?
    Pass GA.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    No.
    
    Closes apache#47432 from panbingkun/SPARK-48958.
    
    Authored-by: panbingkun <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
    panbingkun authored and jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    f042b7d View commit details
    Browse the repository at this point in the history
  164. a working version, draft

    jingz-db committed Jul 22, 2024
    Configuration menu
    Copy the full SHA
    7898b72 View commit details
    Browse the repository at this point in the history