Change log

Generated on 2024-10-31

Release 24.10

Features

#11525 [FEA] If dump always is enabled dump before decoding the file
#11461 [FEA] Support non-UTC timezone for casting from date to timestamp
#11445 [FEA] Support format 'yyyyMMdd' in GetTimestamp operator
#11442 [FEA] Add in support for setting row group sizes for parquet
#11330 [FEA] Add companion metrics for all nsTiming metrics to measure time elapsed excluding semaphore wait
#5223 [FEA] Support array_join
#10968 [FEA] support min_by function
#10437 [FEA] Add Spark 3.5.2 snapshot support

Performance

#10799 [FEA] Optimize count distinct performance optimization with null columns reuse and post expand coalesce
#8301 [FEA] semaphore prioritization
#11234 Explore swapping build table for left outer joins
#11263 [FEA] Cluster/pack multi_get_json_object paths by common prefixes

Bugs Fixed

#11558 [BUG] test_sortmerge_join_ridealong fails on DB 13.3
#11573 [BUG] very long tail task is observed when many tasks are contending for PrioritySemaphore
#11367 [BUG] Error "table_view.cpp:36: Column size mismatch" when using approx_percentile on a string column
#11543 [BUG] test_yyyyMMdd_format_for_legacy_mode[DATAGEN_SEED=1727619674, TZ=UTC] failed GPU and CPU are not both null
#11500 [BUG] dataproc serverless Integration tests failing in json_matrix_test.py
#11384 [BUG] "rs. shuffle write time" negative values seen in app history log
#11509 [BUG] buildall no longer works
#11501 [BUG] test_yyyyMMdd_format_for_legacy_mode failed in Dataproc Serverless integration tests
#11502 [BUG] IT script failed get jars as we stop deploying intermediate jars since 24.10
#11479 [BUG] spark400 build failed do not conform to class UnaryExprMeta's type parameter
#8558 [BUG] from_json generated inconsistent result comparing with CPU for input column with nested json strings
#11485 [BUG] Integration tests failing in join_test.py
#11481 [BUG] non-utc integration tests failing in json_test.py
#10911 from_json: when input is a bad json string, rapids would throw an exception.
#10457 [BUG] ScanJson and JsonToStructs allow unquoted control chars by default
#10479 [BUG] JsonToStructs and ScanJson should return null for non-numeric, non-boolean non-quoted strings
#10534 [BUG] Need Improved JSON Validation
#11436 [BUG] Mortgage unit tests fail with RAPIDS shuffle manager
#11437 [BUG] array and map casts to string tests failed
#11463 [BUG] hash_groupby_approx_percentile failed assert is None
#11465 [BUG] java.lang.NoClassDefFoundError: org/apache/spark/BuildInfo$ in non-databricks environment
#11359 [BUG] a couple of arithmetic_ops_test.py cases failed mismatching cpu and gpu values with [DATAGEN_SEED=1723985531, TZ=UTC, INJECT_OOM]
#11392 [AUDIT] Handle IgnoreNulls Expressions for Window Expressions
#10770 [BUG] Slow/no progress with cascaded pandas udfs/mapInPandas in Databricks
#11397 [BUG] We should not be using copyWithBooleanColumnAsValidity unless we can prove it is 100% safe
#11372 [BUG] spark400 failed compiling datagen_2.13
#11364 [BUG] Missing numRows in the ColumnarBatch created in GpuBringBackToHost
#11350 [BUG] spark400 compile failed in scala213
#11346 [BUG] databrick nightly failing with not able to get spark-version-info.properties
#9604 [BUG] Delta Lake metadata query detection can trigger extra file listing jobs
#11318 [BUG] GPU query is case sensitive on Hive text table's column name
#10596 [BUG] ScanJson and JsonToStructs does not deal with escaped single quotes properly
#10351 [BUG] test_from_json_mixed_types_list_struct failed
#11294 [BUG] binary-dedupe leaves around a copy of "unshimmed" class files in spark-shared
#11183 [BUG] Failed to split an empty string with error "ai.rapids.cudf.CudfException: parallel_for failed: cudaErrorInvalidDevice: invalid device ordinal"
#11008 Fix tests failures in ast_test.py
#11265 [BUG] segfaults seen in cuDF after prefetch calls intermittently
#11025 Fix tests failures in date_time_test.py
#11065 [BUG] Spark Connect Server (3.5.1) Can Not Running Correctly

PRs

#11676 Fix race condition with Parquet filter pushdown modifying shared hadoop Configuration
#11626 Update latest changelog [skip ci]
#11624 Update the download link [skip ci]
#11577 Update latest changelog [skip ci]
#11576 Update rapids JNI and private dependency to 24.10.0
#11582 [DOC] update doc for 24.10 release [skip ci]
#11588 backport fixes of #11573 to branch 24.10
#11569 Have "dump always" dump input files before trying to decode them
#11567 Fix test case unix_timestamp(col, 'yyyyMMdd') failed for Africa/Casablanca timezone and LEGACY mode
#11496 Update test now that code is fixed
#11548 Fix negative rs. shuffle write time
#11545 Update test case related to LEGACY datetime format to unblock nightly CI
#11515 Propagate default DIST_PROFILE_OPT profile to Maven in buildall
#11497 Update from_json to use new cudf features
#11516 Deploy all submodules for default sparkver in nightly [skip ci]
#11484 Fix FileAlreadyExistsException in LORE dump process
#11457 GPU device watermark metrics
#11507 Replace libmamba-solver with mamba command [skip ci]
#11503 Download artifacts via wget [skip ci]
#11490 Use UnaryLike instead of UnaryExpression
#10798 Optimizing Expand+Aggregate in sqls with many count distinct
#11366 Enable parquet suites from Spark UT
#11477 Install cuDF-py against python 3.10 on Databricks
#11462 Support non-UTC timezone for casting from date type to timestamp type
#11449 Support yyyyMMdd in GetTimestamp operator for LEGACY mode
#11456 Enable tests for all JSON white space normalization
#11483 Use reusable auto-merge workflow [skip ci]
#11482 Fix a json test for non utc time zone
#11464 Use improved CUDF JSON validation
#11474 Enable tests after string_split was fixed
#11473 Revert "Skip test_hash_groupby_approx_percentile byte and double test…
#11466 Replace scala.util.Try with a try statement in the DBR buildinfo
#11469 Skip test_hash_groupby_approx_percentile byte and double tests tempor…
#11429 Fixed some of the failing parquet_tests
#11455 Log DBR BuildInfo
#11451 xfail array and map cast to string tests
#11331 Add companion metrics for all nsTiming metrics without semaphore
#11421 [DOC] remove the redundant archive link [skip ci]
#11308 Dynamic Shim Detection for build Process
#11427 Update CI scripts to work with the "Dynamic Shim Detection" change [skip ci]
#11425 Update signoff usage [skip ci]
#11420 Add in array_join support
#11418 stop using copyWithBooleanColumnAsValidity
#11411 Fix asymmetric join crash when stream side is empty
#11395 Fix a Pandas UDF slowness issue
#11371 Support MinBy and MaxBy for non-float ordering
#11399 stop using copyWithBooleanColumnAsValidity
#11389 prevent duplicate queueing in the prio semaphore
#11291 Add distinct join support for right outer joins
#11396 Drop cudf-py python 3.9 support [skip ci]
#11393 Revert work-around for empty split-string
#11334 Add support for Spark 3.5.2
#11388 JSON tests for corrected date, timestamp, and mixed types
#11375 Fix spark400 build in datagen and tests
#11376 Create a PrioritySemaphore to back the GpuSemaphore
#11383 Fix nightly snapshots being downloaded in premerge build
#11368 Move SparkRapidsBuildInfoEvent to its own file
#11329 Change reference to MapUtils into JSONUtils
#11365 Set numRows for the ColumnBatch created in GpuBringBackToHost
#11363 Fix failing test compile for Spark 4.0.0
#11362 Add tests for repeated JSON columns/keys
#11321 conform dependency list in 341db to previous versions style
#10604 Add string escaping JSON tests to the test_json_matrix
#11328 Swap build side for outer joins when natural build side is explosive
#11358 Fix download doc [skip ci]
#11357 Fix auto merge conflict 11354 [skip ci]
#11347 Revert "Fix the mismatching default configs in integration tests (#11283)"
#11323 replace inputFiles with location.rootPaths.toString
#11340 Audit script - Check commits from sql-hive directory [skip ci]
#11283 Fix the mismatching default configs in integration tests
#11327 Make hive column matches not case-sensitive
#11324 Append ustcfy to blossom-ci whitelist [skip ci]
#11325 Fix auto merge conflict 11317 [skip ci]
#11319 Update passing JSON tests after list support added in CUDF
#11307 Safely close multiple resources in RapidsBufferCatalog
#11313 Fix auto merge conflict 10845 11310 [skip ci]
#11312 Add jihoonson as an authorized user for blossom-ci [skip ci]
#11302 Fix display issue of lore.md
#11301 Skip deploying non-critical intermediate artifacts [skip ci]
#11299 Enable get_json_object by default and remove legacy version
#11289 Use the new chunked API from multi-get_json_object
#11295 Remove redundant classes from the dist jar and unshimmed list
#11284 Use distinct count to estimate join magnification factor
#11288 Move easy unshimmed classes to sql-plugin-api
#11285 Remove files under tools/generated_files/spark31* [skip ci]
#11280 Asynchronously copy table data to the host during shuffle
#11258 Explicitly disable ANSI mode for ast_test.py
#11267 Update the rapids JNI and private dependency version to 24.10.0-SNAPSHOT
#11241 Auto merge PRs to branch-24.10 from branch-24.08 [skip ci]
#11231 Cache dependencies for scala 2.13 [skip ci]

Release 24.08

Features

#9259 [FEA] Create Spark 4.0.0 shim and build env
#10366 [FEA] It would be nice if we could support Hive-style write bucketing table
#10987 [FEA] Implement lore framework to support all operators.
#11087 [FEA] Support regex pattern with brackets when rewrite to PrefixRange pattern in rlike
#22 [FEA] Add support for bucketed writes
#9939 [FEA] GpuInsertIntoHiveTable supports parquet format

Performance

#8750 [FEA] Rework GpuSubstringIndex to use cudf::slice_strings
#7404 [FEA] explore a hash agg passthrough on partial aggregates
#10976 Rewrite `pattern1

Bugs Fixed

#11287 [BUG] String split APIs on empty string produce incorrect result
#11270 [BUG] test_regexp_replace[DATAGEN_SEED=1722297411, TZ=UTC] hanging there forever in pre-merge CI intermittently
#9682 [BUG] Casting FLOAT64 to DECIMAL(12,7) produces different rows from Apache Spark CPU
#10809 [BUG] cast(9.95 as decimal(3,1)), actual: 9.9, expected: 10.0
#11266 [BUG] test_broadcast_hash_join_constant_keys failed in databricks runtimes
#11243 [BUG] ArrayIndexOutOfBoundsException on a left outer join
#11030 Fix tests failures in string_test.py
#11245 [BUG] mvn verify for the source-javadoc fails and no pre-merge check catches it
#11223 [BUG] Remove unreferenced CUDF_VER=xxx in the CI script
#11114 [BUG] Update nightly tests for Scala 2.13 to use JDK 17 only
#11229 [BUG] test_delta_name_column_mapping_no_field_ids fails on Spark
#11031 Fix tests failures in multiple files
#10948 Figure out why MapFromArrays appears in the tests for hive parquet write
#11018 Fix tests failures in hash_aggregate_test.py
#11173 [BUG] The rs. serialization time metric is misleading
#11017 Fix tests failures in url_test.py
#11201 [BUG] Delta Lake tables with name mapping can throw exceptions on read
#11175 [BUG] Clean up unused and duplicated 'org/roaringbitmap' folder in the spark3xx shims
#11196 [BUG] pipeline failed due to class not found exception: NoClassDefFoundError: com/nvidia/spark/rapids/GpuScalar
#11189 [BUG] regression in NDS after PR #11170
#11167 [BUG] UnsupportedOperationException during delta write with optimize()
#11172 [BUG] get_json_object returns wrong output with wildcard path
#11148 [BUG] Integration test test_write_hive_bucketed_table fails
#11155 [BUG] ArrayIndexOutOfBoundsException in BatchWithPartitionData.splitColumnarBatch
#11152 [BUG] LORE dumping consumes too much memory.
#11029 Fix tests failures in subquery_test.py
#11150 [BUG] hive_parquet_write_test.py::test_insert_hive_bucketed_table failure
#11070 [BUG] numpy2 fail fastparquet cases: numpy.dtype size changed
#11136 UnaryPositive expression doesn't extend UnaryExpression
#11122 [BUG] UT MetricRange failed 651070526 was not less than 1.5E8 in spark313
#11119 [BUG] window_function_test.py::test_window_group_limits_fallback_for_row_number fails in a distributed environment
#11023 Fix tests failures in dpp_test.py
#11026 Fix tests failures in map_test.py
#11020 Fix tests failures in grouping_sets_test.py
#11113 [BUG] Update premerge tests for Scala 2.13 to use JDK 17 only
#11027 Fix tests failures in sort_test.py
#10775 [BUG] Issues found by Spark UT Framework on RapidsStringExpressionsSuite
#11033 [BUG] CICD failed a case: cmp_test.py::test_empty_filter[>]
#11103 [BUG] UCX Shuffle With scala.MatchError
#11007 Fix tests failures in array_test.py
#10801 [BUG] JDK17 nightly build after Spark UT Framework is merged
#11019 Fix tests failures in window_function_test.py
#11063 [BUG] op time for GpuCoalesceBatches is more than actual
#11006 Fix test failures in arithmetic_ops_test.py
#10995 Fallback TimeZoneAwareExpression that only support UTC with zoneId instead of timeZone config
#8652 [BUG] array_item test failures on Spark 3.3.x
#11053 [BUG] Build on Databricks 330 fails
#10925 Concat cannot accept no parameter
#10975 [BUG] regex ^.*literal cannot be rewritten as contains(literal) for multiline strings
#10956 [BUG] hive_parquet_write_test.py: test_write_compressed_parquet_into_hive_table integration test failures
#10772 [BUG] Issues found by Spark UT Framework on RapidsDataFrameAggregateSuite
#10986 [BUG] Cast from string to float using hand-picked values failed in CastOpSuite
#10972 Spark 4.0 compile errors
#10794 [BUG] Incorrect cast of string columns containing various infinity notations with trailing spaces
#10964 [BUG] Improve stability of pre-merge jenkinsfile
#10714 Signature changed for PythonUDFRunner.writeUDFs
#10712 [AUDIT] BatchScanExec/DataSourceV2Relation to group splits by join keys if they differ from partition keys
#10673 [AUDIT] Rename plan nodes for PythonMapInArrowExec
#10710 [AUDIT] uncacheTableOrView changed in CommandUtils
#10711 [AUDIT] Match DataSourceV2ScanExecBase changes to groupPartitions method
#10669 Supporting broadcast of multiple filtering keys in DynamicPruning

PRs

#11400 [DOC] update notes in download page for the decompressing gzip issue [skip ci]
#11355 Update changelog for the v24.08 release [skip ci]
#11353 Update download doc for v24.08.1 [skip ci]
#11352 Update version to 24.08.1-SNAPSHOT [skip ci]
#11337 Update changelog for the v24.08 release [skip ci]
#11335 Fix Delta Lake truncation of min/max string values
#11304 Update changelog for v24.08.0 release [skip ci]
#11303 Update rapids JNI and private dependency to 24.08.0
#11296 [DOC] update doc for 2408 release [skip CI]
#11309 [DOC] Update lore doc about the range [skip ci]
#11292 Add work around for string split with empty input.
#11278 Fix formatting of advanced configs doc
#10917 Adopt changes from JNI for casting from float to decimal
#11269 Revert "upgrade ucx to 1.17.0"
#11260 Mitigate intermittent test_buckets and shuffle_smoke_test OOM issue
#11268 Fix degenerate conditional nested loop join detection
#11244 Fix ArrayIndexOutOfBoundsException on join counts with constant join keys
#11259 CI Docker to support integration tests with Rocky OS + jdk17 [skip ci]
#11247 Fix string_test.py errors on Spark 4.0
#11246 Rework Maven Source Plugin Skip
#11149 Rework on substring index
#11236 Remove the unused vars from the version-def CI script
#11237 Fork jvm for maven-source-plugin
#11200 Multi-get_json_object
#11230 Skip test where Delta Lake may not be fully compatible with Spark
#11220 Avoid failing spark bug SPARK-44242 while generate run_dir
#11226 Fix auto merge conflict 11212
#11129 Spark 4: Fix miscellaneous tests including logic, repart, hive_delimited.
#11163 Support MapFromArrays on GPU
#11219 Fix hash_aggregate_test.py to run with ANSI enabled
#11186 from_json Json to Struct Exception Logging
#11180 More accurate estimation for the result serialization time in RapidsShuffleThreadedWriterBase
#11194 Fix ANSI mode test failures in url_test.py
#11202 Fix read from Delta Lake table with name column mapping and missing Parquet IDs
#11185 Fix multi-release jar problem
#11144 Build the Scala2.13 dist jar with JDK17
#11197 Fix class not found error: com/nvidia/spark/rapids/GpuScalar
#11191 Fix dynamic pruning regression in GpuFileSourceScanExec
#10994 Add Spark 4.0.0 Build Profile and Other Supporting Changes
#11192 Append new authorized user to blossom-ci whitelist [skip ci]
#11179 Allow more expressions to be tiered
#11141 Enable some Rapids config in RapidsSQLTestsBaseTrait for Spark UT
#11170 Avoid listFiles or inputFiles on relations with static partitioning
#11159 Drop spark31x shims
#10951 Case when performance improvement: reduce the copy_if_else
#11165 Fix some GpuBroadcastToRowExec by not dropping columns
#11126 Coalesce batches after a logical coalesce operation
#11164 fix the bucketed write error for non-utc cases
#11132 Add deletion vector metrics for low shuffle merge.
#11156 Fix batch splitting for partition column size on row-count-only batches
#11153 Fix LORE dump oom.
#11102 Fix ANSI mode failures in subquery_test.py
#11151 Fix the test error of the bucketed write for the non-utc case
#11147 upgrade ucx to 1.17.0
#11138 Update fastparquet to 2024.5.0 for numpy2 compatibility
#11137 Handle the change for UnaryPositive now extending RuntimeReplaceable
#11094 Add HiveHash support on GPU
#11139 Improve MetricsSuite to allow more gc jitter
#11133 Fix test_window_group_limits_fallback
#11097 Fix miscellaneous integ tests for Spark 4
#11118 Fix issue with DPP and AQE on reused broadcast exchanges
#11043 Dataproc serverless test fixes
#10965 Profiler: Disable collecting async allocation events by default
#11117 Update Scala2.13 premerge CI against JDK17
#11084 Introduce LORE framework.
#11099 Spark 4: Handle ANSI mode in sort_test.py
#11115 Fix match error in RapidsShuffleIterator.scala [scala2.13]
#11088 Support regex patterns with brackets when rewriting to PrefixRange pattern in rlike.
#10950 Add a heuristic to skip second or third agg pass
#11048 Fixed array_tests for Spark 4.0.0
#11049 Fix some cast_tests for Spark 4.0.0
#11066 Replaced spark3xx-common references to spark-shared
#11083 Exclude a case based on JDK version in Spark UT
#10997 Fix some test issues in Spark UT and keep RapidsTestSettings up-to-date
#11073 Disable ANSI mode for window function tests
#11076 Improve the diagnostics for 'conv' fallback explain
#11092 Add GpuBucketingUtils shim to Spark 4.0.0
#11062 fix duplicate counted metrics like op time for GpuCoalesceBatches
#11044 Fixed Failing tests in arithmetic_ops_tests for Spark 4.0.0
#11086 upgrade blossom-ci actions version [skip ci]
#10957 Support bucketing write for GPU
#10979 [FEA] Introduce low shuffle merge.
#10996 Fallback non-UTC TimeZoneAwareExpression with zoneId
#11072 Workaround numpy2 failed fastparquet compatibility tests
#11046 Calculate parallelism to speed up pre-merge CI
#11054 fix flaky array_item test failures
#11051 [FEA] Increase parallelism of deltalake test on databricks
#10993 binary-dedupe changes for Spark 4.0.0
#11060 Add in the ability to fingerprint JSON columns
#11059 Revert "Add in the ability to fingerprint JSON columns (#11002)" [skip ci]
#11039 Concat() Exception bug fix
#11002 Add in the ability to fingerprint JSON columns
#10977 Rewrite multiple literal choice regex to multiple contains in rlike
#11035 Fix auto merge conflict 11034 [skip ci]
#11040 Append new authorized user to blossom-ci whitelist [skip ci]
#11036 Update blossom-ci ACL to secure format [skip ci]
#11032 Fix a hive write test failure for Spark 350
#10998 Improve log to print more lines in build [skip ci]
#10992 Addressing the Named Parameter change in Spark 4.0.0
#10943 Fix Spark UT issues in RapidsDataFrameAggregateSuite
#10963 Add rapids configs to enable GPU running in Spark UT
#10978 More compilation fixes for Spark 4.0.0
#10953 Speed up the integration tests by running them in parallel on the Databricks cluster
#10958 Fix a hive write test failure
#10970 Move Support for RaiseError to a Shim Excluding Spark 4.0.0
#10966 Add default value for REF of premerge jenkinsfile to avoid bad overwritten [skip ci]
#10959 Add new ID to blossom-ci allow list [skip ci]
#10952 Add shims to take care of the signature change for writeUDFs in PythonUDFRunner
#10931 Add Support for Renaming of PythonMapInArrow
#10949 Change dependency version to 24.08.0-SNAPSHOT
#10857 [Spark 4.0] Account for PartitionedFileUtil.splitFiles signature change.
#10912 GpuInsertIntoHiveTable supports parquet format
#10863 [Spark 4.0] Account for CommandUtils.uncacheTableOrView signature change.
#10944 Added Shim for BatchScanExec to Support Spark 4.0
#10946 Unarchive Spark test jar for spark.read(ability)
#10945 Add Support for Multiple Filtering Keys for Subquery Broadcast
#10871 Add classloader diagnostics to initShuffleManager error message
#10933 Fixed Databricks build
#10929 Append new authorized user to blossom-ci whitelist [skip ci]

Older Releases

Changelog of older releases can be found at docs/archives