Change log

Generated on 2020-09-18

Release 0.2

Features

#696 [FEA] run integration tests against SPARK-3.0.1
#455 [FEA] Support UCX shuffle with optimized AQE
#510 [FEA] Investigate libcudf features needed to support struct schema pruning during loads
#541 [FEA] Scala UDF: Support for null Value operands
#542 [FEA] Scala UDF: Support for Date and Time
#499 [FEA] disable any kind of warnings about ExecutedCommandExec not being on the GPU
#540 [FEA] Scala UDF: Support for String replaceFirst()
#340 [FEA] widen the rendered Jekyll pages
#602 [FEA] don't release with any -SNAPSHOT dependencies
#579 [FEA] Auto-merge between branches
#515 [FEA] Write tests for AQE skewed join optimization
#452 [FEA] Update HashSortOptimizerSuite to work with AQE
#454 [FEA] Update GpuCoalesceBatchesSuite to work with AQE enabled
#354 [FEA] Spark 3.1 FileSourceScanExec adds parameter optionalNumCoalescedBuckets
#566 [FEA] Add support for StringSplit with an array index.
#524 [FEA] Add GPU specific metrics to GpuFileSourceScanExec
#494 [FEA] Add some AQE-specific tests to the PySpark test suite
#146 [FEA] Python tests should support running with Adaptive Query Execution enabled
#465 [FEA] Audit: Update script to audit multiple versions of Spark
#488 [FEA] Ability to limit total GPU memory used
#70 [FEA] Support StringSplit
#403 [FEA] Add in support for GetArrayItem
#493 [FEA] Implement shuffle optimization when AQE is enabled
#500 [FEA] Add maven profiles for testing with AQE on or off
#471 [FEA] create a formal process for updating the github-pages branch
#233 [FEA] Audit DataWritingCommandExec
#240 [FEA] Audit Api validation script follow on - Optimize StringToTypeTag
#388 [FEA] Audit WindowExec
#425 [FEA] Add tests for configs in BatchScan Readers
#453 [FEA] Update HashAggregatesSuite to work with AQE
#184 [FEA] Enable NoScalaDoc scalastyle rule
#438 [FEA] Enable StringLPad
#232 [FEA] Audit SortExec
#236 [FEA] Audit ShuffleExchangeExec
#355 [FEA] Support Multiple Spark versions in the same jar
#385 [FEA] Support RangeExec on the GPU
#317 [FEA] Write test wrapper to run SQL queries via pyspark
#235 [FEA] Audit BroadcastExchangeExec
#234 [FEA] Audit BatchScanExec
#238 [FEA] Audit ShuffledHashJoinExec
#237 [FEA] Audit BroadcastHashJoinExec
#316 [FEA] Add some basic Dataframe tests for CoalesceExec
#145 [FEA] Scala tests should support running with Adaptive Query Execution enabled
#231 [FEA] Audit ProjectExec
#229 [FEA] Audit FileSourceScanExec

Performance

#326 [DISCUSS] Shuffle read-side error handling
#601 [FEA] Optimize unnecessary sorts when replacing SortAggregate
#333 [FEA] Better handling of reading lots of small Parquet files
#511 [FEA] Connect shuffle table compression to shuffle exec metrics
#15 [FEA] Multiple threads sharing the same GPU
#272 [DOC] Getting started guide for UCX shuffle

Bugs Fixed

#780 [BUG] Inner Join dropping data with bucketed Table input
#569 [BUG] left_semi_join operation is abnormal and seriously time-consuming
#744 [BUG] TPC-DS query 6 now produces incorrect results.
#718 [BUG] GpuBroadcastHashJoinExec ArrayIndexOutOfBoundsException
#698 [BUG] batch coalesce can fail to appear between columnar shuffle and subsequent columnar operation
#658 [BUG] GpuCoalesceBatches collectTime metric can be underreported
#59 [BUG] enable tests for string literals in a select
#486 [BUG] GpuWindowExec does not implement requiredChildOrdering
#631 [BUG] Rows are dropped when AQE is enabled in some cases
#671 [BUG] Databricks hash_aggregate_test fails trying to canonicalize a WrappedAggFunction
#218 [BUG] Window function COUNT(x) includes null-values, when it shouldn't
#153 [BUG] Incorrect output from partial-only hash aggregates with multiple distincts and non-distinct functions
#656 [BUG] integration tests produce hive metadata files
#607 [BUG] Fix misleading "cannot run on GPU" warnings when AQE is enabled
#630 [BUG] GpuCustomShuffleReader metrics always show zero rows/batches output
#643 [BUG] race condition while registering a buffer and spilling at the same time
#606 [BUG] Multiple scans for same data source with TPC-DS query59 with delta format
#626 [BUG] parquet_test showing leaked memory buffer
#155 [BUG] Incorrect output from averages with filters in partial only mode
#277 [BUG] HashAggregateSuite failure when AQE is enabled
#276 [BUG] GpuCoalesceBatchSuite failure when AQE is enabled
#598 [BUG] Non-deterministic output from MapOutputTracker.getStatistics() with AQE on GPU
#192 [BUG] test_read_merge_schema fails on Databricks
#341 [BUG] Document compression formats for readers/writers
#587 [BUG] Spark 3.1 changed FileScan which means our GpuScans need to be added to shim layer
#362 [BUG] Implement getReaderForRange in the RapidsShuffleManager
#528 [BUG] HashAggregateSuite "Avg Distinct with filter" no longer valid when testing against Spark 3.1.0
#416 [BUG] Fix Spark 3.1.0 integration tests
#556 [BUG] NPE when removing shuffle
#553 [BUG] GpuColumnVector build warnings from raw type access
#492 [BUG] Re-enable AQE integration tests
#275 [BUG] TpchLike query 2 fails when AQE is enabled
#508 [BUG] GpuUnion publishes metrics on the UI that are all 0
#269 Needed to add --conf spark.driver.extraClassPath=
#473 [BUG] PartMerge:countDistinct:sum fails sporadically
#531 [BUG] Temporary RMM workaround needs to be removed
#532 [BUG] NPE when enabling shuffle manager
#525 [BUG] GpuFilterExec reports incorrect nullability of output in some cases
#483 [BUG] Multiple scans for the same parquet data source
#382 [BUG] Spark3.1 StringFallbackSuite regexp_replace null cpu fall back test fails.
#489 [FEA] Fix Spark 3.1 GpuHashJoin since it now requires CodegenSupport
#441 [BUG] test_broadcast_nested_loop_join_special_case fails on databricks
#347 [BUG] Failed to read Parquet file generated by GPU-enabled Spark.
#433 InSet operator produces an error for Strings
#144 [BUG] spark.sql.legacy.parquet.datetimeRebaseModeInWrite is ignored
#323 [BUG] GpuBroadcastNestedLoopJoinExec can fail if there are no columns
#356 [BUG] Integration cache test for BroadcastNestedLoopJoin failure
#280 [BUG] Full Outer Join does not work on nullable keys
#149 [BUG] Spark driver fails to load native libs when running on node without CUDA

PRs

#793 Update Jenkins scripts for release
#798 Fix shims provider override config not being seen by executors
#785 Make shuffle run on CPU if we do a join where we read from bucketed table
#765 Add config to override shims provider class
#759 Add CHANGELOG for release 0.2
#758 Skip the udf test that fails periodically
#752 Fix snapshot plugin jar version in docs
#751 Correct the channel for cudf installation
#754 Filter nulls from joins where possible to improve performance
#732 Add a timeout for RapidsShuffleIterator to prevent jobs to hang infin…
#637 Documentation changes for 0.2 release
#747 Disable udf tests that fail periodically
#745 Revert Null Join Filter
#741 Fix issue with parquet partitioned reads
#733 Remove GPU Types from github
#720 Stop removing GpuCoalesceBatches from non-AQE queries when AQE is enabled
#729 Fix collect time metric in CoalesceBatches
#640 Support running Pandas UDFs on GPUs in Python processes.
#721 Add some more checks to databricks build scripts
#714 Move spark 3.0.1-shims out of snapshot-shims
#711 fix blossom checkout repo
#709 [BUG] fix unexpected indentation issue in blossom yml
#642 Init workflow for blossom-ci
#705 Enable configuration check for cast string to timestamp
#702 Update slack channel for Jenkins builds
#701 fix checkout-ref for automerge
#695 Fix spark-3.0.1 shim to be released
#668 refactor automerge to support merge for protected branch
#687 Include the UDF compiler in the dist jar
#689 Change shims dependency to spark-3.0.1
#677 Use multi-threaded parquet read with small files
#638 Add Parquet-based cache serializer
#613 Enable UCX + AQE
#684 Enable test for literal string values in a select
#686 Remove sorts when replacing sort aggregate if possible
#675 Added TimeAdd
#645 [window] Add GpuWindowExec requiredChildOrdering
#676 fixUpJoinConsistency rule now works when AQE is enabled
#683 Fix issues with canonicalization of WrappedAggFunction
#682 Fix path to start-slave.sh script in docs
#673 Increase build timeouts on nightly and premerge builds
#648 add signoff-check use github actions
#593 Add support for isNaN and datetime related instructions in UDF compiler
#666 [window] Disable GPU for COUNT(exp) queries
#655 Implement AQE unit test for InsertAdaptiveSparkPlan
#614 Fix for aggregation with multiple distinct and non distinct functions
#657 Fix verify build after integration tests are run
#660 Add in neverReplaceExec and several rules for it
#639 BooleanType test shouldn't xfail
#652 Mark UVM config as internal until supported
#653 Move to the cudf-0.15 release
#647 Improve warnings about AQE nodes not supported on GPU
#646 Stop reporting zero metrics for GpuCustomShuffleReader
#644 Small fix for race in catalog where a buffer could get spilled while …
#623 Fix issues with canonicalization
#599 [FEA] changelog generator
#563 cudf and spark version info in artifacts
#633 Fix leak if RebaseHelper throws during Parquet read
#632 Copy function isSearchableType from Spark because signature changed in 3.0.1
#583 Add udf compiler unit tests
#617 Documentation updates for branch 0.2
#616 Add config to reserve GPU memory
#612 [REVIEW] Fix incorrect output from averages with filters in partial only mode
#609 fix minor issues with instructions for building ucx
#611 Added in profile to enable shims for SNAPSHOT releases
#595 Parquet small file reading optimization
#582 fix #579 Auto-merge between branches
#536 Add test for skewed join optimization when AQE is enabled
#603 Fix data size metric always 0 when using RAPIDS shuffle
#600 Fix calculation of string data for compressed batches
#597 Remove the xfail for parquet test_read_merge_schema on Databricks
#591 Add ucx license in NOTICE-binary
#596 Add Spark 3.0.2 to Shim layer
#594 Filter nulls from joins where possible to improve performance.
#590 Move GpuParquetScan/GpuOrcScan into Shim
#588 xfail the tpch spark 3.1.0 tests that fail
#572 Update buffer store to return compressed batches directly, add compression NVTX ranges
#558 Fix unit tests when AQE is enabled
#580 xfail the Spark 3.1.0 integration tests that fail
#565 Minor improvements to TPC-DS benchmarking code
#567 Explicitly disable AQE in one test
#571 Fix Databricks shim layer for GpuFileSourceScanExec and GpuBroadcastExchangeExec
#564 Add GPU decode time metric to scans
#562 getCatalog can be called from the driver, and can return null
#555 Fix build warnings for ColumnViewAccess
#560 Fix databricks build for AQE support
#557 Fix tests failing on Spark 3.1
#547 Add GPU metrics to GpuFileSourceScanExec
#462 Implement optimized AQE support so that exchanges run on GPU where possible
#550 Document Parquet and ORC compression support
#539 Update script to audit multiple Spark versions
#543 Add metrics to GpuUnion operator
#549 Move spark shim properties to top level pom
#497 Add UDF compiler implementations
#487 Add framework for batch compression of shuffle partitions
#544 Add in driverExtraClassPath for standalone mode docs
#546 Fix Spark 3.1.0 shim build error in GpuHashJoin
#537 Use fresh SparkSession when capturing to avoid late capture of previous query
#538 Revert "Temporary workaround for RMM initial pool size bug (#530)"
#517 Add config to limit maximum RMM pool size
#527 Add support for split and getArrayIndex
#534 Fixes bugs around GpuShuffleEnv initialization
#529 [BUG] Degenerate table metas were not getting copied to the heap
#530 Temporary workaround for RMM initial pool size bug
#526 Fix bug with nullability reporting in GpuFilterExec
#521 Fix typo with databricks shim classname SparkShimServiceProvider
#522 Use SQLConf instead of SparkConf when looking up SQL configs
#518 Fix init order issue in GpuShuffleEnv when RAPIDS shuffle configured
#514 Added clarification of RegExpReplace, DateDiff, made descriptive text consistent
#506 Add in basic support for running tpcds like queries
#504 Add ability to ignore tests depending on spark shim version
#503 Remove unused async buffer spill support
#501 disable codegen in 3.1 shim for hash join
#466 Optimize and fix Api validation script
#481 Codeowners
#439 Check a PR has been committed using git signoff
#319 Update partitioning logic in ShuffledBatchRDD
#491 Temporarily ignore AQE integration tests
#490 Fix Spark 3.1.0 build for HashJoin changes
#482 Prevent bad practice in python tests
#485 Show plan in assertion message if test fails
#480 Fix link from README to getting-started.md
#448 Preliminary support for keeping broadcast exchanges on GPU when AQE is enabled
#478 Fall back to CPU for binary as string in parquet
#477 Fix special case joins in broadcast nested loop join
#469 Update HashAggregateSuite to work with AQE
#475 Udf compiler pom followup
#434 Add UDF compiler skeleton
#474 Re-enable noscaladoc check
#461 Fix comments style to pass scala style check
#468 fix broken link
#456 Add closeOnExcept to clean up code that closes resources only on exceptions
#464 Turn off noscaladoc rule until codebase is fixed
#449 Enforce NoScalaDoc rule in scalastyle checks
#450 Enable scalastyle for shuffle plugin
#451 Databricks remove unneeded files and fix build to not fail on rm when file missing
#442 Shim layer support for Spark 3.0.0 Databricks
#447 Add scalastyle plugin to shim module
#426 Update BufferMeta to support multiple codec buffers per table
#440 Run mortgage test both with AQE on and off
#445 Added in StringRPad and StringLPad
#422 Documentation updates
#437 Fix bug with InSet and Strings
#435 Add in checks for Parquet LEGACY date/time rebase
#432 Fix batch use-after-close in partitioning, shuffle env init
#423 Fix duplicates includes in assembly jar
#418 CI Add unit tests running for Spark 3.0.1
#421 Make it easier to run TPCxBB benchmarks from spark shell
#413 Fix download link
#414 Shim Layer to support multiple Spark versions
#406 Update cast handling to deal with new libcudf casting limitations
#405 Change slave->worker
#395 Databricks doc updates
#401 Extended the FAQ
#398 Add tests for GpuPartition
#352 Change spark tgz package name
#397 Fix small bug in ShuffleBufferCatalog.hasActiveShuffle
#286 [REVIEW] Updated join tests for cache
#393 Contributor license agreement
#389 Added in support for RangeExec
#390 Ucx getting started
#391 Hide slack channel in Jenkins scripts
#387 Remove the term whitelist
#365 [REVIEW] Timesub tests
#383 Test utility to compare SQL query results between CPU and GPU
#380 Fix databricks notebook link
#378 Added in FAQ and fixed spelling
#377 Update heading in configs.md
#373 Modifying branch name to conform with rapidsai branch name change
#376 Add our session extension correctly if there are other extensions configured
#374 Fix rat issue for notebooks
#364 Update Databricks patch for changes to GpuSortMergeJoin
#371 fix typo and use regional bucket per GCP's update
#359 Karthik changes
#353 Fix broadcast nested loop join for the no column case
#313 Additional tests for broadcast hash join
#342 Implement build-side rules for shuffle hash join
#349 Updated join code to treat null equality properly
#335 Integration tests on spark 3.0.1-SNAPSHOT & 3.1.0-SNAPSHOT
#346 Update the Title Header for Fine Tuning
#344 Fix small typo in readme
#331 Adds iterator and client unit tests, and prepares for more fetch failure handling
#337 Fix Scala compile phase to allow Java classes referencing Scala classes
#332 Match GPU overwritten functions with SQL functions from FunctionRegistry
#339 Fix databricks build
#338 Move GpuPartitioning to a separate file
#310 Update release Jenkinsfile for Databricks
#330 Hide private info in Jenkins scripts
#324 Add in basic support for GpuCartesianProductExec
#328 Enable slack notification for Databricks build
#321 update databricks patch for GpuBroadcastNestedLoopJoinExec
#322 Add oss.sonatype.org to download the cudf jar
#320 Don't mount passwd/group to the container
#258 Enable running TPCH tests with AQE enabled
#318 Build docker image with Dockerfile
#309 Update databricks patch to latest changes
#312 Trigger branch-0.2 integration test
#307 [Jenkins] Update the release script and Jenkinsfile
#304 [DOC][Minor] Fix typo in spark config name.
#303 Update compatibility doc for -0.0 issues
#301 Add info about branches in README.md
#296 Added in basic support for broadcast nested loop join
#297 Databricks CI improvements and support runtime env parameter to xfail certain tests
#292 Move artifacts version in version-def.sh
#254 Cleanup QA tests
#289 Clean up GpuCollectLimitMeta and add in metrics
#287 Add in support for right join and fix issues build right
#273 Added releases to the README.md
#285 modify run_pyspark_from_build.sh to be bash 3 friendly
#281 Add in support for Full Outer Join on non-null keys
#274 Add RapidsDiskStore tests
#259 Add RapidsHostMemoryStore tests
#282 Update Databricks patch for 0.2 branch
#261 Add conditional xfail test for DISTINCT aggregates with NaN
#263 More time ops
#256 Remove special cases for contains, startsWith, and endsWith
#253 Remove GpuAttributeReference and GpuSortOrder
#271 Update the versions for 0.2.0 properly for the databricks build
#162 Integration tests for corner cases in window functions.
#264 Add a local mvn repo for nightly pipeline
#262 Refer to branch-0.2
#255 Revert change to make dependencies of shaded jar optional
#257 Fix link to RAPIDS cudf in index.md
#252 Update to 0.2.0-SNAPSHOT and cudf-0.15-SNAPSHOT

Release 0.1

Features

#74 [FEA] Support ToUnixTimestamp
#21 [FEA] NormalizeNansAndZeros
#105 [FEA] integration tests for equi-joins

Bugs Fixed

#116 [BUG] calling replace with a NULL throws an exception
#168 [BUG] GpuUnitTests Date tests leak column vectors
#209 [BUG] Developers section in pom need to be updated
#204 [BUG] Code coverage docs are out of date
#154 [BUG] Incorrect output from partial-only averages with nulls
#61 [BUG] Cannot disable Parquet, ORC, CSV reading when using FileSourceScanExec

PRs

#249 Compatability -> Compatibility
#247 Add index.md for default doc page, fix table formatting for configs
#241 Let default branch to master per the release rule
#177 Fixed leaks in unit test and use ColumnarBatch for testing
#243 Jenkins file for Databricks release
#225 Make internal project dependencies optional for shaded artifact
#242 Add site pages
#221 Databricks Build Support
#215 Remove CudfColumnVector
#213 Add RapidsDeviceMemoryStore tests
#214 [REVIEW] Test failure to pass Attribute as GpuAttribute
#211 Add project leads to pom developer list
#210 Updated coverage docs
#195 Support public release for plugin jar
#208 Remove unneeded comment from pom.xml
#191 WindowExec handle different spark distributions
#181 Remove INCOMPAT for NormalizeNanAndZero, KnownFloatingPointNormalized
#196 Update Spark dependency to the released 3.0.0 artifacts
#206 Change groupID to 'com.nvidia' in IT scripts
#202 Fixed issue for contains when searching for an empty string
#201 Fix name of scan
#200 Fix issue with GpuAttributeReference not overriding references
#197 Fix metrics for writes
#186 Fixed issue with nullability on concat
#193 Add RapidsBufferCatalog tests
#188 rebrand to com.nvidia instead of ai.rapids
#189 Handle AggregateExpression having resultIds parameter instead of a single resultId
#190 FileSourceScanExec can have logicalRelation parameter on some distributions
#185 Update type of parameter of GpuExpandExec to make it consistent
#172 Merge qa test to integration test
#180 Add MetaUtils unit tests
#171 Cleanup scaladoc warnings about missing links
#176 Updated join tests to cover more data.
#169 Remove dependency on shaded Spark artifact
#174 Added in fallback tests
#165 Move input metadata tests to pyspark
#173 Fix setting local mode for tests
#160 Integration tests for normalizing NaN/zeroes.
#163 Ignore the order locally for repartition tests
#157 Add partial and final only hash aggregate tests and fix nulls corner case for Average
#159 Add integration tests for joins
#158 Orc merge schema fallback and FileScan format configs
#164 Fix compiler warnings
#152 Moved cudf to 0.14 for CI
#151 Switch CICD pipelines to Github