
Spark 2.3 Merge #97

Open · wants to merge 142 commits into base: snappy/branch-2.3
Conversation

@ymahajan ymahajan commented Mar 8, 2018

What changes were proposed in this pull request?

Spark 2.3 merge

How was this patch tested?

Precheckin

Other PRs

Hemant Bhanawat and others added 30 commits February 21, 2018 14:17
This commit adds support for a pluggable cluster manager, and also allows a cluster manager to clean up tasks without taking the parent process down.

To plug in a new external cluster manager, implement the ExternalClusterManager trait. It returns the task scheduler and backend scheduler that SparkContext will use to schedule tasks. An external cluster manager is registered using the java.util.ServiceLoader mechanism (the same mechanism used to register data sources like parquet, json, jdbc, etc.), which allows implementations of the ExternalClusterManager interface to be auto-loaded.
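The discovery pattern described above can be sketched as follows. This is a minimal, self-contained illustration of the java.util.ServiceLoader mechanism; the interface and method names are hypothetical, not Spark's actual ExternalClusterManager API.

```java
import java.util.ServiceLoader;

public class ClusterManagerLookup {

    // A pluggable cluster manager advertises which master URLs it handles.
    public interface PluggableClusterManager {
        boolean canCreate(String masterURL);
    }

    // Scan the classpath for registered implementations (listed in
    // META-INF/services provider-configuration files) and pick the first
    // one that accepts the given master URL.
    public static PluggableClusterManager find(String masterURL) {
        for (PluggableClusterManager cm
                : ServiceLoader.load(PluggableClusterManager.class)) {
            if (cm.canCreate(masterURL)) {
                return cm;
            }
        }
        return null; // no registered manager matches this master URL
    }

    public static void main(String[] args) {
        // Nothing is registered in this self-contained sketch, so the
        // lookup falls through and returns null.
        System.out.println(find("snappydata://localhost:10334"));
    }
}
```

In the real mechanism, each implementation jar ships a `META-INF/services` file naming its implementation class, so adding a manager to the classpath is enough to make it discoverable.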

Currently, when a driver fails, executors exit using System.exit. This does not bode well for cluster managers that would like to reuse the parent process of an executor. Hence:

  1. System.exit is moved to a function that can be overridden in subclasses of CoarseGrainedExecutorBackend.
  2. Functionality is added to kill all the running tasks in an executor.

ExternalClusterManagerSuite.scala was added to test this patch.
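The "overridable exit" idea above can be sketched in a few lines. The class and method names here are illustrative stand-ins, not Spark's actual CoarseGrainedExecutorBackend API.

```java
public class ExecutorBackendSketch {

    static class BaseBackend {
        // Default behavior: terminate the JVM when the driver is lost.
        protected void exitExecutor(int code) {
            System.exit(code);
        }
    }

    // A cluster manager that embeds executors in a long-lived parent
    // process overrides the hook to clean up instead of exiting.
    static class EmbeddedBackend extends BaseBackend {
        int lastExitCode = -1;

        @Override
        protected void exitExecutor(int code) {
            lastExitCode = code; // record and clean up; keep the JVM alive
        }
    }

    public static void main(String[] args) {
        EmbeddedBackend backend = new EmbeddedBackend();
        backend.exitExecutor(1); // driver lost; the parent process survives
        System.out.println(backend.lastExitCode); // 1
    }
}
```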

Author: Hemant Bhanawat <[email protected]>

Closes apache#11723 from hbhanawat/pluggableScheduler.

Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
	core/src/main/scala/org/apache/spark/executor/Executor.scala
	core/src/main/scala/org/apache/spark/scheduler/ExternalClusterManager.scala
	core/src/test/resources/META-INF/services/org.apache.spark.scheduler.ExternalClusterManager
	core/src/test/scala/org/apache/spark/scheduler/ExternalClusterManagerSuite.scala
	dev/.rat-excludes
…se newly added ExternalClusterManager

With the addition of the ExternalClusterManager (ECM) interface in PR apache#11723, any cluster manager can now be integrated with Spark. It was suggested in the ExternalClusterManager PR that one of the existing cluster managers should start using the new interface to ensure that the API is correct. Ideally, all the existing cluster managers should eventually use the ECM interface, but as a first step YARN will now use the ECM interface. This PR refactors YARN code from the SparkContext.createTaskScheduler function into YarnClusterManager, which implements the ECM interface.

Since this is refactoring, no new tests have been added. Existing tests have been run. Basic manual testing with YARN was done too.

Author: Hemant Bhanawat <[email protected]>

Closes apache#12641 from hbhanawat/yarnClusterMgr.

Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala
Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala
…ocation is alive

Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
…ction

  * Increase visibility of complex type write methods (which perform code generation) in GenerateUnsafeProjection
    to allow use from outside.
  * Allow internal types (ArrayData, MapData, InternalRow) directly in Row->InternalRow conversions
    in CatalystTypeConverters for complex types.

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
Fixing TakeOrderedAndProject, which contains a Sequence of Expressions inside an Option; the Spark plan
expression transformation used to skip it because it did not handle a Sequence inside an Option.

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
- Adding a non-secure version of random UUID generation, adapted from Android's UUID.java
- Use the same for file names in DiskBlockManager, Utils methods, and WriteAheadLogBackedBlockRDD
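A non-secure version-4 UUID of the kind described above can be built from any java.util.Random by setting the version and variant bits by hand. This is an independent sketch of the idea, not the actual code adapted from Android's UUID.java.

```java
import java.util.Random;
import java.util.UUID;

public class FastUuid {

    // Build a version-4 (random) UUID without touching SecureRandom,
    // trading cryptographic strength for speed — acceptable for
    // internal file names, not for security tokens.
    public static UUID nonSecureRandomUUID(Random rnd) {
        long msb = rnd.nextLong();
        long lsb = rnd.nextLong();
        // Force version 4 into bits 12..15 of the most-significant half.
        msb = (msb & ~0xF000L) | 0x4000L;
        // Force the IETF variant (binary 10) into the top two bits of
        // the least-significant half.
        lsb = (lsb & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID id = nonSecureRandomUUID(new Random(42));
        // Reports version 4 and the IETF variant (2), like UUID.randomUUID().
        System.out.println(id.version() + " " + id.variant());
    }
}
```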

Conflicts:
	core/src/main/scala/org/apache/spark/util/Utils.scala

Conflicts:
	core/src/main/scala/org/apache/spark/util/Utils.scala
	streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
* Added a method to bump up the expr id counter by a given number, so as to reserve a block of ExprIDs
* Optimized the declarative aggregate function to have predictable input buffer aggregate attribute references by using reservation in the ExprIDs being generated
* Changes to minimize the query plan size for bootstrap, and some more optimizations that aid in perf improvement
* Fixed scala style failures
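The reservation idea in the first bullet can be sketched with an atomic counter: claim a contiguous block of IDs up front so later allocations are predictable. The names below are illustrative, not Spark's internal ExprId machinery.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ExprIdReservation {
    private static final AtomicLong curId = new AtomicLong(0);

    // Atomically claim n consecutive IDs; returns the first ID of the
    // reserved block, so the caller owns [base, base + n).
    public static long reserve(int n) {
        return curId.getAndAdd(n);
    }

    // Ordinary one-at-a-time allocation continues after any reservations.
    public static long nextId() {
        return curId.getAndIncrement();
    }

    public static void main(String[] args) {
        long base = reserve(100); // IDs base .. base+99 are ours
        long next = nextId();     // allocation resumes past the block
        System.out.println(base + " " + next); // 0 100 when run fresh
    }
}
```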

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
- Adding gradle build scripts with gradle wrapper invocation for all projects/subprojects
- Added product target to pack snappy-spark distribution (like dev/make-distribution.sh)
- Changes to make it compatible with top-level SnappyData build
- Add SnappyData modification headers to remaining modified files
- Fixed compilation and few test issues
- Allow registration of output streams on active StreamingContext
- Added flag to DStream.initialize to allow initialization of newly
  added output streams with zeroTime
- generatedRDDs made thread-safe
Allow for non-integral values as executionId (SnappyData uses full DistributedMember representation)
- in addition to properties with "spark." prefix, also accept "snappydata."
  prefix in system properties, spark submit/shell

Conflicts:
	launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java
- Make AbstractDataType.classTag lazy to optimize DataType object
  creation (e.g. DecimalType)
- Allow for more than 16 bytes in serialized Decimal objects
  (precision has been increased to 127 in SnappyData)
- Minor updates to build.gradle files, including a proper dependency for generateBuildInfo
  to avoid re-running it (and thus all dependent projects) every time
- Add modification headers for touched files (and remove from a couple of old
    ones which have since been reverted)
- Update dependencies as per the latest merge from branch-2.0
- Fix scalastyle errors
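The Decimal bullet above is easy to verify: the unscaled value of a 38-digit decimal (Spark's default maximum precision) fits in a 16-byte two's-complement encoding, but a 127-digit value does not. A quick check, assuming a plain BigInteger byte encoding of the unscaled value:

```java
import java.math.BigInteger;

public class DecimalWidth {

    // Bytes needed for the two's-complement encoding of the largest
    // unscaled value at the given decimal precision (all nines).
    static int serializedBytes(int precision) {
        return BigInteger.TEN.pow(precision)
                .subtract(BigInteger.ONE)
                .toByteArray().length;
    }

    public static void main(String[] args) {
        // Precision 38 fits in exactly 16 bytes; precision 127 overflows
        // any fixed 16-byte layout, hence the variable-length change.
        System.out.println(serializedBytes(38) + " " + serializedBytes(127));
    }
}
```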

Conflicts:
	sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java
	sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
* Using the Scala collections

The compiler was not able to differentiate between the Scala and Java APIs.

* Minor changes for the Snappy implementation of the executor

This is required because we need a classloader that also looks into the snappy store for classes.

* Revert "Using the scala collections"

This reverts commit c2ab0c5.

Conflicts:
	core/src/main/scala/org/apache/spark/executor/Executor.scala
Conflicts:
	core/src/main/scala/org/apache/spark/Partitioner.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
 - For all cases of implicit casts, convert to date or timestamp values
   instead of string when one side is a string.
 - Likewise, when one side is a timestamp and the other a date, both were being
   converted to string; now convert the date to a timestamp.
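The coercion direction described above, sketched with java.time types: widen the date side to a timestamp (midnight) and compare in the timestamp domain, rather than turning both sides into strings and comparing lexically. Illustrative only; the real change lives in Catalyst's TypeCoercion rules.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class DateTimestampCoercion {

    // Promote a date to a timestamp at midnight UTC so that a
    // date-vs-timestamp comparison happens on one common type.
    static Instant dateToTimestamp(LocalDate d) {
        return d.atStartOfDay(ZoneOffset.UTC).toInstant();
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2018, 3, 8);
        Instant ts = Instant.parse("2018-03-08T10:15:00Z");
        // Midnight of the same day precedes 10:15 on that day.
        System.out.println(dateToTimestamp(date).isBefore(ts)); // true
    }
}
```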

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 - Added an aggBufferAttributeForGroup to aggregates, used to avoid nullability
   checks in generated code for aggregate buffers used in HashAggregateExec (if the aggregate
       is on zero rows, then there will be no row in the map); an accompanying "initialValuesForGroup"
   was added for initial aggregation buffer values
 - Use OpenHashMap in DictionaryEncoding, which is faster than a normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - Minor correction in a string in HiveUtils
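The dictionary-encoding idea in the second bullet, reduced to its core: map each distinct value to a small integer code via a hash map, with a clear() for reuse across batches. This toy version uses java.util.HashMap; the actual change swaps in Spark's OpenHashMap for speed.

```java
import java.util.HashMap;
import java.util.Map;

public class DictionaryEncoder {
    private final Map<String, Integer> dict = new HashMap<>();

    // Map each value to an integer code, assigning codes on first sight,
    // so repeated values compress to small fixed-width codes.
    public int encode(String value) {
        return dict.computeIfAbsent(value, v -> dict.size());
    }

    // Reset for reuse across column batches, mirroring the added clear()
    // on OpenHashMap/OpenHashSet.
    public void clear() {
        dict.clear();
    }

    public static void main(String[] args) {
        DictionaryEncoder enc = new DictionaryEncoder();
        StringBuilder codes = new StringBuilder();
        for (String s : new String[] {"a", "b", "a", "c", "b"}) {
            codes.append(enc.encode(s));
        }
        System.out.println(codes); // 01021
    }
}
```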

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
Used by the new benchmark from the PR, adapted for SnappyData's vectorized implementation.

Build updated to set testOutput and other variables instead of appending to existing values
(appending caused a double append, with both the SnappyData build and this build adding entries for its tests).
- don't apply numBuckets in Shuffle partitioning since Shuffle cannot create
  a compatible partitioning with matching numBuckets (only numPartitions)
- check numBuckets too in HashPartitioning compatibility
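The compatibility rule above can be modeled simply: two hash partitionings match only if their expressions, numPartitions, and now also numBuckets agree, and a shuffle can only ever produce numBuckets = 0. This is a hypothetical simplified model, not Spark's Partitioning classes.

```java
import java.util.List;

public class HashPartitioningCompat {
    final List<String> exprs;
    final int numPartitions;
    final int numBuckets; // 0 means "not bucketed"

    HashPartitioningCompat(List<String> exprs, int numPartitions, int numBuckets) {
        this.exprs = exprs;
        this.numPartitions = numPartitions;
        this.numBuckets = numBuckets;
    }

    boolean compatibleWith(HashPartitioningCompat other) {
        return exprs.equals(other.exprs)
            && numPartitions == other.numPartitions
            && numBuckets == other.numBuckets; // the added numBuckets check
    }

    public static void main(String[] args) {
        HashPartitioningCompat bucketed =
            new HashPartitioningCompat(List.of("col1"), 8, 8);
        HashPartitioningCompat shuffled =
            new HashPartitioningCompat(List.of("col1"), 8, 0);
        // Same expressions and partition count, but a shuffle cannot
        // reproduce the bucketing, so the sides are not compatible.
        System.out.println(bucketed.compatibleWith(shuffled)); // false
    }
}
```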

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala
rishitesh and others added 3 commits March 7, 2018 23:59
Use a future to enforce a timeout (2x the configured value) in netty RPC transfers,
after which the channel will be closed and the transfer will fail.
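The timeout-enforcement pattern can be sketched with a CompletableFuture: wait up to twice the configured timeout, and on expiry close the channel so the request fails fast instead of hanging. Names and the string results are illustrative, not the NettyRpcEnv code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RpcTimeoutSketch {

    // Block on the transfer for at most 2x the configured timeout;
    // on expiry, close the channel and report failure.
    public static String awaitOrClose(CompletableFuture<String> transfer,
                                      long configuredMillis,
                                      Runnable closeChannel) {
        try {
            return transfer.get(2 * configuredMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            closeChannel.run(); // give up on the stuck channel
            return "failed: timeout";
        } catch (InterruptedException | ExecutionException e) {
            return "failed: " + e;
        }
    }

    public static void main(String[] args) {
        // A future that never completes stands in for a hung connection.
        CompletableFuture<String> hung = new CompletableFuture<>();
        String result = awaitOrClose(hung, 25,
            () -> System.out.println("channel closed"));
        System.out.println(result);
    }
}
```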
Conflicts:
	core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala
ashetkar pushed a commit that referenced this pull request Apr 5, 2018
* Parse results of minikube status more rigorously

Prior code assumes the minikubeVM status line is always the first row output
from minikube status, and it is not when the version upgrade notifier prints
an upgrade suggestion message.

* Also filter ip response to expected rows
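The more rigorous parse can be sketched by selecting the status row by its key rather than by position, so a leading upgrade-suggestion banner no longer breaks it. The sample output lines below are illustrative.

```java
import java.util.Arrays;
import java.util.Optional;

public class MinikubeStatusParse {

    // Find the VM status by its "minikubeVM:" key anywhere in the output,
    // instead of assuming it is the first line.
    static Optional<String> vmStatus(String output) {
        return Arrays.stream(output.split("\n"))
            .filter(line -> line.startsWith("minikubeVM:"))
            .map(line -> line.substring("minikubeVM:".length()).trim())
            .findFirst();
    }

    public static void main(String[] args) {
        // An upgrade notice printed before the status line must not
        // confuse the parser.
        String output = "There is a newer version of minikube available.\n"
            + "minikubeVM: Running\n"
            + "localkube: Running";
        System.out.println(vmStatus(output).orElse("unknown")); // Running
    }
}
```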
ymahajan and others added 14 commits April 5, 2018 17:18
reverting flag check optimization in Logging to be compatible with upstream Spark

Conflicts:
	core/src/main/scala/org/apache/spark/MapOutputTracker.scala
	core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala
	core/src/main/scala/org/apache/spark/internal/Logging.scala
	core/src/main/scala/org/apache/spark/storage/BlockManager.scala
	core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
…kSession

This is to allow override by SnappySession extensions.

Conflicts:
	external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
	sql/core/src/test/scala/org/apache/spark/sql/DatasetSerializerRegistratorSuite.scala
	sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala
also added MemoryMode in MemoryPool warning message

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
avoid materializing it immediately (for point queries that won't use it)

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
The standalone cluster should support unique application names. Since names are user-visible and easy to track, users can write scripts to kill applications by name.
Also added support for killing Spark applications by name (case-insensitive).
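The case-insensitive matching can be sketched as a filter over running application names; the list and method names are hypothetical, not the master's actual data model.

```java
import java.util.List;
import java.util.stream.Collectors;

public class KillByName {

    // Select every running application whose name matches, ignoring case,
    // so a kill request for "etl-job" also finds "ETL-Job".
    static List<String> appsToKill(List<String> runningApps, String name) {
        return runningApps.stream()
            .filter(app -> app.equalsIgnoreCase(name))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> apps = List.of("ETL-Job", "Analytics");
        System.out.println(appsToKill(apps, "etl-job")); // [ETL-Job]
    }
}
```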
Conflicts:
	core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
Handled join order in the optimization phase. Also removed custom changes in HashPartitioning: we no longer store bucket information in HashPartitioning. Instead, based on the flag "linkPartitionToBucket", we determine the number of partitions to be either numBuckets or the number of cores assigned to the executor.
Reverted changes related to numBuckets in Snappy Spark.
Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala

7 participants