forked from apache/spark
Spark 2.3 Merge #97
Open · ymahajan wants to merge 142 commits into snappy/branch-2.3 from spark_2.3_merge
Conversation
This commit adds support for a pluggable cluster manager, and also allows a cluster manager to clean up tasks without taking the parent process down. To plug in a new external cluster manager, the ExternalClusterManager trait should be implemented. It returns the task scheduler and scheduler backend that will be used by SparkContext to schedule tasks. An external cluster manager is registered using the java.util.ServiceLoader mechanism (the same mechanism is used to register data sources like parquet, json, jdbc etc.), which allows auto-loading implementations of the ExternalClusterManager interface. Currently, when a driver fails, executors exit using System.exit. This does not bode well for cluster managers that would like to reuse the parent process of an executor. Hence: 1. Moved System.exit to a function that can be overridden in subclasses of CoarseGrainedExecutorBackend. 2. Added functionality for killing all the running tasks in an executor. ExternalClusterManagerSuite.scala was added to test this patch. Author: Hemant Bhanawat <[email protected]> Closes apache#11723 from hbhanawat/pluggableScheduler. Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala core/src/main/scala/org/apache/spark/executor/Executor.scala core/src/main/scala/org/apache/spark/scheduler/ExternalClusterManager.scala core/src/test/resources/META-INF/services/org.apache.spark.scheduler.ExternalClusterManager core/src/test/scala/org/apache/spark/scheduler/ExternalClusterManagerSuite.scala dev/.rat-excludes
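A minimal sketch of the ServiceLoader-based registration described above. The `ClusterManagerPlugin` trait here is a hypothetical, simplified stand-in for Spark's `ExternalClusterManager` (the real trait lives in `org.apache.spark.scheduler` and returns a TaskScheduler and SchedulerBackend), so only the loading mechanism is illustrated:

```scala
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Hypothetical stand-in for org.apache.spark.scheduler.ExternalClusterManager.
trait ClusterManagerPlugin {
  def canCreate(masterURL: String): Boolean
  def createTaskScheduler(masterURL: String): String // simplified scheduler handle
}

// An external implementation that a cluster manager would ship in its own jar.
class MyClusterManager extends ClusterManagerPlugin {
  override def canCreate(masterURL: String): Boolean = masterURL.startsWith("mycluster://")
  override def createTaskScheduler(masterURL: String): String = s"scheduler-for-$masterURL"
}

object PluginLoader {
  // Implementations are registered by listing their class name in a resource file
  // META-INF/services/<fully.qualified.ClusterManagerPlugin>, then auto-loaded here.
  def find(masterURL: String): Option[ClusterManagerPlugin] =
    ServiceLoader.load(classOf[ClusterManagerPlugin]).asScala
      .find(_.canCreate(masterURL))
}
```

This mirrors how data source implementations are discovered: the driver only needs the master URL, and whichever registered plugin claims it gets to build the scheduler.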
…se newly added ExternalClusterManager With the addition of the ExternalClusterManager (ECM) interface in PR apache#11723, any cluster manager can now be integrated with Spark. It was suggested in the ExternalClusterManager PR that one of the existing cluster managers should start using the new interface to ensure that the API is correct. Ideally, all the existing cluster managers should eventually use the ECM interface, but as a first step YARN will now use it. This PR refactors the YARN code from the SparkContext.createTaskScheduler function into a YarnClusterManager that implements the ECM interface. Since this is a refactoring, no new tests have been added. Existing tests have been run, and basic manual testing with YARN was done too. Author: Hemant Bhanawat <[email protected]> Closes apache#12641 from hbhanawat/yarnClusterMgr. Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala
…alType (#38) Created a jira for -Pspark precheckin SNAP-914 (https://jira.snappydata.io/browse/SNAP-914)
…ocation is alive Conflicts: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
Conflicts: python/pyspark/shell.py
…ction * increase visibility of complex type write methods (that perform code generation) in GenerateUnsafeProjection to allow using from outside * allow for internal types (ArrayData, MapData, InternalRow) directly in Row->InternalRow conversions in CatalystTypeConverters for complex types Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
Fixing TakeOrderedAndProject, which contains a Sequence of Expressions inside an Option; the Spark plan expression transformation used to skip it because it did not handle a Sequence wrapped in an Option. Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
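A self-contained illustration of the traversal gap being fixed, with String standing in for Catalyst's Expression (the real change is in how QueryPlan maps a rewrite over plan arguments):

```scala
// Recursively apply a rewrite to plan arguments; the Option case was the one
// previously missing, so an Option[Seq[Expression]] was silently skipped.
def mapArgument(arg: Any)(rewrite: String => String): Any = arg match {
  case e: String      => rewrite(e)                        // stand-in for Expression
  case seq: Seq[_]    => seq.map(mapArgument(_)(rewrite))
  case opt: Option[_] => opt.map(mapArgument(_)(rewrite))  // handles Option[Seq[...]]
  case other          => other
}

// Example: arguments of a TakeOrderedAndProject-like node, including an
// optional sequence of sort expressions.
val args: Seq[Any] = Seq("LIMIT=10", Some(Seq("COL1 ASC", "COL2 DESC")))
val rewritten = args.map(a => mapArgument(a)(_.toLowerCase))
```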
- adding a non-secure version of random UUID generation adapted from Android's UUID.java - use the same for file names in DiskBlockManager, Utils methods, WriteAheadLogBackedBlockRDD Conflicts: core/src/main/scala/org/apache/spark/util/Utils.scala streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
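A minimal sketch of building a version-4 UUID from a plain java.util.Random, in the spirit of the change described above (the helper and its call site names are illustrative, not the fork's actual code):

```scala
import java.util.{Random, UUID}

// Build a random (version 4, IETF variant) UUID from a non-secure Random,
// trading cryptographic strength for cheaper file-name generation.
def nonSecureRandomUUID(rnd: Random = new Random()): UUID = {
  val bytes = new Array[Byte](16)
  rnd.nextBytes(bytes)
  bytes(6) = ((bytes(6) & 0x0f) | 0x40).toByte // version 4
  bytes(8) = ((bytes(8) & 0x3f) | 0x80).toByte // IETF variant
  val msb = bytes.take(8).foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xff))
  val lsb = bytes.drop(8).foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xff))
  new UUID(msb, lsb)
}

// Example use for a temporary block file name, similar in shape to DiskBlockManager's usage:
val tempFileName = s"temp_local_${nonSecureRandomUUID()}"
```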
* Added a method to bump up the expr id counter by a given number, so as to reserve the ExprIDs * Optimizing the Declarative aggregate function to have predictable input buffer aggregate attribute references by using reservation of the ExprIDs being generated * Changes to minimize the query plan size for bootstrap and some more optimizations which aid in perf improvement * Fixed scala style failures Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
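A rough sketch of the "bump the counter to reserve IDs" idea, assuming an AtomicLong-backed generator like the one behind Catalyst's expression IDs; the object and method names here are illustrative only:

```scala
import java.util.concurrent.atomic.AtomicLong

object ExprIds {
  private val curId = new AtomicLong(0L)

  // Normal allocation of a single expression id.
  def newId(): Long = curId.getAndIncrement()

  // Reserve `count` consecutive ids in one shot and return the first, so a caller
  // can hand out predictable ids (e.g. for aggregate buffer attributes).
  def reserveIds(count: Long): Long = curId.getAndAdd(count)
}

// Example: reserve 4 ids and assign them deterministically.
val first = ExprIds.reserveIds(4)
val bufferIds = (first until first + 4).toSeq
```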
- Adding gradle build scripts with gradle wrapper invocation for all projects/subprojects - Added a product target to pack the snappy-spark distribution (like dev/make-distribution.sh) - Changes to make it compatible with the top-level SnappyData build - Add SnappyData modification headers to remaining modified files - Fixed compilation and a few test issues
- Allow registration of output streams on active StreamingContext - Added flag to DStream.initialize to allow initialization of newly added output streams with zeroTime - generatedRDDs made thread-safe
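A minimal sketch of the thread-safety part of the change above, using a ConcurrentHashMap for a per-batch RDD map (types are simplified; the real generatedRDDs field in DStream keys RDDs by Time):

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Simplified stand-in for DStream.generatedRDDs: batch time (ms) -> RDD handle.
class GeneratedRdds[R] {
  private val rdds = new ConcurrentHashMap[Long, R]()

  def put(timeMs: Long, rdd: R): Unit = rdds.put(timeMs, rdd)
  def get(timeMs: Long): Option[R] = Option(rdds.get(timeMs))

  // Drop entries older than the given time; safe under concurrent access because
  // ConcurrentHashMap iteration is weakly consistent.
  def clearOlderThan(timeMs: Long): Unit =
    rdds.keySet().asScala.filter(_ < timeMs).foreach(rdds.remove)
}
```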
Allow for non-integral values as executionId (SnappyData uses full DistributedMember representation)
- in addition to properties with the "spark." prefix, also accept the "snappydata." prefix in system properties and in spark submit/shell Conflicts: launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java
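A small sketch of the prefix handling described above, collecting configuration from system properties; the prefix list and helper names are assumptions based on the commit text:

```scala
// Prefixes treated as runtime configuration when scanning system properties
// or spark-submit/shell arguments.
val confPrefixes = Seq("spark.", "snappydata.")

def isConfKey(key: String): Boolean = confPrefixes.exists(key.startsWith)

// Collect matching system properties into a config map.
def loadConfFromSystemProps(): Map[String, String] =
  sys.props.toMap.filter { case (k, _) => isConfKey(k) }
```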
- make AbstractDataType.classTag lazy to optimize DataType object creation (e.g. DecimalType) - allow for more than 16 bytes in serialized Decimal objects (precision has been increased to 127 in SnappyData) - minor updates to build.gradle files, including a proper dependency for generateBuildInfo to avoid re-running it (and thus all dependent projects) every time - add modification headers for touched files (and remove from a couple of old ones which have since been reverted) - updating dependencies as per the latest merge from branch-2.0 - fix scalastyle errors Conflicts: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
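An illustrative sketch of the "make classTag lazy" optimization: the ClassTag is only computed if a caller actually needs it, so constructing many type objects (e.g. DecimalType instances) does not pay the cost up front. The classes here are hypothetical stand-ins, not the AbstractDataType hierarchy itself:

```scala
import scala.reflect.{classTag, ClassTag}

// Hypothetical data type hierarchy; the real change is in AbstractDataType.
abstract class MyDataType {
  // lazy: computed on first use instead of at every object construction.
  lazy val valueClassTag: ClassTag[_] = computeClassTag()
  protected def computeClassTag(): ClassTag[_]
}

class MyDecimalType(val precision: Int, val scale: Int) extends MyDataType {
  override protected def computeClassTag(): ClassTag[_] = classTag[BigDecimal]
}
```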
* Using the scala collections: the compiler was not able to differentiate between the Scala and Java APIs * Minor changes for the Snappy implementation of the executor. This is required as we need a classloader that also looks into the Snappy store for classes. * Revert "Using the scala collections". This reverts commit c2ab0c5. Conflicts: core/src/main/scala/org/apache/spark/executor/Executor.scala
Conflicts: core/src/main/scala/org/apache/spark/Partitioner.scala sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
- for all cases of implicit casts, convert to date or timestamp values instead of string when one side is a string - likewise, when one side is a timestamp and the other a date, both were being converted to string; now convert the date to timestamp Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
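A self-contained sketch of the coercion preference described above, with simplified type markers standing in for Catalyst's DataType objects; the rule shown is a paraphrase of the commit text, not the actual TypeCoercion code:

```scala
sealed trait SqlType
case object StringType    extends SqlType
case object DateType      extends SqlType
case object TimestampType extends SqlType

// Hypothetical common-type rule for comparisons: prefer date/timestamp over
// string, and timestamp over date, instead of falling back to string.
def comparisonType(left: SqlType, right: SqlType): Option[SqlType] = (left, right) match {
  case (l, r) if l == r                                           => Some(l)
  case (StringType, DateType) | (DateType, StringType)            => Some(DateType)
  case (StringType, TimestampType) | (TimestampType, StringType)  => Some(TimestampType)
  case (DateType, TimestampType) | (TimestampType, DateType)      => Some(TimestampType)
  case _                                                          => None
}
```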
- added an aggBufferAttributeForGroup to aggregates, to be used to avoid nullable checks in generated code for aggregate buffers in HashAggregateExec (if the aggregate is on zero rows, then there will be no row in the map); an accompanying "initialValuesForGroup" was added for the initial aggregation buffer values - use OpenHashMap in DictionaryEncoding, which is faster than a normal hash map; added clear methods to OpenHashMap/OpenHashSet for reuse - minor correction to a string in HiveUtils Conflicts: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
Used by the new benchmark from the PR, adapted for SnappyData for its vectorized implementation. Build updated to set testOutput and other variables instead of appending to existing values (appending causes a double append, since both the SnappyData build and this build add to them for their tests)
- don't apply numBuckets in Shuffle partitioning since Shuffle cannot create a compatible partitioning with matching numBuckets (only numPartitions) - check numBuckets too in HashPartitioning compatibility Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala
use a future to enforce a timeout (2x the configured value) on netty RPC transfers, after which the channel will be closed and the transfer will fail Conflicts: core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala
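A rough sketch of that timeout enforcement, using a Scala Future awaited at twice the configured timeout with a hypothetical closeChannel callback (the actual change sits inside NettyRpcEnv):

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Run a blocking transfer with a hard deadline of 2x the configured timeout;
// on expiry, close the channel and surface the failure to the caller.
def transferWithTimeout[T](configuredTimeout: FiniteDuration)
                          (transfer: => T, closeChannel: () => Unit): T = {
  val pending = Future(transfer)
  try {
    Await.result(pending, configuredTimeout * 2)
  } catch {
    case e: TimeoutException =>
      closeChannel()
      throw new RuntimeException("RPC transfer timed out, channel closed", e)
  }
}
```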
ashetkar pushed a commit that referenced this pull request on Apr 5, 2018
* Parse results of minikube status more rigorously. Prior code assumed the minikubeVM status line is always the first row of output from minikube status, which it is not when the version upgrade notifier prints an upgrade suggestion message. * Also filter the ip response down to the expected rows
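A hedged sketch of the more rigorous parsing: select the minikubeVM line by prefix rather than by position, so an upgrade-suggestion banner at the top of the output does not break parsing. The field prefix and helper names follow the commit text and are assumptions about the surrounding code:

```scala
// Find the VM status by key instead of assuming it is the first output line.
def parseMinikubeVmStatus(statusOutput: Seq[String]): Option[String] =
  statusOutput.map(_.trim)
    .find(_.startsWith("minikubeVM:"))
    .map(_.stripPrefix("minikubeVM:").trim)

// Likewise, filter the `minikube ip` output down to lines that look like an address.
def parseMinikubeIp(ipOutput: Seq[String]): Option[String] =
  ipOutput.map(_.trim).find(_.matches("""\d{1,3}(\.\d{1,3}){3}"""))
```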
reverting flag check optimization in Logging to be compatible with upstream Spark Conflicts: core/src/main/scala/org/apache/spark/MapOutputTracker.scala core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala core/src/main/scala/org/apache/spark/internal/Logging.scala core/src/main/scala/org/apache/spark/storage/BlockManager.scala core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
…kSession This is to allow override by SnappySession extensions. Conflicts: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala sql/core/src/test/scala/org/apache/spark/sql/DatasetSerializerRegistratorSuite.scala sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala
also added MemoryMode in MemoryPool warning message Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
avoid materializing it immediately (for point queries that won't use it) Conflicts: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
The standalone cluster should support unique application names. As they are user-visible and easy to track, users can write scripts to kill applications by name. Also added support to kill Spark applications by name (case-insensitive). Conflicts: core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
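A small sketch of case-insensitive lookup by application name on the master side; AppInfo is a simplified stand-in for the standalone master's ApplicationInfo, and the helpers are illustrative:

```scala
// Simplified application record; the real type is the standalone master's ApplicationInfo.
case class AppInfo(id: String, name: String)

// Case-insensitive match so kill-by-name scripts do not depend on exact casing.
def findAppsByName(running: Seq[AppInfo], name: String): Seq[AppInfo] =
  running.filter(_.name.equalsIgnoreCase(name))

// Uniqueness check a submission path could use before registering a new app.
def nameIsUnique(running: Seq[AppInfo], name: String): Boolean =
  !running.exists(_.name.equalsIgnoreCase(name))
```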
Handled join order in the optimization phase. Also removed custom changes in HashPartitioning: we won't store bucket information in HashPartitioning. Instead, based on the flag "linkPartitionToBucket", we can determine the number of partitions to be either numBuckets or the number of cores assigned to the executor. Reverted changes related to numBuckets in Snappy Spark. Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
What changes were proposed in this pull request?
Spark 2.3 merge
How was this patch tested?
Precheckin
Other PRs