
Spark 2.3 Merge #97

Open · wants to merge 142 commits into base: snappy/branch-2.3
Conversation

@ymahajan ymahajan commented Mar 8, 2018

What changes were proposed in this pull request?

Spark 2.3 merge

How was this patch tested?

Precheckin

Other PRs

Hemant Bhanawat and others added 30 commits February 21, 2018 14:17
This commit adds support for a pluggable cluster manager, and also allows a cluster manager to clean up tasks without taking the parent process down.

To plug in a new external cluster manager, implement the ExternalClusterManager trait. It returns the task scheduler and backend scheduler that SparkContext will use to schedule tasks. An external cluster manager is registered using the java.util.ServiceLoader mechanism (the same mechanism used to register data sources like parquet, json, jdbc, etc.), which allows implementations of the ExternalClusterManager interface to be auto-loaded.
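The discovery pattern described above can be sketched as follows. This is a minimal, self-contained illustration of the java.util.ServiceLoader mechanism; the interface and method names are hypothetical, not Spark's actual ExternalClusterManager API.

```java
import java.util.ServiceLoader;

public class ClusterManagerLookup {

    // A pluggable cluster manager advertises which master URLs it handles.
    public interface PluggableClusterManager {
        boolean canCreate(String masterURL);
    }

    // Scan the classpath for registered implementations (listed in
    // META-INF/services provider-configuration files) and pick the first
    // one that accepts the given master URL.
    public static PluggableClusterManager find(String masterURL) {
        for (PluggableClusterManager cm
                : ServiceLoader.load(PluggableClusterManager.class)) {
            if (cm.canCreate(masterURL)) {
                return cm;
            }
        }
        return null; // no registered manager matches this master URL
    }

    public static void main(String[] args) {
        // Nothing is registered in this self-contained sketch, so the
        // lookup falls through and returns null.
        System.out.println(find("snappydata://localhost:10334"));
    }
}
```

In the real mechanism, each implementation jar ships a `META-INF/services` file naming its implementation class, so adding a manager to the classpath is enough to make it discoverable.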

Currently, when a driver fails, executors exit using System.exit. This does not bode well for cluster managers that would like to reuse the parent process of an executor. Hence:

  1. System.exit is moved to a function that can be overridden in subclasses of CoarseGrainedExecutorBackend.
  2. Functionality is added to kill all the running tasks in an executor.

ExternalClusterManagerSuite.scala was added to test this patch.
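The "overridable exit" idea above can be sketched in a few lines. The class and method names here are illustrative stand-ins, not Spark's actual CoarseGrainedExecutorBackend API.

```java
public class ExecutorBackendSketch {

    static class BaseBackend {
        // Default behavior: terminate the JVM when the driver is lost.
        protected void exitExecutor(int code) {
            System.exit(code);
        }
    }

    // A cluster manager that embeds executors in a long-lived parent
    // process overrides the hook to clean up instead of exiting.
    static class EmbeddedBackend extends BaseBackend {
        int lastExitCode = -1;

        @Override
        protected void exitExecutor(int code) {
            lastExitCode = code; // record and clean up; keep the JVM alive
        }
    }

    public static void main(String[] args) {
        EmbeddedBackend backend = new EmbeddedBackend();
        backend.exitExecutor(1); // driver lost; the parent process survives
        System.out.println(backend.lastExitCode); // 1
    }
}
```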

Author: Hemant Bhanawat <[email protected]>

Closes apache#11723 from hbhanawat/pluggableScheduler.

Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
	core/src/main/scala/org/apache/spark/executor/Executor.scala
	core/src/main/scala/org/apache/spark/scheduler/ExternalClusterManager.scala
	core/src/test/resources/META-INF/services/org.apache.spark.scheduler.ExternalClusterManager
	core/src/test/scala/org/apache/spark/scheduler/ExternalClusterManagerSuite.scala
	dev/.rat-excludes
…se newly added ExternalClusterManager

With the addition of the ExternalClusterManager (ECM) interface in PR apache#11723, any cluster manager can now be integrated with Spark. It was suggested in the ExternalClusterManager PR that one of the existing cluster managers should start using the new interface to ensure that the API is correct. Ideally, all the existing cluster managers should eventually use the ECM interface, but as a first step YARN will now use the ECM interface. This PR refactors YARN code from the SparkContext.createTaskScheduler function into YarnClusterManager, which implements the ECM interface.

Since this is refactoring, no new tests have been added. Existing tests have been run. Basic manual testing with YARN was done too.

Author: Hemant Bhanawat <[email protected]>

Closes apache#12641 from hbhanawat/yarnClusterMgr.

Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/test/scala/org/apache/spark/SparkContextSchedulerCreationSuite.scala
Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala
…ocation is alive

Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
…ction

  * Increase visibility of complex type write methods (which perform code generation) in GenerateUnsafeProjection
    to allow use from outside.
  * Allow internal types (ArrayData, MapData, InternalRow) directly in Row->InternalRow conversions
    in CatalystTypeConverters for complex types.

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala
Fixing TakeOrderedAndProject, which contains a Sequence of Expressions inside an Option; the Spark plan
expression transformation used to skip it because it did not handle a Sequence inside an Option.

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala
- Adding a non-secure version of random UUID generation, adapted from Android's UUID.java
- Use the same for file names in DiskBlockManager, Utils methods, and WriteAheadLogBackedBlockRDD
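A non-secure version-4 UUID of the kind described above can be built from any java.util.Random by setting the version and variant bits by hand. This is an independent sketch of the idea, not the actual code adapted from Android's UUID.java.

```java
import java.util.Random;
import java.util.UUID;

public class FastUuid {

    // Build a version-4 (random) UUID without touching SecureRandom,
    // trading cryptographic strength for speed — acceptable for
    // internal file names, not for security tokens.
    public static UUID nonSecureRandomUUID(Random rnd) {
        long msb = rnd.nextLong();
        long lsb = rnd.nextLong();
        // Force version 4 into bits 12..15 of the most-significant half.
        msb = (msb & ~0xF000L) | 0x4000L;
        // Force the IETF variant (binary 10) into the top two bits of
        // the least-significant half.
        lsb = (lsb & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID id = nonSecureRandomUUID(new Random(42));
        // Reports version 4 and the IETF variant (2), like UUID.randomUUID().
        System.out.println(id.version() + " " + id.variant());
    }
}
```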

Conflicts:
	core/src/main/scala/org/apache/spark/util/Utils.scala

Conflicts:
	core/src/main/scala/org/apache/spark/util/Utils.scala
	streaming/src/main/scala/org/apache/spark/streaming/rdd/WriteAheadLogBackedBlockRDD.scala
* Added a method to bump up the expr id counter by a given number, so as to reserve a block of ExprIDs
* Optimized the declarative aggregate function to have predictable input buffer aggregate attribute references by using reservation in the ExprIDs being generated
* Changes to minimize the query plan size for bootstrap, and some more optimizations that aid in perf improvement
* Fixed scala style failures
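The reservation idea in the first bullet can be sketched with an atomic counter: claim a contiguous block of IDs up front so later allocations are predictable. The names below are illustrative, not Spark's internal ExprId machinery.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ExprIdReservation {
    private static final AtomicLong curId = new AtomicLong(0);

    // Atomically claim n consecutive IDs; returns the first ID of the
    // reserved block, so the caller owns [base, base + n).
    public static long reserve(int n) {
        return curId.getAndAdd(n);
    }

    // Ordinary one-at-a-time allocation continues after any reservations.
    public static long nextId() {
        return curId.getAndIncrement();
    }

    public static void main(String[] args) {
        long base = reserve(100); // IDs base .. base+99 are ours
        long next = nextId();     // allocation resumes past the block
        System.out.println(base + " " + next); // 0 100 when run fresh
    }
}
```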

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala
- Adding gradle build scripts with gradle wrapper invocation for all projects/subprojects
- Added product target to pack snappy-spark distribution (like dev/make-distribution.sh)
- Changes to make it compatible with top-level SnappyData build
- Add SnappyData modification headers to remaining modified files
- Fixed compilation and few test issues
- Allow registration of output streams on active StreamingContext
- Added flag to DStream.initialize to allow initialization of newly
  added output streams with zeroTime
- generatedRDDs made thread-safe
Allow for non-integral values as executionId (SnappyData uses full DistributedMember representation)
- in addition to properties with "spark." prefix, also accept "snappydata."
  prefix in system properties, spark submit/shell

Conflicts:
	launcher/src/main/java/org/apache/spark/launcher/SparkLauncher.java
- Make AbstractDataType.classTag lazy to optimize DataType object
  creation (e.g. DecimalType)
- Allow for more than 16 bytes in serialized Decimal objects
  (precision has been increased to 127 in SnappyData)
- Minor updates to build.gradle files, including a proper dependency for generateBuildInfo
  to avoid re-running it (and thus all dependent projects) every time
- Add modification headers for touched files (and remove from a couple of old
    ones which have since been reverted)
- Update dependencies as per the latest merge from branch-2.0
- Fix scalastyle errors
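The Decimal bullet above is easy to verify: the unscaled value of a 38-digit decimal (Spark's default maximum precision) fits in a 16-byte two's-complement encoding, but a 127-digit value does not. A quick check, assuming a plain BigInteger byte encoding of the unscaled value:

```java
import java.math.BigInteger;

public class DecimalWidth {

    // Bytes needed for the two's-complement encoding of the largest
    // unscaled value at the given decimal precision (all nines).
    static int serializedBytes(int precision) {
        return BigInteger.TEN.pow(precision)
                .subtract(BigInteger.ONE)
                .toByteArray().length;
    }

    public static void main(String[] args) {
        // Precision 38 fits in exactly 16 bytes; precision 127 overflows
        // any fixed 16-byte layout, hence the variable-length change.
        System.out.println(serializedBytes(38) + " " + serializedBytes(127));
    }
}
```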

Conflicts:
	sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java
	sql/catalyst/src/main/scala/org/apache/spark/sql/types/AbstractDataType.scala

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
* Using the Scala collections

The compiler was not able to differentiate between the Scala and Java APIs.

* Minor changes for the Snappy implementation of the executor

This is required because we need a classloader that also looks into the snappy store for classes.

* Revert "Using the scala collections"

This reverts commit c2ab0c5.

Conflicts:
	core/src/main/scala/org/apache/spark/executor/Executor.scala
Conflicts:
	core/src/main/scala/org/apache/spark/Partitioner.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala
 - For all cases of implicit casts, convert to date or timestamp values
   instead of string when one side is a string.
 - Likewise, when one side is a timestamp and the other a date, both were being
   converted to string; now convert the date to a timestamp.
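The coercion direction described above, sketched with java.time types: widen the date side to a timestamp (midnight) and compare in the timestamp domain, rather than turning both sides into strings and comparing lexically. Illustrative only; the real change lives in Catalyst's TypeCoercion rules.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

public class DateTimestampCoercion {

    // Promote a date to a timestamp at midnight UTC so that a
    // date-vs-timestamp comparison happens on one common type.
    static Instant dateToTimestamp(LocalDate d) {
        return d.atStartOfDay(ZoneOffset.UTC).toInstant();
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2018, 3, 8);
        Instant ts = Instant.parse("2018-03-08T10:15:00Z");
        // Midnight of the same day precedes 10:15 on that day.
        System.out.println(dateToTimestamp(date).isBefore(ts)); // true
    }
}
```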

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 - Added an aggBufferAttributeForGroup to aggregates, used to avoid nullability
   checks in generated code for aggregate buffers used in HashAggregateExec (if the aggregate
       is on zero rows, then there will be no row in the map); an accompanying "initialValuesForGroup"
   was added for initial aggregation buffer values
 - Use OpenHashMap in DictionaryEncoding, which is faster than a normal hash map;
   added clear methods to OpenHashMap/OpenHashSet for reuse
 - Minor correction in a string in HiveUtils
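The dictionary-encoding idea in the second bullet, reduced to its core: map each distinct value to a small integer code via a hash map, with a clear() for reuse across batches. This toy version uses java.util.HashMap; the actual change swaps in Spark's OpenHashMap for speed.

```java
import java.util.HashMap;
import java.util.Map;

public class DictionaryEncoder {
    private final Map<String, Integer> dict = new HashMap<>();

    // Map each value to an integer code, assigning codes on first sight,
    // so repeated values compress to small fixed-width codes.
    public int encode(String value) {
        return dict.computeIfAbsent(value, v -> dict.size());
    }

    // Reset for reuse across column batches, mirroring the added clear()
    // on OpenHashMap/OpenHashSet.
    public void clear() {
        dict.clear();
    }

    public static void main(String[] args) {
        DictionaryEncoder enc = new DictionaryEncoder();
        StringBuilder codes = new StringBuilder();
        for (String s : new String[] {"a", "b", "a", "c", "b"}) {
            codes.append(enc.encode(s));
        }
        System.out.println(codes); // 01021
    }
}
```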

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/metric/SQLMetrics.scala
	sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala
Used by the new benchmark from the PR, adapted for SnappyData's vectorized implementation.

Build updated to set testOutput and other variables instead of appending to existing values
(appending caused a double append, with both the SnappyData build and this build adding entries for its tests).
- don't apply numBuckets in Shuffle partitioning since Shuffle cannot create
  a compatible partitioning with matching numBuckets (only numPartitions)
- check numBuckets too in HashPartitioning compatibility
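The compatibility rule above can be modeled simply: two hash partitionings match only if their expressions, numPartitions, and now also numBuckets agree, and a shuffle can only ever produce numBuckets = 0. This is a hypothetical simplified model, not Spark's Partitioning classes.

```java
import java.util.List;

public class HashPartitioningCompat {
    final List<String> exprs;
    final int numPartitions;
    final int numBuckets; // 0 means "not bucketed"

    HashPartitioningCompat(List<String> exprs, int numPartitions, int numBuckets) {
        this.exprs = exprs;
        this.numPartitions = numPartitions;
        this.numBuckets = numBuckets;
    }

    boolean compatibleWith(HashPartitioningCompat other) {
        return exprs.equals(other.exprs)
            && numPartitions == other.numPartitions
            && numBuckets == other.numBuckets; // the added numBuckets check
    }

    public static void main(String[] args) {
        HashPartitioningCompat bucketed =
            new HashPartitioningCompat(List.of("col1"), 8, 8);
        HashPartitioningCompat shuffled =
            new HashPartitioningCompat(List.of("col1"), 8, 0);
        // Same expressions and partition count, but a shuffle cannot
        // reproduce the bucketing, so the sides are not compatible.
        System.out.println(bucketed.compatibleWith(shuffled)); // false
    }
}
```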

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala
rishitesh and others added 3 commits March 7, 2018 23:59
Use a future to enforce a timeout (2x the configured value) in netty RPC transfers,
after which the channel will be closed and the transfer will fail.
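The timeout-enforcement pattern can be sketched with a CompletableFuture: wait up to twice the configured timeout, and on expiry close the channel so the request fails fast instead of hanging. Names and the string results are illustrative, not the NettyRpcEnv code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RpcTimeoutSketch {

    // Block on the transfer for at most 2x the configured timeout;
    // on expiry, close the channel and report failure.
    public static String awaitOrClose(CompletableFuture<String> transfer,
                                      long configuredMillis,
                                      Runnable closeChannel) {
        try {
            return transfer.get(2 * configuredMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            closeChannel.run(); // give up on the stuck channel
            return "failed: timeout";
        } catch (InterruptedException | ExecutionException e) {
            return "failed: " + e;
        }
    }

    public static void main(String[] args) {
        // A future that never completes stands in for a hung connection.
        CompletableFuture<String> hung = new CompletableFuture<>();
        String result = awaitOrClose(hung, 25,
            () -> System.out.println("channel closed"));
        System.out.println(result);
    }
}
```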
Conflicts:
	core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala
ashetkar pushed a commit that referenced this pull request Apr 5, 2018
* Parse results of minikube status more rigorously

Prior code assumes the minikubeVM status line is always the first row output
from minikube status, and it is not when the version upgrade notifier prints
an upgrade suggestion message.

* Also filter ip response to expected rows
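The more rigorous parse can be sketched by selecting the status row by its key rather than by position, so a leading upgrade-suggestion banner no longer breaks it. The sample output lines below are illustrative.

```java
import java.util.Arrays;
import java.util.Optional;

public class MinikubeStatusParse {

    // Find the VM status by its "minikubeVM:" key anywhere in the output,
    // instead of assuming it is the first line.
    static Optional<String> vmStatus(String output) {
        return Arrays.stream(output.split("\n"))
            .filter(line -> line.startsWith("minikubeVM:"))
            .map(line -> line.substring("minikubeVM:".length()).trim())
            .findFirst();
    }

    public static void main(String[] args) {
        // An upgrade notice printed before the status line must not
        // confuse the parser.
        String output = "There is a newer version of minikube available.\n"
            + "minikubeVM: Running\n"
            + "localkube: Running";
        System.out.println(vmStatus(output).orElse("unknown")); // Running
    }
}
```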
ymahajan and others added 14 commits April 5, 2018 17:18
reverting flag check optimization in Logging to be compatible with upstream Spark

Conflicts:
	core/src/main/scala/org/apache/spark/MapOutputTracker.scala
	core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala
	core/src/main/scala/org/apache/spark/internal/Logging.scala
	core/src/main/scala/org/apache/spark/storage/BlockManager.scala
	core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala
…kSession

This is to allow override by SnappySession extensions.

Conflicts:
	external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSourceSuite.scala
	sql/core/src/test/scala/org/apache/spark/sql/DatasetSerializerRegistratorSuite.scala
	sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala
also added MemoryMode in MemoryPool warning message

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala
avoid materializing it immediately (for point queries that won't use it)

Conflicts:
	sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
The standalone cluster should support unique application names. Since names are user-visible and easy to track, users can write scripts to kill applications by name.
Also added support for killing Spark applications by name (case-insensitive).
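The case-insensitive matching can be sketched as a filter over running application names; the list and method names are hypothetical, not the master's actual data model.

```java
import java.util.List;
import java.util.stream.Collectors;

public class KillByName {

    // Select every running application whose name matches, ignoring case,
    // so a kill request for "etl-job" also finds "ETL-Job".
    static List<String> appsToKill(List<String> runningApps, String name) {
        return runningApps.stream()
            .filter(app -> app.equalsIgnoreCase(name))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> apps = List.of("ETL-Job", "Analytics");
        System.out.println(appsToKill(apps, "etl-job")); // [ETL-Job]
    }
}
```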
Conflicts:
	core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
Handled join order in the optimization phase. Also removed custom changes in HashPartitioning: we no longer store bucket information in HashPartitioning. Instead, based on the flag "linkPartitionToBucket", we determine the number of partitions to be either numBuckets or the number of cores assigned to the executor.
Reverted changes related to numBuckets in Snappy Spark.
Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala
	sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala

7 participants