Summary

Introduction
Overview of Apache Spark

Spark Core / Transferring Data Blocks In Spark Cluster
ShuffleClient — Contract to Fetch Shuffle Blocks BlockTransferService — Pluggable Block Transfers (To Fetch and Upload Blocks) ExternalShuffleClient NettyBlockTransferService — Netty-Based BlockTransferService NettyBlockRpcServer — NettyBlockTransferService’s RpcHandler BlockFetchingListener RetryingBlockFetcher BlockFetchStarter

Spark Core / Web UI
Web UI — Spark Application’s Web Console Jobs Stages Storage Environment Executors JobsTab AllJobsPage JobPage StagesTab — Stages for All Jobs AllStagesPage — Stages for All Jobs StagePage — Stage Details PoolPage — Pool Details StorageTab StoragePage RDDPage EnvironmentTab EnvironmentPage ExecutorsTab ExecutorsPage ExecutorThreadDumpPage SparkUI — Web UI of Spark Application SparkUITab BlockStatusListener Spark Listener EnvironmentListener Spark Listener ExecutorsListener Spark Listener JobProgressListener Spark Listener StorageStatusListener Spark Listener StorageListener — Spark Listener for Tracking Persistence Status of RDD Blocks RDDOperationGraphListener Spark Listener WebUI — Framework For Web UIs WebUIPage — Contract of Pages in Web UI WebUITab — Contract of Tabs in Web UI RDDStorageInfo RDDInfo LiveEntity LiveRDD UIUtils JettyUtils web UI Configuration Properties

Spark Core / Metrics
Spark Metrics MetricsSystem MetricsConfig — Metrics System Configuration Source — Contract of Metrics Sources Sink — Contract of Metrics Sinks MetricsServlet JSON Metrics Sink Metrics Configuration Properties

Spark Core / Status REST API
Status REST API — Monitoring Spark Applications Using REST API ApiRootResource — /api/v1 URI Handler ApplicationListResource — applications URI Handler OneApplicationResource — applications/appId URI Handler StagesResource OneApplicationAttemptResource AbstractApplicationResource BaseAppResource ApiRequestContext UIRoot — Contract for Root Containers of Application UI Information
UIRootFromServletContext

Spark MLlib
Spark MLlib — Machine Learning in Spark ML Pipelines (spark.ml) Pipeline PipelineStage Transformers Transformer Tokenizer Estimators Estimator StringIndexer KMeans TrainValidationSplit Predictor RandomForestRegressor Regressor LinearRegression Classifier RandomForestClassifier DecisionTreeClassifier Models Model Evaluator — ML Pipeline Component for Model Scoring BinaryClassificationEvaluator — Evaluator of Binary Classification Models ClusteringEvaluator — Evaluator of Clustering Models MulticlassClassificationEvaluator — Evaluator of Multiclass Classification Models RegressionEvaluator — Evaluator of Regression Models CrossValidator — Model Tuning / Finding The Best Model CrossValidatorModel ParamGridBuilder CrossValidator with Pipeline Example Params and ParamMaps ValidatorParams HasParallelism ML Persistence — Saving and Loading Models and Pipelines MLWritable MLReader Example — Text Classification Example — Linear Regression Logistic Regression LogisticRegression Latent Dirichlet Allocation (LDA) Vector LabeledPoint Streaming MLlib GeneralizedLinearRegression Alternating Least Squares (ALS) Matrix Factorization ALS — Estimator for ALSModel ALSModel — Model for Predictions ALSModelReader Instrumentation MLUtils

Spark Core / Tools
Spark Shell — spark-shell shell script Spark Submit — spark-submit shell script SparkSubmitArguments SparkSubmitOptionParser — spark-submit’s Command-Line Parser SparkSubmitCommandBuilder Command Builder spark-class shell script AbstractCommandBuilder SparkLauncher — Launching Spark Applications Programmatically

Spark Core / Architecture
Spark Architecture Driver Executor TaskRunner ExecutorSource Master Workers

Spark Core / RDD
Anatomy of Spark Application SparkConf — Programmable Configuration for Spark Applications Spark Properties and spark-defaults.conf Properties File Deploy Mode SparkContext HeartbeatReceiver RPC Endpoint Inside Creating SparkContext ConsoleProgressBar SparkStatusTracker
Local Properties — Creating Logical Job Groups RDD — Resilient Distributed Dataset RDD RDD Lineage — Logical Execution Plan TaskLocation ParallelCollectionRDD MapPartitionsRDD OrderedRDDFunctions CoGroupedRDD SubtractedRDD HadoopRDD NewHadoopRDD ShuffledRDD Operators Transformations PairRDDFunctions Actions Caching and Persistence StorageLevel Partitions and Partitioning Partition Partitioner HashPartitioner Shuffling Checkpointing CheckpointRDD RDD Dependencies NarrowDependency — Narrow Dependencies ShuffleDependency — Shuffle Dependencies Map/Reduce-side Aggregator AppStatusStore AppStatusPlugin AppStatusListener KVStore KVStoreView ElementTrackingStore InMemoryStore LevelDB InterruptibleIterator — Iterator With Support For Task Cancellation

Spark Core / Optimizations
Broadcast variables Accumulators AccumulatorContext

Spark Core / Services
SerializerManager MemoryManager — Memory Management UnifiedMemoryManager — Spark’s Memory Manager StaticMemoryManager — Legacy Memory Manager MemoryManager Configuration Properties SparkEnv — Spark Runtime Environment DAGScheduler — Stage-Oriented Scheduler Jobs Stage — Physical Unit Of Execution ShuffleMapStage — Intermediate Stage in Execution DAG ResultStage — Final Stage in Job StageInfo DAGSchedulerSource — Metrics Source for DAGScheduler DAGScheduler Event Bus JobListener JobWaiter TaskScheduler — Spark Scheduler Tasks ShuffleMapTask — Task for ShuffleMapStage ResultTask FetchFailedException MapStatus — Shuffle Map Output Status TaskSet — Set of Tasks for Stage TaskSetManager Schedulable Schedulable Pool Schedulable Builders FIFOSchedulableBuilder FairSchedulableBuilder Scheduling Mode — spark.scheduler.mode Spark Property TaskInfo TaskDescription — Metadata of Single Task TaskSchedulerImpl — Default TaskScheduler Speculative Execution of Tasks TaskResultGetter TaskContext TaskContextImpl TaskResults — DirectTaskResult and IndirectTaskResult TaskMemoryManager — Memory Manager of Single Task MemoryConsumer TaskMetrics
ShuffleWriteMetrics TaskSetBlacklist — Blacklisting Executors and Nodes For TaskSet SchedulerBackend — Pluggable Scheduler Backends CoarseGrainedSchedulerBackend DriverEndpoint — CoarseGrainedSchedulerBackend RPC Endpoint ExecutorBackend — Pluggable Executor Backends CoarseGrainedExecutorBackend MesosExecutorBackend BlockManager — Key-Value Store of Blocks of Data MemoryStore BlockEvictionHandler StorageMemoryPool MemoryPool DiskStore BlockDataManager RpcHandler RpcResponseCallback TransportRequestHandler TransportContext TransportServer TransportClientFactory MessageHandler BlockManagerMaster — BlockManager for Driver BlockManagerMasterEndpoint — BlockManagerMaster RPC Endpoint DiskBlockManager BlockInfoManager BlockInfo BlockManagerSlaveEndpoint DiskBlockObjectWriter BlockManagerSource — Metrics Source for BlockManager ShuffleMetricsSource — Metrics Source of BlockManager for Shuffle-Related Metrics StorageStatus ManagedBuffer MapOutputTracker — Shuffle Map Output Registry MapOutputTrackerMaster — MapOutputTracker For Driver MapOutputTrackerMasterEndpoint MapOutputTrackerWorker — MapOutputTracker for Executors ShuffleManager — Pluggable Shuffle Systems SortShuffleManager — The Default Shuffle System ExternalShuffleService OneForOneStreamManager ShuffleBlockResolver IndexShuffleBlockResolver ShuffleWriter BypassMergeSortShuffleWriter SortShuffleWriter UnsafeShuffleWriter — ShuffleWriter for SerializedShuffleHandle BaseShuffleHandle — Fallback Shuffle Handle BypassMergeSortShuffleHandle — Marker Interface for Bypass Merge Sort Shuffle Handles SerializedShuffleHandle — Marker Interface for Serialized Shuffle Handles ShuffleReader BlockStoreShuffleReader ShuffleBlockFetcherIterator ShuffleExternalSorter — Cache-Efficient Sorter ExternalSorter Serialization Serializer — Task SerDe SerializerInstance SerializationStream DeserializationStream ExternalClusterManager — Pluggable Cluster Managers BroadcastManager BroadcastFactory — Pluggable Broadcast Variable Factories 
TorrentBroadcastFactory TorrentBroadcast CompressionCodec ContextCleaner — Spark Application Garbage Collector CleanerListener Dynamic Allocation (of Executors) ExecutorAllocationManager — Allocation Manager for Spark Core ExecutorAllocationClient ExecutorAllocationListener ExecutorAllocationManagerSource HTTP File Server Data Locality Cache Manager OutputCommitCoordinator RpcEnv — RPC Environment RpcEndpoint RpcEndpointRef RpcEnvFactory Netty-based RpcEnv TransportConf — Transport Configuration Utils Helper Object

Spark Core / Security
Securing Web UI

Spark Deployment Environments
Deployment Environments — Run Modes Spark local (pseudo-cluster) LocalSchedulerBackend LocalEndpoint Spark on cluster

Spark on YARN
Spark on YARN YarnShuffleService — ExternalShuffleService on YARN ExecutorRunnable Client YarnRMClient ApplicationMaster AMEndpoint — ApplicationMaster RPC Endpoint YarnClusterManager — ExternalClusterManager for YARN TaskSchedulers for YARN YarnScheduler YarnClusterScheduler SchedulerBackends for YARN YarnSchedulerBackend YarnClientSchedulerBackend YarnClusterSchedulerBackend YarnSchedulerEndpoint RPC Endpoint YarnAllocator Introduction to Hadoop YARN Setting up YARN Cluster Kerberos ConfigurableCredentialManager ClientDistributedCacheManager YarnSparkHadoopUtil Settings

Spark Standalone
Spark Standalone Standalone Master — Cluster Manager of Spark Standalone Standalone Worker web UI ApplicationPage LocalSparkCluster — Single-JVM Spark Standalone Cluster Submission Gateways Management Scripts for Standalone Master Management Scripts for Standalone Workers Checking Status Example 2-workers-on-1-node Standalone Cluster (one executor per worker) StandaloneSchedulerBackend

Spark on Mesos
Spark on Mesos MesosCoarseGrainedSchedulerBackend About Mesos

Execution Model
Execution Model

Monitoring, Tuning and Debugging
Unified Memory Management Spark History Server HistoryServer — WebUI For Active And Completed Spark Applications SQLHistoryListener FsHistoryProvider — File-System-Based History Provider ApplicationHistoryProvider HistoryServerArguments ApplicationCacheOperations ApplicationCache Logging Performance Tuning SparkListener — Intercepting Events from Spark Scheduler LiveListenerBus ReplayListenerBus SparkListenerBus — Internal Contract for Spark Event Buses EventLoggingListener — Spark Listener for Persisting Events StatsReportListener — Logging Summary Statistics JsonProtocol Debugging Spark

Varia
Building Apache Spark from Sources Spark and Hadoop SparkHadoopUtil Spark and software in-memory file systems Spark and The Others Distributed Deep Learning on Spark Spark Packages

Interactive Notebooks
Interactive Notebooks Apache Zeppelin Spark Notebook

Spark Tips and Tricks
Spark Tips and Tricks Access private members in Scala in Spark shell SparkException: Task not serializable Running Spark Applications on Windows

Exercises
One-liners using PairRDDFunctions Learning Jobs and Partitions Using take Action Spark Standalone - Using ZooKeeper for High-Availability of Master Spark’s Hello World using Spark shell and Scala WordCount using Spark shell Your first complete Spark application (using Scala and sbt) Spark (notable) use cases Using Spark SQL to update data in Hive using ORC files Developing Custom SparkListener to monitor DAGScheduler in Scala Developing RPC Environment Developing Custom RDD Working with Datasets from JDBC Data Sources (and PostgreSQL) Causing Stage to Fail

Further Learning
Courses Books

(separate book) Spark SQL
Spark SQL — Batch and Streaming Queries Over Structured Data on Massive Scale

(separate book) Spark Structured Streaming
Spark Structured Streaming — Streaming Datasets

(obsolete) Spark Streaming
Spark Streaming — Streaming RDDs BlockRDD

(obsolete) Spark GraphX
Spark GraphX — Distributed Graph Computations Graph Algorithms