
Week 5: Batch Processing

5.1 Introduction

  • 🎥 5.1.1 Introduction to Batch Processing

  • 🎥 5.1.2 Introduction to Spark

5.2 Installation

Follow these instructions to install Spark:

And follow these instructions to run PySpark in Jupyter:

  • 🎥 5.2.1 (Optional) Installing Spark (Linux)

5.3 Spark SQL and DataFrames

  • 🎥 5.3.1 First Look at Spark/PySpark

  • 🎥 5.3.2 Spark Dataframes

  • 🎥 5.3.3 (Optional) Preparing Yellow and Green Taxi Data

Script to prepare the dataset: `download_data.sh`

Note

Another way to infer the schema for the CSV files (apart from using pandas) is to set the `inferSchema` option to `true` when reading the files in Spark.

  • 🎥 5.3.4 SQL with Spark

5.4 Spark Internals

  • 🎥 5.4.1 Anatomy of a Spark Cluster

  • 🎥 5.4.2 GroupBy in Spark

  • 🎥 5.4.3 Joins in Spark

5.5 (Optional) Resilient Distributed Datasets

  • 🎥 5.5.1 Operations on Spark RDDs

  • 🎥 5.5.2 Spark RDD mapPartition

5.6 Running Spark in the Cloud

  • 🎥 5.6.1 Connecting to Google Cloud Storage

  • 🎥 5.6.2 Creating a Local Spark Cluster

  • 🎥 5.6.3 Setting up a Dataproc Cluster

  • 🎥 5.6.4 Connecting Spark to Big Query
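Connecting a local Spark session to Google Cloud Storage follows this general shape. This is a configuration sketch, not runnable as-is: the connector jar, credentials file, and bucket path are placeholders, and the full set of connector properties is covered in the videos.

```python
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

credentials_location = "google_credentials.json"  # placeholder path

conf = SparkConf() \
    .setMaster("local[*]") \
    .setAppName("gcs-test") \
    .set("spark.jars", "gcs-connector-hadoop3-latest.jar") \
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
         credentials_location)

spark = SparkSession.builder.config(conf=conf).getOrCreate()

# With the connector on the classpath, gs:// paths work like local ones.
df = spark.read.parquet("gs://my-bucket/pq/green/*/*")  # placeholder bucket
```

On a Dataproc cluster the connector is preinstalled, so none of this configuration is needed there.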

Homework

Community notes

Did you take notes? You can share them here.