- 🎥 5.1.1 Introduction to Batch Processing
- 🎥 5.1.2 Introduction to Spark
Follow these instructions to install Spark:
And follow this guide to run PySpark in Jupyter:
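As a quick sanity check after installation, here is a minimal sketch of starting a local Spark session in a notebook (assuming PySpark is importable in the Jupyter kernel, e.g. via `pip install pyspark` or `findspark`):

```python
from pyspark.sql import SparkSession

# Start a local SparkSession using all available cores.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("test")
    .getOrCreate()
)

print(spark.version)  # confirms the session is up and shows the Spark version
```

While this session is running, the Spark UI is typically available at http://localhost:4040.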
- 🎥 5.2.1 (Optional) Installing Spark (Linux)
- 🎥 5.3.1 First Look at Spark/PySpark
- 🎥 5.3.2 Spark Dataframes
- 🎥 5.3.3 (Optional) Preparing Yellow and Green Taxi Data
Script to prepare the dataset: `download_data.sh`
Note: Another way to infer the schema for the CSV files (apart from using pandas) is to set the `inferSchema` option to `true` when reading the files in Spark.
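For illustration, a minimal sketch of that option (the file path below is a placeholder for wherever the taxi CSVs were downloaded):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("read-csv").getOrCreate()

df = (
    spark.read
    .option("header", "true")       # first row holds the column names
    .option("inferSchema", "true")  # let Spark sample the data and guess column types
    .csv("data/raw/green/green_tripdata_2021-01.csv")  # placeholder path
)

df.printSchema()  # compare the inferred types with the pandas-derived schema
```

Keep in mind that `inferSchema` requires an extra pass over the data, so explicitly defining a schema (e.g. with `types.StructType`) is usually faster and more predictable for large files.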
- 🎥 5.3.4 SQL with Spark
- 🎥 5.4.1 Anatomy of a Spark Cluster
- 🎥 5.4.2 GroupBy in Spark
- 🎥 5.4.3 Joins in Spark
- 🎥 5.5.1 Operations on Spark RDDs
- 🎥 5.5.2 Spark RDD mapPartition
- 🎥 5.6.1 Connecting to Google Cloud Storage
- 🎥 5.6.2 Creating a Local Spark Cluster
- 🎥 5.6.3 Setting up a Dataproc Cluster
- 🎥 5.6.4 Connecting Spark to Big Query
Did you take notes? You can share them here.