Here I'm going to experiment with new features of Apache Spark and document the results. I picked the GitHub format for two simple reasons:
- Keep a history of code changes and provide reproducible examples
- Provide an easy way to reuse the utility functions created along the way
This section focuses on Spark SQL extensions. A more powerful version of this feature was released in Spark 2.2, but I couldn't find any documentation that covers the details or provides examples.
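To make this concrete, here is a minimal sketch of the extension API, assuming Spark 2.2+. `LogPlanRule` is a hypothetical name I made up for illustration; the rule simply logs each logical plan and returns it unchanged, so injecting it cannot change query results:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical no-op rule: logs the logical plan it receives
// and returns it unchanged.
object LogPlanRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"Optimizer saw plan:\n$plan")
    plan
  }
}

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("extensions-demo")
  // withExtensions is the hook added in Spark 2.2 (SparkSessionExtensions)
  .withExtensions { extensions =>
    // register the rule as an extra optimizer rule
    extensions.injectOptimizerRule(session => LogPlanRule)
  }
  .getOrCreate()
```

`injectOptimizerRule` is only one of the injection points; the same API also exposes `injectResolutionRule`, `injectPlannerStrategy`, and `injectParser`, which target other phases of the pipeline described below.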
Let's take a look at the Catalyst architecture.
Spark SQL has two interfaces:
- DataFrame API
- SQL queries
and both of them are transformed into a logical plan. The main phases are listed below, with a short demo after the list:
- Logical plan analysis
- Logical plan optimization
- Conversion into a physical plan (SparkPlan)
- Conversion into RDD operations
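You can observe most of these phases yourself: `explain(true)` prints the parsed, analyzed, and optimized logical plans together with the physical plan. A small demo below runs the same query through both interfaces (the table name `records` and the data are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("catalyst-phases-demo")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
df.createOrReplaceTempView("records")

// The same query expressed through both interfaces
val viaApi = df.filter($"id" > 1).select($"value")
val viaSql = spark.sql("SELECT value FROM records WHERE id > 1")

// explain(true) prints the parsed, analyzed, and optimized logical plans,
// plus the physical plan, i.e. the phases listed above
viaApi.explain(true)
viaSql.explain(true)
```

The two outputs should converge to the same optimized plan, which is a quick way to confirm that the DataFrame API and SQL queries go through the same Catalyst pipeline.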