Here I'm going to experiment with new features of Apache Spark and document the results. I picked the GitHub format for two simple reasons:
- Keep a history of code changes and provide reproducible examples
- Provide an easy way to reuse the utility functions created along the way
This section focuses on Spark SQL extensions. A more powerful version of this feature was released in Spark 2.2, but I couldn't find any documentation that covers the details or provides examples.
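To make this concrete, here is a minimal sketch of the extension API, assuming Spark 2.2+. `LogPlanRule` is a hypothetical name I made up for illustration; the rule simply logs each logical plan and returns it unchanged, so injecting it cannot change query results:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical no-op rule: logs the logical plan it receives
// and returns it unchanged.
object LogPlanRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    logInfo(s"Optimizer saw plan:\n$plan")
    plan
  }
}

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("extensions-demo")
  // withExtensions is the hook added in Spark 2.2 (SparkSessionExtensions)
  .withExtensions { extensions =>
    // register the rule as an extra optimizer rule
    extensions.injectOptimizerRule(session => LogPlanRule)
  }
  .getOrCreate()
```

`injectOptimizerRule` is only one of the injection points; the same API also exposes `injectResolutionRule`, `injectPlannerStrategy`, and `injectParser`, which target other phases of the pipeline described below.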
Let's take a look at the Catalyst architecture.
Spark SQL has two interfaces:
- DataFrame API
- SQL queries
and both of them are transformed into a logical plan. The main phases are listed below, with a short demo after the list:
- Logical plan analysis
- Logical plan optimization
- Conversion into a physical plan (SparkPlan)
- Conversion into RDD operations
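You can observe most of these phases yourself: `explain(true)` prints the parsed, analyzed, and optimized logical plans together with the physical plan. A small demo below runs the same query through both interfaces (the table name `records` and the data are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("catalyst-phases-demo")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
df.createOrReplaceTempView("records")

// The same query expressed through both interfaces
val viaApi = df.filter($"id" > 1).select($"value")
val viaSql = spark.sql("SELECT value FROM records WHERE id > 1")

// explain(true) prints the parsed, analyzed, and optimized logical plans,
// plus the physical plan, i.e. the phases listed above
viaApi.explain(true)
viaSql.explain(true)
```

The two outputs should converge to the same optimized plan, which is a quick way to confirm that the DataFrame API and SQL queries go through the same Catalyst pipeline.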