Scope

The repo started out as ArrowFlightSql Server implementation with Apache Spark backend, but since then has evolved to encompass several independent components:

SparkFlightManager - a lower-level utility that enables easier development of any kind of distributed Arrow Flight servers on top of Apache Spark
SparkFlightSql - FlightSql implementation that builds on top of SparkFlightManager
FlightSql DataSourceV2 - Spark DataSourceV2 for FlightSql servers

SparkFlightManager

SparkFlightManager is a pluggable interface that governs how individual FlightServers communicate with each other. The goal is to abstract away distributed execution of Spark DataFrames from the business logic expressed in Flight Servers.

All the heavy lifting of conversion between a Spark DataFrame and Arrow Record Batches fortunately has already been done in Spark. DataFrame has a method toArrowBatchRdd (not part of public interface, though) that is used for Pandas UDFs.

ZookeeperSparkFlightManager is the only current implementation of SparkFlightManager. It requires an external Apache Zookeeper cluster. Flight servers use Zookeeper service to discover and communicate with each other once started. Metadata about running queries is also synchronized using Zookeeper.

For data exchange between individual Flight servers FlightManager uses an additional internal FlightServer that accompanies each public-facing one. When FlightManager is instructed from one of the FlightServers to distribute a Spark DataFrame to the available cluster, it starts Spark action from the requesting node but instead of collecting the results on the same node, DataFrame is instead converted to an RDD of ArrowRecordBatches and each partition calls doPut endpoint of the closest active internal FlightServer. (Note: Right now a random FlightServer is chosen) Therefore, the client receives FlightInfo containing tickets to all active FlightServers. Each FlightServer passes record batches that it's accompanying internal FlightServer received from executors to an appropriate client and the query finally completes once the FlightServer that started the query sends a notification that Spark action has completed.

spark.flight.manager - zookeeper
spark.flight.manager.zookeeper.url - localhost:10000
spark.flight.manager.zookeeper.membershipPath - "/spark-flight-sql"

There's a reference FlightServer implementation in com.tokoko.spark.flight.example.SparkParquetFlightProducer that illustrates how a simple parquet reader server can be implemented using SparkFlightManager.

SparkFlightSql

Goals

The goal of the project is to offer a SparkThriftServer alternative based on Arrow flight SQL protocol. SparkThriftServer has a number of limitations, mainly that it's a centralized server that needs to pass all query results to the client through a single Spark Driver process. SparkFlightSql aims to offer a distributed alternative where multiple servers will share the workload.

Metadata Queries

Metadata Queries are assumed not to require distributed serving and are always served with a single-endpoint FlightInfo.

Data Queries

Data Queries are assumed to require query distribution and therefore rely on SparkFlightManager. A client can submit a query to any active FlightServer. The server that receives the query will create a DataFrame and invoke SparkFlightManager to distribute query results. (see above)

Usage

A single node server can be started by running com.tokoko.spark.flight.SparkFlightSqlServer.

Alternatively, there is a Docker Compose file provided in dev folder that starts 2-node Spark Standalone Cluster with separate Hive Metastore. after running docker compose up, flight servers need to be started by docker exec -ti dev-spark-worker-b-1 bash /opt/spark-apps/run.sh and docker exec -ti dev-spark-worker-a-1 bash /opt/spark-apps/run.sh

Features

Feature	Status
Metadata Operations	In Progress
Read-Only Operations	In Progress
Distributed Server	In Progress
Prepared Statements	Planned
Streaming Queries	Planned
Spark UI Plugin	Planned
ZooKeeper Integration	In Progress
DML Operations	Planned
Basic Authentication	In Progress
LDAP Authentication	In Progress
Kerberos Authentication	Planned
Ranger Integration (Hive)	Planned
Storage Impersonation	Planned
Spark DataSourceV2	In Progress

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
dev		dev
project		project
src		src
.gitignore		.gitignore
README.md		README.md
build.sbt		build.sbt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Scope

SparkFlightManager

SparkFlightSql

Goals

Metadata Queries

Data Queries

Usage

Features

About

Releases

Packages

Languages

tokoko/SparkFlightSql

Folders and files

Latest commit

History

Repository files navigation

Scope

SparkFlightManager

SparkFlightSql

Goals

Metadata Queries

Data Queries

Usage

Features

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages