
Integration Plan

Dong Xie edited this page Jul 16, 2017 · 1 revision

This document provides the main design ideas for integrating CarbonData and Simba. In general, we want to use CarbonData as a persistence layer for distributed spatial data, so that Simba can read such data in a better layout and process different kinds of spatio-temporal queries. The integration plan has two stages: an early stage, where we simply connect CarbonData to Simba through the DataSource API, and an ultimate goal, where spatial index structures are pushed down into CarbonData to support features like insertion and update.

Early Stage Integration

As Simba is an extension of Spark SQL, we can simply connect CarbonData and Simba through the public DataSource API provided by Spark SQL. In the early stage of integration, we plan to implement such an interface, which enables Simba to pull data in from HDFS and push predicate filters and projections down as far as possible.

According to the current design of CarbonData, it is trivial to push down simple one-dimensional predicates, as a B-Tree is supported in the existing index structure. It is also possible to support some multi-dimensional predicates with techniques such as space-filling curves. We can try to follow the methods adopted in GeoMesa and see how they work in the case of CarbonData.
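To illustrate the space-filling-curve idea, the sketch below (plain Python; all names are hypothetical and unrelated to the actual CarbonData or GeoMesa code) interleaves the bits of two coordinates into a Z-order (Morton) key. Nearby points tend to receive nearby keys, so a two-dimensional range predicate can be approximated by a small number of one-dimensional key ranges that an existing B-Tree index can scan:

```python
def interleave(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two non-negative coordinates into a
    Z-order (Morton) key: x occupies the even bits, y the odd bits."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions
    return key

# Points that are close in 2-D space tend to get close Z-order keys,
# so a 2-D range filter can be rewritten as a few 1-D key ranges.
points = [(3, 5), (3, 6), (12, 1)]
keys = sorted(interleave(x, y) for x, y in points)
```

A real integration would map floating-point coordinates onto a fixed integer grid first, and would also need to compute the key ranges covering a query rectangle; this sketch shows only the key construction.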

Although it is not hard to implement the functionality above, this approach has limitations: it is difficult to support complex query plans (e.g., joins) and a cost-based optimizer. Besides, spatial indexes built by Simba will reside in Spark's native cache, creating an additional copy of the data. Finally, since CarbonData is designed as a distributed file format, it is not straightforward to support all functionality, including insertion and update, in memory.

Task Items:

  • Implement the DataSource API for CarbonData with Simba predicate pushdown supported. As CarbonData already has such an implementation for Spark SQL, we only need to modify it to fit the predicates introduced by Simba.
  • Hook the implemented DataSource API into `SparkStrategies` so that Simba can generate proper plans with RDDs and push spatial predicates down to CarbonData as far as possible.
  • (Potential) Try to support spatial data indexing using space-filling curves and the B-Tree provided by CarbonData. In addition, implement algorithms based on such space-filling-curve indexes to support simple spatial queries.
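The division of labor behind predicate pushdown can be sketched as follows. This is a conceptual, standalone model of the contract behind Spark's `PrunedFilteredScan`-style sources, not CarbonData's actual implementation: the data source evaluates the predicates it can handle (here, only single-column equality) and reports the rest back to the engine, which must re-apply them on top of the scan. All names are illustrative.

```python
def scan(rows, required_columns, filters):
    """Simulate a pruned, filtered scan: apply supported filters at the
    source, project only the requested columns, and return the filters
    the source could not handle so the engine can re-apply them."""
    supported = [f for f in filters if f[1] == "="]   # pushed down
    unhandled = [f for f in filters if f[1] != "="]   # engine re-applies
    out = []
    for row in rows:
        if all(row[col] == val for col, _, val in supported):
            out.append({c: row[c] for c in required_columns})
    return out, unhandled

rows = [{"id": 1, "x": 10.0}, {"id": 2, "x": 20.0}]
result, leftover = scan(rows, ["id"], [("id", "=", 2), ("x", "<", 30.0)])
```

For Simba, the interesting design question is extending the set of "supported" filters to include spatial predicates (range, circle range, and so on) that CarbonData's indexes can evaluate.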

Ultimate Goal

The ultimate goal of this integration is to ensure that Simba gains the full benefits of CarbonData as a persistent storage layer, in both performance and functionality. Meanwhile, CarbonData, as a data format, benefits from the support for complex query types and flexible query expression that Simba provides out of the box.

Towards this ultimate goal, CarbonData needs to be more flexible in its index structures. Specifically, CarbonData should support different partition strategies and index structures while allowing users to choose among them in advance. Besides, customizable secondary indexes are also important for supporting a wide variety of query types.

Next, supporting updates will be tricky if the Spark cache is involved. Note that an RDD is an abstraction designed to be immutable. Caching data or results in the Spark cache would be a bad idea, since a single update would force us to cache all the blocks again. One possible solution is to separate incremental updates into different RDDs, so that the union of all those RDDs is the proper representation of the dataset in the system. To process queries with such an abstraction, algorithms need to run on disjoint parts of the data set, after which a merge procedure is invoked to obtain the final result. Another possible solution is to bypass the Spark cache and build our own main-memory management mechanism. Nevertheless, this would require deep hacks into the Spark kernel, which is not preferred.
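The first solution above (run on disjoint parts, then merge) can be sketched with a k-NN query. This is a minimal, single-machine illustration of the run-then-merge pattern, not Simba's actual distributed implementation: the dataset is the union of an immutable base part and small delta parts holding later insertions; the query runs independently on each part, and a merge step keeps the k globally nearest results.

```python
import heapq

def knn_part(part, q, k):
    """Local k-NN on one part, by squared Euclidean distance to q."""
    dist = lambda p: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return heapq.nsmallest(k, part, key=dist)

def knn(parts, q, k):
    """Global k-NN: query every part, then merge the local candidates."""
    candidates = [p for part in parts for p in knn_part(part, q, k)]
    dist = lambda p: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return heapq.nsmallest(k, candidates, key=dist)

base = [(0, 0), (5, 5), (9, 9)]
delta = [(1, 1)]          # a later insertion, kept as its own part
result = knn([base, delta], q=(0, 0), k=2)
```

The merge is cheap because each part contributes at most k candidates; the cost of the scheme is that every query must touch every delta part, which is why periodically compacting the deltas back into the base would matter in practice.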

From Simba’s perspective, new Spark strategies and cost-based optimizations need to be constructed to adopt the features provided by CarbonData. More specifically, we should explore how much computation we are able and willing to push down into CarbonData. In essence, this determines at which stage we extract results from CarbonData as RDDs and hand them over to Simba.

Finally, exploring support for more general geometric objects such as polygons and trajectories is also an important long-term goal for covering more use cases in the future. This goal involves designing algorithms and mechanisms for partitioning, indexing, and query processing in a distributed environment. In addition, designing a user-friendly query language or public interface is also of great importance.

Task Items:

  • Find a way to adopt multi-dimensional index structures in CarbonData. A good starting point would be designing storage formats for a distributed KD-Tree or R-Tree. We should also provide an option for users to choose which index type they want for their primary (or secondary) indexes.
  • Build a layer between Simba and CarbonData, similar to a traditional buffer pool, so that we can avoid duplicating the whole data set in the Spark cache. Such a layer will also benefit the implementation of data insertion and incremental updates, since it can serve as a conventional buffer.
  • Implement new physical plans in Simba so that it can talk directly to CarbonData and collaborate with it. Besides, more sophisticated cost-based optimization should be expected for these new plans.
  • Explore possibilities to support general geometric objects such as polygons and trajectories. This will include designing new partitioning strategies, indexing methods, and query algorithms.
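As one concrete reference point for the partitioning item above, the sketch below shows a sort-tile-recursive (STR) style split, commonly used to bulk-load R-Trees: sort points by x, cut into vertical slices, then sort each slice by y and cut again. This is only an illustration of the partitioning step under assumed names, not a proposal for CarbonData's storage format.

```python
def str_partition(points, nx, ny):
    """Split points into roughly nx * ny spatially coherent partitions
    using a sort-tile-recursive style pass (sort by x, slice, then
    sort each slice by y and slice again)."""
    pts = sorted(points, key=lambda p: p[0])
    per_col = -(-len(pts) // nx)            # ceiling division
    parts = []
    for i in range(0, len(pts), per_col):
        col = sorted(pts[i:i + per_col], key=lambda p: p[1])
        per_cell = -(-len(col) // ny)
        for j in range(0, len(col), per_cell):
            parts.append(col[j:j + per_cell])
    return parts

# A 4x4 grid of points split into 2x2 partitions: each partition
# ends up covering one quadrant of the space.
grid = [(x, y) for x in range(4) for y in range(4)]
parts = str_partition(grid, nx=2, ny=2)
```

Each resulting partition's bounding box could then serve as the entry of a coarse-grained distributed index, which is the role the KD-Tree/R-Tree storage formats above would play.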