Support structure IO format on Spark #11

Open · advancedxy opened this issue Nov 27, 2017 · 1 comment

@advancedxy (Collaborator) commented:

Definitions

Structured input formats here specifically mean ORC files and Parquet files.

Current Status

Bigflow on DCE supports ORC files (read only) and Parquet files through its own loaders, since DCE doesn't support reading ORC or Parquet natively.

For ORC files, Bigflow uses ORC's C++ API. At the time ORC support was added, ORC's C++ API only supported reading.

For Parquet files, Bigflow also uses the C++ API. Currently, parquet-cpp only partially supports nested structures.

Bigflow on Spark supports neither ORC nor Parquet for now. This doc details how we can add support for ORC and Parquet files.

Parquet Support Architecture Overview on DCE

[figure: parquet_architecture, the Parquet loader architecture on DCE]

The ORC loader follows a similar procedure.

How to add support for the Spark pipeline

Read support

The RecordBatch in the architecture above is an Arrow
RecordBatch. Spark already supports transforming a Dataset into RDD[ArrowPayload]
(see Dataset.scala), though the API is not public.

It would be straightforward to add Parquet read support to the Spark pipeline, and even support for ORC or CSV files.
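As a minimal sketch (assuming Spark 2.x; StructuredRead is an illustrative name, not existing Bigflow code), reading structured files through the public SparkSession API while reusing the pipeline's existing SparkContext could look like:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}

object StructuredRead {
  // Reuse the pipeline's existing SparkContext rather than creating a new one.
  private def session(sc: SparkContext): SparkSession =
    SparkSession.builder.config(sc.getConf).getOrCreate()

  def readParquet(sc: SparkContext, path: String): DataFrame =
    session(sc).read.parquet(path)

  // Note: before Spark 2.3 the ORC source may require Hive support.
  def readOrc(sc: SparkContext, path: String): DataFrame =
    session(sc).read.orc(path)
}
```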

Implementation details for adding read support

  1. Use SparkSession to read Parquet or ORC files (the Spark pipeline currently uses SparkContext)
  2. Implement toArrowPayload in flume-runtime, as Spark doesn't expose it publicly (see the shim sketch after this list)
  3. Reuse and refactor the current PythonFromRecordBatchProcessor
  4. Modify Bigflow's planner to use PythonFromRecordBatchProcessor for the Spark pipeline's structured input when constructing the Flume task
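For step 2, a hedged sketch of how flume-runtime could expose the conversion: as of Spark 2.3, Dataset.toArrowPayload and ArrowPayload are private[sql], so a small shim compiled into the org.apache.spark.sql package can re-export them (these internal names may change across Spark versions; FlumeArrowConverter is a hypothetical name):

```scala
package org.apache.spark.sql

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.execution.arrow.ArrowPayload

object FlumeArrowConverter {
  // Compiles only because this object lives in the same package as Dataset,
  // which is what grants access to the private[sql] method and class.
  def toArrowPayload(df: DataFrame): RDD[ArrowPayload] = df.toArrowPayload
}
```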

Write support

Bigflow uses its own sinker implementations to write a PCollection (or PType) to an external target.

The current implementation on DCE should also work on Spark, although some additional work is
needed, namely:

  1. Refactor the current ParquetSinker and the Arrow schema converter (a toy converter sketch follows this list)
  2. Add write support for ORC files (ORC's C++ API is adding write support incrementally)
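To illustrate the schema-converter part of step 1, here is a toy sketch mapping an Arrow schema to a Spark StructType (Spark carries a similar internal conversion; this standalone version covers only a few primitive types and is an assumption, not the existing Bigflow converter):

```scala
import scala.collection.JavaConverters._
import org.apache.arrow.vector.types.FloatingPointPrecision
import org.apache.arrow.vector.types.pojo.{ArrowType, Schema => ArrowSchema}
import org.apache.spark.sql.types._

object ArrowSchemaConverter {
  // Map a handful of Arrow primitive types to Spark SQL types; everything
  // else (lists, structs, decimals, ...) is left unsupported in this toy.
  private def toSparkType(t: ArrowType): DataType = t match {
    case i: ArrowType.Int if i.getBitWidth == 32 => IntegerType
    case i: ArrowType.Int if i.getBitWidth == 64 => LongType
    case f: ArrowType.FloatingPoint
        if f.getPrecision == FloatingPointPrecision.DOUBLE => DoubleType
    case _: ArrowType.Utf8 => StringType
    case _: ArrowType.Bool => BooleanType
    case other => throw new UnsupportedOperationException(s"Arrow type: $other")
  }

  def toSparkSchema(schema: ArrowSchema): StructType =
    StructType(schema.getFields.asScala.map { field =>
      StructField(field.getName, toSparkType(field.getType), field.isNullable)
    })
}
```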

References

  1. Apache Arrow is a promising in-memory columnar format; we can leverage more of
    its power. See the Arrow SlideShare.

cc @himdd @chunyang-wen @bb7133 @acmol. Comments and PRs are appreciated.

@advancedxy (Collaborator, Author) commented:

@chunyang-wen apache/orc#188: it looks like ORC has finished write support in its C++ API.
