Definitions
Structured input formats here specifically mean ORC files and Parquet files.
Current Status
Bigflow on DCE supports ORC files (read only) and Parquet files through its own loaders, as DCE doesn't support reading ORC or Parquet natively.
For ORC files, Bigflow uses ORC's C++ API. At the time ORC support was added, ORC's C++ API only supported reading.
For Parquet files, Bigflow also uses the C++ API (parquet-cpp). Currently, parquet-cpp only partially supports nested structures.
Bigflow on Spark supports neither ORC nor Parquet for now. This doc lists some details on how we can add support for ORC and Parquet files.
Parquet Support Architecture Overview on DCE
The ORC loader follows a similar procedure.
How to add support for the Spark pipeline
Read support
The RecordBatch in the previous architecture is an Arrow RecordBatch. Spark already supports transforming a Dataset into an RDD[ArrowPayload] (see Dataset.scala), though this is not exposed publicly.
It would be straightforward to add Parquet read support to the Spark pipeline, and even ORC or CSV support as well.
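For illustration, a minimal sketch of obtaining such Datasets with SparkSession's public reader API (the paths and options below are placeholders, not actual Bigflow inputs):

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: load structured inputs as DataFrames through the public reader API.
// Paths are placeholders; ORC may additionally require Hive support or the native
// ORC reader depending on the Spark version.
val spark = SparkSession.builder().appName("bigflow-structured-input").getOrCreate()

val parquetDf = spark.read.parquet("hdfs:///path/to/input.parquet")
val orcDf     = spark.read.orc("hdfs:///path/to/input.orc")
val csvDf     = spark.read.option("header", "true").csv("hdfs:///path/to/input.csv")
```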
Impl details to add read support
Use SparkSession to read Parquet or ORC files (the Spark pipeline currently uses SparkContext)
Implement toArrowPayload in flume-runtime, since Spark doesn't expose it publicly (a shim sketch follows this list)
Reuse and refactor the current PythonFromRecordBatchProcessor
Modify Bigflow's planner to use PythonFromRecordBatchProcessor for the Spark pipeline's structured input when constructing the Flume task
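A minimal sketch of the toArrowPayload shim mentioned above, assuming Spark 2.3-era internals where Dataset.toArrowPayload and ArrowPayload.asPythonSerializable exist (both are internal and their exact shape varies across Spark versions; FlumeArrowShim is a hypothetical name):

```scala
// Hypothetical shim living in flume-runtime but declared inside the
// org.apache.spark.sql package so it can reach the private[sql] members.
package org.apache.spark.sql

import org.apache.spark.rdd.RDD

object FlumeArrowShim {
  // Forward the package-private Dataset -> RDD[ArrowPayload] conversion and
  // serialize each payload to bytes so Bigflow's backend can feed them to
  // PythonFromRecordBatchProcessor. Both calls are Spark internals and may
  // differ by version.
  def toArrowBatchBytes(ds: Dataset[_]): RDD[Array[Byte]] =
    ds.toArrowPayload.map(_.asPythonSerializable)
}
```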
Write support
Bigflow uses its own sinker implementation to write a PCollection (or PType) to an external target.
The current impl on DCE should also work on Spark, although some additional work is needed, namely:
Refactor the current ParquetSinker and Arrow schema converter (a conversion sketch follows this list)
Add write support for ORC files (ORC's C++ API is adding write support incrementally)
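As a rough sketch of the schema-conversion piece (illustrative only; Bigflow's existing converter targets Arrow's C++ API, and the type coverage here is deliberately minimal), a JVM-side converter from a Spark StructType to an Arrow Schema could look like this:

```scala
import java.util.Collections

import scala.collection.JavaConverters._

import org.apache.arrow.vector.types.FloatingPointPrecision
import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema}
import org.apache.spark.sql.types._

// Illustrative sketch of a StructType -> Arrow Schema converter; only a handful
// of primitive types are mapped, and nested types are left out on purpose.
object ArrowSchemaConverterSketch {
  private def toArrowType(dt: DataType): ArrowType = dt match {
    case IntegerType => new ArrowType.Int(32, true)
    case LongType    => new ArrowType.Int(64, true)
    case DoubleType  => new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)
    case StringType  => new ArrowType.Utf8()
    case BooleanType => new ArrowType.Bool()
    case other       => throw new UnsupportedOperationException(s"unsupported type: $other")
  }

  def toArrowSchema(schema: StructType): Schema = {
    val fields = schema.fields.map { f =>
      new Field(f.name, new FieldType(f.nullable, toArrowType(f.dataType), null),
        Collections.emptyList[Field]())
    }
    new Schema(fields.toList.asJava)
  }
}
```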
References
Apache Arrow is a promising in-memory columnar format; we can leverage more of its power. See the Arrow SlideShare.
cc @himdd @chunyang-wen @bb7133 @acmol for comments; PRs are appreciated.