Support structure IO format on Spark #11

Open · advancedxy opened this issue Nov 27, 2017 · 1 comment

@advancedxy (Collaborator) commented:

Definitions

Structured input formats here specifically mean ORC files and Parquet files.

Current Status

Bigflow on DCE supports ORC files (read only) and Parquet files through its own loaders, since DCE doesn't support reading ORC or Parquet natively.

For ORC files, Bigflow uses ORC's C++ API. At the time ORC support was added, ORC's C++ API only supported reading.

For Parquet files, Bigflow also uses the C++ API. Currently, parquet-cpp only partially supports nested structures.

Bigflow on Spark supports neither ORC nor Parquet for now. This doc details how we can add support for ORC and Parquet files.

Parquet Support Architecture Overview on DCE

[figure: parquet_architecture, the Parquet loader architecture on DCE]

The ORC loader follows a similar procedure.

How to add support for the Spark pipeline

Read support

The RecordBatch in the architecture above is an Arrow
RecordBatch. Spark already supports transforming a Dataset into RDD[ArrowPayload]
(see Dataset.scala), though the API is not public.

It would be straightforward to add Parquet read support to the Spark pipeline, and even support for ORC or CSV files.
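As a minimal sketch (assuming Spark 2.x; StructuredRead is an illustrative name, not existing Bigflow code), reading structured files through the public SparkSession API while reusing the pipeline's existing SparkContext could look like:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}

object StructuredRead {
  // Reuse the pipeline's existing SparkContext rather than creating a new one.
  private def session(sc: SparkContext): SparkSession =
    SparkSession.builder.config(sc.getConf).getOrCreate()

  def readParquet(sc: SparkContext, path: String): DataFrame =
    session(sc).read.parquet(path)

  // Note: before Spark 2.3 the ORC source may require Hive support.
  def readOrc(sc: SparkContext, path: String): DataFrame =
    session(sc).read.orc(path)
}
```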

Implementation details for adding read support

  1. Use SparkSession to read Parquet or ORC files (the Spark pipeline currently uses SparkContext)
  2. Implement toArrowPayload in flume-runtime, as Spark doesn't expose it publicly (see the shim sketch after this list)
  3. Reuse and refactor the current PythonFromRecordBatchProcessor
  4. Modify Bigflow's planner to use PythonFromRecordBatchProcessor for the Spark pipeline's structured input when constructing the Flume task
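For step 2, a hedged sketch of how flume-runtime could expose the conversion: as of Spark 2.3, Dataset.toArrowPayload and ArrowPayload are private[sql], so a small shim compiled into the org.apache.spark.sql package can re-export them (these internal names may change across Spark versions; FlumeArrowConverter is a hypothetical name):

```scala
package org.apache.spark.sql

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.execution.arrow.ArrowPayload

object FlumeArrowConverter {
  // Compiles only because this object lives in the same package as Dataset,
  // which is what grants access to the private[sql] method and class.
  def toArrowPayload(df: DataFrame): RDD[ArrowPayload] = df.toArrowPayload
}
```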

Write support

Bigflow uses its own sinker implementations to write a PCollection (or PType) to an external target.

The current implementation on DCE should also work on Spark, although some additional work is
needed, namely:

  1. Refactor the current ParquetSinker and the Arrow schema converter (a toy converter sketch follows this list)
  2. Add write support for ORC files (ORC's C++ API is adding write support incrementally)
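To illustrate the schema-converter part of step 1, here is a toy sketch mapping an Arrow schema to a Spark StructType (Spark carries a similar internal conversion; this standalone version covers only a few primitive types and is an assumption, not the existing Bigflow converter):

```scala
import scala.collection.JavaConverters._
import org.apache.arrow.vector.types.FloatingPointPrecision
import org.apache.arrow.vector.types.pojo.{ArrowType, Schema => ArrowSchema}
import org.apache.spark.sql.types._

object ArrowSchemaConverter {
  // Map a handful of Arrow primitive types to Spark SQL types; everything
  // else (lists, structs, decimals, ...) is left unsupported in this toy.
  private def toSparkType(t: ArrowType): DataType = t match {
    case i: ArrowType.Int if i.getBitWidth == 32 => IntegerType
    case i: ArrowType.Int if i.getBitWidth == 64 => LongType
    case f: ArrowType.FloatingPoint
        if f.getPrecision == FloatingPointPrecision.DOUBLE => DoubleType
    case _: ArrowType.Utf8 => StringType
    case _: ArrowType.Bool => BooleanType
    case other => throw new UnsupportedOperationException(s"Arrow type: $other")
  }

  def toSparkSchema(schema: ArrowSchema): StructType =
    StructType(schema.getFields.asScala.map { field =>
      StructField(field.getName, toSparkType(field.getType), field.isNullable)
    })
}
```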

References

  1. Apache Arrow is a promising in-memory columnar format; we can leverage more of
    its power. See the Arrow SlideShare.

cc @himdd @chunyang-wen @bb7133 @acmol. Comments and PRs are appreciated.

@advancedxy (Collaborator, Author) commented:

@chunyang-wen apache/orc#188: it looks like ORC has finished write support in its C++ API.
