Data Processing Overview

The Data Processing Framework is Python-based and enables the application of "transforms" to one or more input data files to produce one or more output data files. Various runtimes are available to execute the transforms, using a common methodology and mechanism to configure input and output across either local or S3-based storage.

The framework allows simple 1:1 transformation of (parquet) files, but also enables more complex transformations requiring coordination among transforming nodes. This might include operations such as de-duplication, merging, and splitting. The framework uses a plug-in model for the primary functions. The core transformation-specific classes/interfaces are as follows:

  • AbstractBinaryTransform - a simple, easily implemented interface allowing the definition of transforms of arbitrary data as a byte array. Additionally, a table transform interface is provided, allowing the definition of transforms operating on pyarrow tables.
  • TransformConfiguration - defines the transform short name, its implementation class, and command line configuration parameters.
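
The plug-in model described above can be pictured with a minimal sketch. The names below (`BinaryTransformSketch`, `transform_binary`, `UppercaseTransform`, `TransformConfigurationSketch`) are hypothetical stand-ins chosen for illustration and are not the framework's actual signatures. The sketch assumes only that a transform receives a byte array and returns zero or more output byte arrays plus some metadata, and that a configuration object pairs a short name with an implementation class and its parameters.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


class BinaryTransformSketch(ABC):
    """Hypothetical stand-in for a byte-array transform interface."""

    def __init__(self, config: dict[str, Any]):
        self.config = config

    @abstractmethod
    def transform_binary(self, file_name: str, data: bytes) -> tuple[list[bytes], dict[str, Any]]:
        """Return the output byte arrays and a metadata dictionary."""


class UppercaseTransform(BinaryTransformSketch):
    """Toy 1:1 transform: upper-cases text held in a byte array."""

    def transform_binary(self, file_name, data):
        encoding = self.config.get("encoding", "utf-8")
        result = data.decode(encoding).upper().encode(encoding)
        return [result], {"files_processed": 1, "bytes_in": len(data)}


@dataclass
class TransformConfigurationSketch:
    """Hypothetical pairing of a short name, an implementation class, and parameters."""

    name: str
    transform_class: type[BinaryTransformSketch]
    params: dict[str, Any] = field(default_factory=dict)

    def build(self) -> BinaryTransformSketch:
        # Instantiate the configured transform with its parameters.
        return self.transform_class(self.params)


config = TransformConfigurationSketch("upper", UppercaseTransform, {"encoding": "utf-8"})
transform = config.build()
outputs, metadata = transform.transform_binary("example.txt", b"hello")
print(config.name, outputs, metadata)
```

Returning a list of outputs rather than a single value is a design choice in this sketch: it lets the same interface express a 1:1 transform, a split (several outputs), or a filter (no outputs).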

To support running a transform over a set of input data in a runtime, the following classes/interfaces are provided:

  • AbstractTransformLauncher - is the central runtime interface expected to be implemented by each runtime (Python, Ray, Spark, etc.) to apply a transform to a set of data. It is configured with a TransformRuntimeConfiguration and a DataAccessFactory instance (see below).
  • DataAccessFactory - is used to configure the input and output data files to be processed and creates the DataAccess instance (see below) according to the CLI parameters.
  • TransformRuntimeConfiguration - captures the TransformConfiguration and runtime-specific configuration.
  • DataAccess - is the interface defining data i/o methods and selection. Implementations for local and S3 storage are provided.
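
How these runtime pieces fit together can be sketched roughly as follows. Every class, method, and command-line flag here is hypothetical (`DataAccessSketch`, `InMemoryDataAccess`, `DataAccessFactorySketch`, `SequentialLauncherSketch`, `--input_folder`, `--output_folder`); an in-memory dictionary stands in for local or S3 storage, and a plain callable stands in for the transform and its runtime configuration. It is an illustration of the division of responsibilities, not the framework's implementation.

```python
import argparse
from abc import ABC, abstractmethod
from typing import Callable


class DataAccessSketch(ABC):
    """Hypothetical data I/O interface: file selection, read, and write."""

    @abstractmethod
    def list_inputs(self) -> list[str]: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def output_path_for(self, input_path: str) -> str: ...


class InMemoryDataAccess(DataAccessSketch):
    """A dict of path -> bytes stands in for local or S3 storage."""

    def __init__(self, store: dict[str, bytes], input_prefix: str, output_prefix: str):
        self.store = store
        self.input_prefix = input_prefix
        self.output_prefix = output_prefix

    def list_inputs(self) -> list[str]:
        return sorted(p for p in self.store if p.startswith(self.input_prefix))

    def read(self, path: str) -> bytes:
        return self.store[path]

    def write(self, path: str, data: bytes) -> None:
        self.store[path] = data

    def output_path_for(self, input_path: str) -> str:
        return self.output_prefix + input_path[len(self.input_prefix):]


class DataAccessFactorySketch:
    """Defines CLI parameters and builds a DataAccessSketch from them."""

    def __init__(self, store: dict[str, bytes]):
        self.store = store

    def add_arguments(self, parser: argparse.ArgumentParser) -> None:
        parser.add_argument("--input_folder", required=True)
        parser.add_argument("--output_folder", required=True)

    def create(self, args: argparse.Namespace) -> DataAccessSketch:
        return InMemoryDataAccess(self.store, args.input_folder, args.output_folder)


class SequentialLauncherSketch:
    """Hypothetical single-process runtime: applies the transform file by file."""

    def __init__(self, transform: Callable[[bytes], bytes], factory: DataAccessFactorySketch):
        self.transform = transform
        self.factory = factory

    def launch(self, argv: list[str]) -> int:
        parser = argparse.ArgumentParser()
        self.factory.add_arguments(parser)
        data_access = self.factory.create(parser.parse_args(argv))
        processed = 0
        for path in data_access.list_inputs():
            result = self.transform(data_access.read(path))
            data_access.write(data_access.output_path_for(path), result)
            processed += 1
        return processed


store = {"in/a.txt": b"abc", "in/b.txt": b"xyz"}
launcher = SequentialLauncherSketch(bytes.upper, DataAccessFactorySketch(store))
processed = launcher.launch(["--input_folder", "in/", "--output_folder", "out/"])
print(processed, sorted(store))
```

Because the launcher only talks to the abstract data access interface, a different storage backend or a distributed runtime could be swapped in without changing the transform itself.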

Core Framework Classes

To learn more, consider the following: