
Comparison of Python pipeline packages: Airflow, Luigi, Gokart, Metaflow, Kedro, PipelineX

This article compares open-source Python packages for pipeline/workflow development: Airflow, Luigi, Gokart, Metaflow, Kedro, PipelineX.

In this article, the terms "pipeline", "workflow", and "DAG" are used almost interchangeably.

Summary

  • 👍: good
  • 👍👍: better
| Package | Airflow | Luigi | Gokart | Metaflow | Kedro | PipelineX |
| --- | --- | --- | --- | --- | --- | --- |
| Developer, Maintainer | Airbnb, Apache | Spotify | M3 | Netflix | QuantumBlack (McKinsey) | Yusuke Minami |
| Wrapped packages | | | Luigi | | | Kedro, MLflow |
| Easiness/flexibility to define DAG | | | 👍 | 👍 | 👍 | 👍👍 |
| Modularity of DAG definition | 👍👍 | | | | 👍👍 | 👍👍 |
| Unstructured data can be passed between tasks | | 👍👍 | 👍👍 | 👍👍 | 👍👍 | 👍👍 |
| Built-in various data (file/database) existence check wrappers | | 👍👍 | 👍👍 | | 👍👍 | 👍👍 |
| Built-in various data (file/database) operation (read/write) wrappers | | | 👍 | | 👍👍 | 👍👍 |
| Modularity, reusability, testability of data operation | | | 👍 | | 👍👍 | 👍👍 |
| Automatic resuming option by detecting the intermediate data | | 👍👍 | 👍👍 | | | 👍👍 |
| Force rerun of tasks by detecting parameter change | | | 👍👍 | | | |
| Save parameters for experiments | | | 👍👍 | | | 👍👍 |
| Parallel execution | 👍 | 👍 | 👍 | 👍 | 👍 | 👍 |
| Distributed parallel execution with Celery | 👍👍 | | | | | |
| Visualization of DAG | 👍👍 | 👍 | 👍 | | 👍 | 👍 |
| Execution status monitoring in GUI | 👍👍 | 👍 | 👍 | | | |
| Scheduling, Triggering in GUI | 👍 | | | | | |
| Notification to Slack | 👍 | | 👍 | | | |

Airflow

https://github.com/apache/airflow

Released in 2015 by Airbnb.

Airflow enables you to define your DAG (workflow) of tasks in Python code (an independent Python module).

(Optionally, unofficial plugins such as dag-factory enable you to define DAGs in YAML.)
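For illustration, here is a minimal sketch of an Airflow DAG using the Airflow 1.x-era API; the DAG id, task names, and processing functions are made up for this example:

```python
# A minimal sketch of an Airflow DAG (Airflow 1.x-era imports);
# the DAG id, task names, and functions are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract(**context):
    # Returned values are passed via XCom, which is meant for small,
    # serializable metadata, not unstructured data such as images.
    return {"rows": 100}


def transform(**context):
    meta = context["ti"].xcom_pull(task_ids="extract")
    print(meta)


with DAG(
    dag_id="example_dag",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract_task = PythonOperator(
        task_id="extract", python_callable=extract, provide_context=True
    )
    transform_task = PythonOperator(
        task_id="transform", python_callable=transform, provide_context=True
    )
    extract_task >> transform_task  # DAG edge: extract runs before transform
```

Note that the `>>` operator only declares the dependency; as discussed in the cons below, any sizable data has to be exchanged through external storage.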

Pros:

  • Provides rich GUI with features including DAG visualization, execution progress monitoring, scheduling, and triggering.
  • Provides distributed computing option (using Celery).
  • DAG definition is modular; independent of processing functions.
  • Workflow can be nested using SubDagOperator.
  • Supports Slack notification.

Cons:

  • Not designed to pass data between dependent tasks without using a database. There is no good way to pass unstructured data (e.g. image, video, pickle, etc.) between dependent tasks in Airflow.
  • You need to write file access (read/write) code.
  • Does not support an automatic pipeline-resuming option using intermediate data files or databases.

Luigi

https://github.com/spotify/luigi

Released in 2012 by Spotify.

Luigi enables you to define your pipeline as child classes of Task, each implementing 3 methods (requires, output, run), in Python code.
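For illustration, a minimal sketch of such a Luigi pipeline (the file paths and processing logic are made up for this example):

```python
# A minimal sketch of a Luigi pipeline; paths and logic are illustrative.
import luigi


class Extract(luigi.Task):
    def output(self):
        # Luigi checks whether this Target exists to decide if the
        # task needs to run, which enables automatic resuming.
        return luigi.LocalTarget("data/extracted.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write("raw data")


class Transform(luigi.Task):
    def requires(self):
        return Extract()  # declares the upstream dependency

    def output(self):
        return luigi.LocalTarget("data/transformed.txt")

    def run(self):
        # Explicit read/write code is needed for each task.
        with self.input().open() as fin, self.output().open("w") as fout:
            fout.write(fin.read().upper())


if __name__ == "__main__":
    luigi.build([Transform()], local_scheduler=True)
```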

Pros:

  • Supports an automatic pipeline-resuming option using intermediate data files stored locally or in the cloud (AWS, GCP, Azure) or in databases, as defined in the Task.output method using a Target class.
  • You can write code so that any data can be passed between dependent tasks.
  • Provides a GUI with features including DAG visualization and execution progress monitoring.

Cons:

  • You need to write file/database access (read/write) code.
  • Pipeline definition, task processing (Transform of ETL), and data access (Extract&Load of ETL) are tightly coupled and not modular. You need to modify the task classes to reuse them in future projects.

Gokart

https://github.com/m3dev/gokart

Released in Dec 2018 by M3.

Gokart works on top of Luigi.

Pros:

In addition to Luigi's advantages:

  • Can separate task processing (Transform of ETL) from the pipeline definition using TaskInstanceParameter, so you can easily reuse tasks in future projects.
  • Provides built-in file access (read/write) wrappers as FileProcessor classes for the pickle, npz, gz, txt, csv, tsv, json, and xml formats.
  • Saves the parameters of each experiment to ensure reproducibility; a viewer called thunderbolt can be used to browse them.
  • Reruns tasks upon parameter change by embedding a hash string unique to the parameter set in each intermediate file name, which is useful for experimenting with various parameter sets (see the sketch after this list).
  • Provides syntactic sugar for Luigi's requires method via a class decorator.
  • Supports Slack notification.
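For illustration, a minimal sketch of a Gokart pipeline (the task classes and the parameter are made up for this example; TaskOnKart handles the file I/O that plain Luigi leaves to the user):

```python
# A minimal sketch of a Gokart pipeline; class names and the
# parameter are illustrative.
import gokart
import luigi


class ExtractWords(gokart.TaskOnKart):
    text = luigi.Parameter()

    def run(self):
        # dump() pickles the output using Gokart's built-in file wrappers.
        # The output file name includes a hash of the parameter set,
        # so changing `text` forces a rerun.
        self.dump(self.text.split())


class CountWords(gokart.TaskOnKart):
    text = luigi.Parameter()

    def requires(self):
        return ExtractWords(text=self.text)

    def run(self):
        words = self.load()  # loads the upstream task's output
        self.dump(len(words))


if __name__ == "__main__":
    luigi.build([CountWords(text="hello gokart world")], local_scheduler=True)
```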

Cons:

  • Supported data formats for file access wrappers are limited. You need to write file/database access (read/write) code to use unsupported formats.

Metaflow

https://github.com/Netflix/metaflow

Released in Dec 2019 by Netflix.

Metaflow enables you to define your pipeline as a child class of FlowSpec whose methods are decorated with step, in Python code.
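For illustration, a minimal sketch of a Metaflow flow (the flow name and attributes are made up for this example):

```python
# A minimal sketch of a Metaflow flow; the flow name and attributes
# are illustrative. Run with: python example_flow.py run
from metaflow import FlowSpec, step


class ExampleFlow(FlowSpec):

    @step
    def start(self):
        # Instance attributes are stored as artifacts and passed
        # to downstream steps automatically.
        self.numbers = [1, 2, 3]
        self.next(self.transform)

    @step
    def transform(self):
        self.doubled = [n * 2 for n in self.numbers]
        self.next(self.end)

    @step
    def end(self):
        print(self.doubled)


if __name__ == "__main__":
    ExampleFlow()
```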

Pros:

  • Integration with AWS services (especially AWS Batch).

Cons:

  • You need to write file/database access (read/write) code.
  • Pipeline definition, task processing (Transform of ETL), and data access (Extract&Load of ETL) are tightly coupled and not modular. You need to modify the task classes to reuse them in future projects.
  • Does not provide a GUI.
  • Little support for GCP and Azure.
  • Does not support an automatic pipeline-resuming option using intermediate data files or databases.

Kedro

https://github.com/quantumblacklabs/kedro

Released in May 2019 by QuantumBlack, part of McKinsey & Company.

Kedro enables you to define pipelines as lists of node objects, each taking 3 arguments (func: the task processing function; inputs: the input data name, a list or dict if multiple; outputs: the output data name, a list or dict if multiple), in Python code (an independent Python module).
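For illustration, a minimal runnable sketch using the Kedro ~0.16-era API (the function and dataset names are made up for this example; in a real project the catalog would usually live in conf/base/catalog.yml):

```python
# A minimal sketch of a Kedro pipeline (~0.16-era API);
# function and dataset names are illustrative.
import pandas as pd
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner


def clean(raw):
    return raw.dropna()


def summarize(cleaned):
    return cleaned.describe()


pipeline = Pipeline(
    [
        # node(func, inputs, outputs): the strings name datasets
        # defined in the catalog (or kept in memory if absent).
        node(clean, inputs="raw_data", outputs="cleaned_data"),
        node(summarize, inputs="cleaned_data", outputs="summary"),
    ]
)

catalog = DataCatalog(
    {"raw_data": MemoryDataSet(pd.DataFrame({"a": [1.0, None, 3.0]}))}
)

# Returns the pipeline outputs that are not registered in the catalog.
print(SequentialRunner().run(pipeline, catalog))
```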

Pros:

  • Provides built-in file/database access (read/write) wrappers as DataSet classes for CSV, Pickle, YAML, JSON, Parquet, Excel, and text formats in local or cloud storage (S3 on AWS, GCS on GCP), as well as SQL, Spark, and more.
  • Users can add support for any data format.
  • Pipeline definition, task processing (Transform of ETL), and data access (Extract&Load of ETL) are independent and modular, so you can easily reuse them in future projects.
  • Pipelines can be nested (a pipeline can be used as a sub-pipeline of another pipeline).
  • The GUI (kedro-viz) provides DAG visualization.

Cons:

  • Does not support an automatic pipeline-resuming option using intermediate data files or databases.
  • The GUI (kedro-viz) does not provide an execution progress monitoring feature.
  • Package dependencies that are unused in many cases (e.g. pyarrow) are included in requirements.txt.

PipelineX

https://github.com/Minyus/pipelinex

Released in Nov 2019 by a Kedro user (me).

PipelineX works on top of Kedro and MLflow.

PipelineX enables you to define your pipeline in YAML (an independent YAML file).

Pros:

In addition to Kedro's advantages:

  • Supports an automatic pipeline-resuming option using intermediate data files or databases.
  • Optional syntactic sugar for Kedro Pipeline (e.g. a Sequential API similar to PyTorch's torch.nn.Sequential and Keras's tf.keras.Sequential).
  • Optional syntactic sugar for the Kedro DataSet catalog (e.g. using the file name in the file path as the dataset instance name).
  • Backward-compatible with pure Kedro.
  • Integration with MLflow to save parameters, metrics, and other output artifacts such as models for each experiment.
  • Integration with common data science packages: PyTorch, Ignite, pandas, OpenCV.
  • Additional DataSet classes, including an image-set DataSet (a folder containing images) useful for computer vision applications.
  • Lean project template compared with pure Kedro.

Cons:

  • The GUI (kedro-viz) does not provide an execution progress monitoring feature.
  • Package dependencies that are unused in many cases (e.g. pyarrow) are included in Kedro's requirements.txt.
  • PipelineX is developed and maintained by an individual (me) at the moment.

Platform-specific options

Argo

https://github.com/argoproj/argo

Uses Kubernetes to run pipelines.

Kubeflow Pipelines

https://github.com/kubeflow/pipelines

Works on top of Argo.

Oozie

https://github.com/apache/oozie

Manages Hadoop jobs.

Azkaban

https://github.com/azkaban/azkaban

Manages Hadoop jobs.

GitLab CI/CD

https://docs.gitlab.com/ee/ci/

  • Runs pipelines defined in YAML.
  • Pipelines can be triggered by git push, by cron-style schedules, or manually in the GUI.
  • Supports Docker containers.


Inaccuracies

Please let me know if you find anything inaccurate.

Pull requests for https://github.com/Minyus/Python_Packages_for_Pipeline_Workflow/blob/master/README.md are welcome.