diff --git a/.github/workflows/pr.yaml b/.github/workflows/pr.yaml index 119dc1f1..27f8b63e 100644 --- a/.github/workflows/pr.yaml +++ b/.github/workflows/pr.yaml @@ -34,5 +34,5 @@ jobs: argo version - run: | - python -m poetry install --without docs,binary,perf,tutorial + python -m poetry install --without docs,binary,perf,tutorial,compare poetry run tox diff --git a/.github/workflows/release.yaml b/.github/workflows/release.yaml index 42ef27c9..c80689da 100644 --- a/.github/workflows/release.yaml +++ b/.github/workflows/release.yaml @@ -33,7 +33,7 @@ jobs: argo version - run: python -m pip install poetry - run: | - python -m poetry install --without docs,binary,perf,tutorial + python -m poetry install --without docs,binary,perf,tutorial,compare poetry run tox Release: diff --git a/docs/assets/work_dark.png b/docs/assets/work_dark.png index b5baa306..fe4a937d 100644 Binary files a/docs/assets/work_dark.png and b/docs/assets/work_dark.png differ diff --git a/docs/concepts/the-big-picture.md b/docs/concepts/the-big-picture.md deleted file mode 100644 index 998958d1..00000000 --- a/docs/concepts/the-big-picture.md +++ /dev/null @@ -1,217 +0,0 @@ -runnable revolves around the concept of pipelines or workflows and tasks that happen within them. - - -# Invert this and give examples - -- Explain task, python, notebook or shell. -- Then a pipeline stitching various tasks. -- Then flow of parameters/metrics/objects. -- File flow. -- secrets. -- parallel -- map - - ---- - -A [workflow](pipeline.md) is simply a series of steps that you want to execute for a desired outcome. - -``` mermaid -%%{ init: { 'flowchart': { 'curve': 'linear' } } }%% -flowchart LR - - step1:::green - step1([Step 1]) --> step2:::green - step2([Step 2]) --> step3:::green - step3([Step .. 
]) --> step4:::green - step4([Step n]) --> suc([success]):::green - - classDef green stroke:#0f0 - -``` - -To define a workflow, we need: - -- [List of steps](pipeline.md/#steps) -- a [starting step](pipeline.md/#start_at) -- Next step - - - [In case of success](pipeline.md/#linking) - - [In case of failure](pipeline.md/#on_failure) - -- [Terminating](pipeline.md/#terminating) - -The workflow can be defined either in ```yaml``` or using the [```python sdk```](../sdk.md). - ---- - -A step in the workflow can be: - - -=== "task" - - A step in the workflow that does a logical unit work. - - The unit of work can be a [python function](task.md/#python_functions), - a [shell script](task.md/#shell) or a - [notebook](task.md/#notebook). - - All the logs, i.e stderr and stdout or executed notebooks are stored - in [catalog](catalog.md) for easier access and debugging. - - - -=== "stub" - - An [abstract step](stub.md) that is not yet fully implemented. - - For example in python: - - ```python - def do_something(): - pass - ``` - - -=== "parallel" - - A step that has a definite number of [parallel workflows](parallel.md) executing - simultaneously. - - In the below visualisation, the green lined steps happen in sequence and wait for the previous step to - successfully complete. - - The branches lined in yellow run in parallel to each other but sequential within the branch. 
- - ```mermaid - flowchart TD - - getFeatures([Get Features]):::green - trainStep(Train Models):::green - ensembleModel([Ensemble Modelling]):::green - inference([Run Inference]):::green - success([Success]):::green - - prepareXG([Prepare for XGBoost]):::yellow - trainXG([Train XGBoost]):::yellow - successXG([XGBoost success]):::yellow - prepareXG --> trainXG --> successXG - - trainRF([Train RF model]):::yellow - successRF([RF Model success]):::yellow - trainRF --> successRF - - - getFeatures --> trainStep - trainStep --> prepareXG - trainStep --> trainRF - successXG --> ensembleModel - successRF --> ensembleModel - ensembleModel --> inference - inference --> success - - - classDef yellow stroke:#FFFF00 - classDef green stroke:#0f0 - - - ``` - - -=== "map" - - A step that executes a workflow over an [iterable parameter](map.md). - - The step "chunk files" identifies the number of files to process and computes the start index of every - batch of files to process for a chunk size of 10, the stride. 
- - "Process Chunk" pipelines are then triggered in parallel to process the chunk of files between ```start index``` - and ```start index + stride``` - - ```mermaid - flowchart TD - chunkify([Chunk files]):::green - success([Success]):::green - - subgraph one[Process Chunk] - process_chunk1([Process Chunk]):::yellow - success_chunk1([Success]):::yellow - - process_chunk1 --> success_chunk1 - end - - subgraph two[Process Chunk] - process_chunk2([Process Chunk]):::yellow - success_chunk2([Success]):::yellow - - process_chunk2 --> success_chunk2 - end - - subgraph three[Process Chunk] - process_chunk3([Process Chunk]):::yellow - success_chunk3([Success]):::yellow - - process_chunk3 --> success_chunk3 - end - - subgraph four[Process Chunk] - process_chunk4([Process Chunk]):::yellow - success_chunk4([Success]):::yellow - - process_chunk4 --> success_chunk4 - end - - subgraph five[Process Chunk] - process_chunk5([Process Chunk]):::yellow - success_chunk5([Success]):::yellow - - process_chunk5 --> success_chunk5 - end - - - - chunkify -- (stride=10, start_index=0)--> one --> success - chunkify -- (stride=10, start_index=10)--> two --> success - chunkify -- (stride=10, start_index=20)--> three --> success - chunkify -- (stride=10, start_index=30)--> four --> success - chunkify -- (stride=10, start_index=40)--> five --> success - - classDef yellow stroke:#FFFF00 - classDef green stroke:#0f0 - ``` - - - ---- - -A [step type of task](task.md) is the functional unit of the pipeline. - -To be useful, it can: - -- Access parameters - - - Either [defined statically](parameters.md/#initial_parameters) at the start of the - pipeline - - Or by [upstream steps](parameters.md/#parameters_flow) - -- [Publish or retrieve artifacts](catalog.md) from/to other steps. - -- Have [access to secrets](secrets.md). - -All the above functionality is possible naturally with no intrusion into code base. 
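The chunking scheme in the "map" example above can be sketched in plain Python. This is an illustrative sketch only — the file count and helper names are hypothetical stand-ins, not runnable's API:

```python
# Hypothetical sketch of the chunking described in the "map" example:
# "Chunk files" computes the start index of every batch for a stride
# (chunk size) of 10, and each "Process Chunk" branch handles
# files[start : start + stride].

def chunkify(num_files: int, stride: int = 10) -> list:
    """Return the start index of every chunk, e.g. [0, 10, 20, ...]."""
    return list(range(0, num_files, stride))

def process_chunk(files: list, start: int, stride: int) -> list:
    """One branch of the map state; here it just selects its slice of files."""
    return files[start : start + stride]

files = [f"file_{i}.csv" for i in range(42)]   # 42 hypothetical input files
starts = chunkify(len(files))                  # [0, 10, 20, 30, 40]
batches = [process_chunk(files, start, 10) for start in starts]
```

Every file lands in exactly one chunk; the last branch simply receives a shorter slice when the file count is not a multiple of the stride.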
- ---- - -All executions of the pipeline should be: - -- [Reproducible](run-log.md) for audit and data lineage purposes. -- Runnable in local environments for -[debugging failed runs](run-log.md/#retrying_failures). - ---- - -Executions of pipeline should be scalable and use the infrastructure at -your disposal efficiently. - -We achieve this by adding [one configuration file](../configurations/overview.md), rather than -changing the application code. diff --git a/docs/image.png b/docs/image.png new file mode 100644 index 00000000..61b12597 Binary files /dev/null and b/docs/image.png differ diff --git a/docs/index.md b/docs/index.md index d7659f90..228fd719 100644 --- a/docs/index.md +++ b/docs/index.md @@ -8,7 +8,9 @@ Runner icons created by Leremy - Flaticon ---- + + +
## Example @@ -78,8 +80,9 @@ The difference between native driver and runnable orchestration: - [x] Reproducible by default, runnable stores metadata about code/data/config for every execution. - [x] The pipeline is `runnable` in any environment. +
-## why runnable?
+## Why runnable?

Obviously, there are a lot of orchestration tools. A well maintained and curated [list is
available here](https://github.com/EthicalML/awesome-production-machine-learning/).
@@ -106,16 +109,14 @@ Broadly, they could be classed into ```native``` or ```meta``` orchestrators.
- Easy to get started and run locally.
- Ideal for quick experimentation or research activities.

-```runnable``` is a _meta_ orchestrator with simple API, geared towards data engineering, data science activities.
+```runnable``` is a _meta_ orchestrator with a simple API, geared towards data engineering and data science projects.
It works in conjunction with _native_ orchestrators and is an alternative to [kedro](https://docs.kedro.org/en/stable/index.html)
or [metaflow](https://metaflow.org/).
+```runnable``` could also function as an SDK for _native_ orchestrators, as it always compiles pipeline definitions
+to _native_ orchestrators.
-
-
-```runnable``` stands out based on these design principles.
-
- :material-clock-fast:{ .lg .middle } __Easy to adopt, its mostly your code__ @@ -126,13 +127,13 @@ or [metaflow](https://metaflow.org/). - No API's or decorators or any imposed structure. - [:octicons-arrow-right-24: Getting started](concepts/the-big-picture.md) + [:octicons-arrow-right-24: Getting started](concepts/index.md) - :building_construction:{ .lg .middle } __Bring your infrastructure__ --- - Minimal disruption to your current infrastructure patterns. + ```runnable``` is not a platform. It works with your platforms. - ```runnable``` composes pipeline definitions suited to your infrastructure. @@ -173,7 +174,13 @@ or [metaflow](https://metaflow.org/). Moving away from runnable is as simple as deleting relevant files. + - Your application code remains as it is. +
-## Comparisons/alternatives +
+
+## Comparisons
+
+--8<-- "examples/comparisons/README.md"
diff --git a/examples/comparisons/README.md b/examples/comparisons/README.md
index 2c39671d..4e466ecf 100644
--- a/examples/comparisons/README.md
+++ b/examples/comparisons/README.md
@@ -34,11 +34,17 @@ func_task = PythonTask(name="function", function=func, returns=["z"], catalog=ca
Below are the implementations in alternative frameworks. Note that these reflect the best of our understanding of the frameworks; please let us
-know if there are alternate implementations.
+know if there are better implementations.
+
+
+Along with the observations, we have implemented the [MNIST example in PyTorch](https://github.com/pytorch/examples/blob/main/mnist/main.py)
+in multiple frameworks to compare actual implementations against a popular example.
+
+
### metaflow

-The function in metaflow's step would rougly be:
+The function in metaflow's step would roughly be:

```python
from metaflow import step, conda, FlowSpec
@@ -62,8 +68,6 @@ class Flow(FlowSpec)

The differences:

-
-
##### dependency management:

```runnable``` depends on the activated virtualenv for dependencies, which is natural to python.
@@ -99,9 +103,17 @@ pipelines themselves and can run in isolation. This is not true in ```metaflow``
##### unit testing pipelines

-```runnable``` pipelines are testable using ```mocked``` executor where the executables can be mocked/patched. In ```metaflow```, it depends on how the
-python function is wrapped in the pipeline.
+```runnable``` pipelines are testable using the ```mocked``` executor, where the executables can be mocked/patched.
+In ```metaflow```, it depends on how the python function is wrapped in the pipeline.
+
+##### distributed training
+
+```metaflow``` supports distributed training.
+As of now, ```runnable``` does not support distributed training, but support is in the works.
+
+
+
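The unit-testing point above — swapping real executables for mocks — can be sketched in plain Python with `unittest.mock`. The task and pipeline below are hypothetical stand-ins that mirror the idea behind the mocked executor; they are not runnable's actual API:

```python
# Illustrative sketch only: patch out an expensive task so the wiring of a
# pipeline can be verified without executing the real work. This mirrors the
# idea behind a "mocked" executor; the names here are hypothetical.
import sys
from unittest import mock

def train_model(features):
    # Stands in for an expensive task that unit tests should not execute.
    raise RuntimeError("real training should not run in a unit test")

def pipeline(features):
    # A two-step "pipeline": train, then assemble a result.
    model = train_model(features)
    return {"model": model, "n_features": len(features)}

# In a test, the heavy step is replaced with a stub:
with mock.patch.object(sys.modules[__name__], "train_model", return_value="stub-model"):
    result = pipeline([0.1, 0.2, 0.3])
# result == {"model": "stub-model", "n_features": 3}
```

Outside the `patch` context, calling `pipeline` would raise, which is exactly what makes mocking necessary for fast, isolated tests.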
### kedro

@@ -128,17 +140,17 @@ def create_pipeline(**kwargs) -> Pipeline:
...

```

-##### Structure
+##### Footprint

-Kedro needs a structure and configuration to set up a new project and provides
-a CLI to get started.
+```kedro``` has a larger footprint in the domain code due to its configuration files. It is tightly structured and
+provides a CLI to get started.

Using ```runnable``` as part of the project requires adding a pipeline definition file
(in python or yaml) and an optional configuration file.

##### dataflow

-Kedro needs the data flowing through the pipeline via catalog.yaml which
+Kedro needs the data flowing through the pipeline via ```catalog.yaml``` which
provides a central place to understand the data.

In ```runnable```, the data is presented to the individual tasks as
@@ -147,3 +159,13 @@ requested by the ```catalog``` instruction.
##### notebooks

Kedro supports notebooks for exploration but not as tasks of the pipeline.
+
+##### dynamic pipelines
+
+```kedro``` does not support dynamic pipelines or a map state.
+
+##### distributed training
+
+```kedro``` supports distributed training via a [plugin](https://github.com/getindata/kedro-azureml).
+
+As of now, ```runnable``` does not support distributed training, but support is in the works.
diff --git a/examples/comparisons/kedro/README.md b/examples/comparisons/kedro/README.md
new file mode 100644
index 00000000..1ce80264
--- /dev/null
+++ b/examples/comparisons/kedro/README.md
@@ -0,0 +1 @@
+Please [follow this repository](https://github.com/toohsk/kedro_gradio/tree/mnist-example) for the setup.
diff --git a/examples/comparisons/kfp/README.md b/examples/comparisons/kfp/README.md
new file mode 100644
index 00000000..0ee402f0
--- /dev/null
+++ b/examples/comparisons/kfp/README.md
@@ -0,0 +1 @@
+The best implementation would be [similar to this](https://medium.com/@lorenzo.colombi/kubeflow-pipeline-v2-tutorial-end-to-end-mnist-classifier-example-dc66714c2649).
diff --git a/tox.ini b/tox.ini index e303c793..cc9b998a 100644 --- a/tox.ini +++ b/tox.ini @@ -9,11 +9,11 @@ whitelist_externals = poetry setenv = _PLOOMBER_TELEMETRY_DEBUG = false commands = - poetry install -E docker -E notebook --without docs,binary,perf,tutorial + poetry install -E docker -E notebook --without docs,binary,perf,tutorial,compare poetry run python -m pytest -m "not container" --cov=runnable/ tests/ [testenv:mypy] whitelist_externals = poetry commands = - poetry install -E docker -E notebook --without docs,binary,perf,tutorial + poetry install -E docker -E notebook --without docs,binary,perf,tutorial,compare poetry run mypy runnable