
Commit

docs: updating docs
vijayvammi committed Jun 11, 2024
1 parent 5d12a09 commit 65bff88
Showing 4 changed files with 133 additions and 20 deletions.
Binary file added docs/assets/work_dark.png
Binary file added docs/assets/work_light.png
32 changes: 31 additions & 1 deletion docs/index.md
@@ -79,11 +79,41 @@ The difference between native driver and runnable orchestration:
- [x] The pipeline is `runnable` in any environment.


## why runnable?

There are a lot of orchestration tools. A well-maintained and curated [list is
available here](https://github.com/EthicalML/awesome-production-machine-learning/).

Broadly, they can be classified as ```native``` or ```meta``` orchestrators.

<figure markdown>
![Image title](assets/work_light.png#only-light){ width="600" height="300"}
![Image title](assets/work_dark.png#only-dark){ width="600" height="300"}
</figure>


### __native orchestrators__

- Focus on resource management, job scheduling, robustness and scalability.
- Offer fewer features for domain (data engineering, data science) activities.
- Difficult to run locally.
- Not ideal for quick experimentation or research activities.

### __meta orchestrators__

- An abstraction over native orchestrators.
- Oriented towards domain (data engineering, data science) features.
- Easy to get started and run locally.
- Ideal for quick experimentation or research activities.

```runnable``` is a _meta_ orchestrator with a simple API, geared towards data engineering and data science activities.
It works in conjunction with _native_ orchestrators and is an alternative to [kedro](https://docs.kedro.org/en/stable/index.html)
or [metaflow](https://metaflow.org/).
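
For a flavour of the API, a two-step pipeline could look roughly like the sketch below. The
```PythonTask``` construct mirrors the usage shown in the comparisons; the ```Pipeline``` wiring
and the ```execute()``` call are illustrative assumptions rather than a verbatim API reference.

```python
from runnable import Pipeline, PythonTask

def prepare_data():
    ...  # placeholder: any plain python function

def train_model():
    ...  # placeholder: any plain python function

# wrap plain python functions as tasks ...
prepare = PythonTask(name="prepare_data", function=prepare_data)
train = PythonTask(name="train_model", function=train_model, terminate_with_success=True)

# ... and chain them into a pipeline that runs locally or, with a different
# configuration, on a native orchestrator.
pipeline = Pipeline(steps=[prepare, train])
pipeline.execute()
```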





```runnable``` stands out based on these design principles.

<div class="grid cards" markdown>
121 changes: 102 additions & 19 deletions examples/comparisons/README.md
@@ -1,37 +1,40 @@
In this section, we take the familiar MNIST problem and implement it in different orchestration frameworks.

The [original source code](https://github.com/pytorch/examples/blob/main/mnist/main.py) is shown in [source.py](source.py)

The individual directories contain orchestration-specific implementations.

## Notes

For the purpose of comparisons, consider the following function:

```python
import pandas as pd

def func(x: int, y: pd.DataFrame):
    # Access some data, input.csv.
    # Do something with the inputs.
    # Write a file called output.csv for downstream steps.
    # Return an output.
    z = ...  # a simple datatype or object
    return z
```

It takes

- *inputs* x (an integer) and y (a pandas dataframe or any other object),
- processes input data, with input.csv expected on the local file system,
- writes a file, output.csv, to the local filesystem,
- returns z (a simple datatype or object).

The function is wrapped in runnable as:

```python
from somewhere import func
from runnable import PythonTask, pickled, Catalog

# instruction to get input.csv from catalog at the start of the step.
# and move output.csv to the catalog at the end of the step
catalog = Catalog(get=["input.csv"], put=["output.csv"])

# Call the function, func and expect it to return "z" while moving the files
# It is expected that "x" and "y" are parameters set by some upstream step.
# If the return parameter is an object, use pickled("z")
func_task = PythonTask(name="function", function=func, returns=["z"], catalog=catalog)
```
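
For completeness, a hypothetical upstream step that produces "x" and "y" could look like the
sketch below (```make_inputs``` is made up for illustration; the mix of plain and ```pickled```
returns follows the comment above).

```python
from somewhere import make_inputs  # hypothetical: returns an int and a dataframe

# "x" is a simple datatype, "y" is an object and is therefore pickled for downstream steps.
inputs_task = PythonTask(
    name="make_inputs",
    function=make_inputs,
    returns=["x", pickled("y")],
)
```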

Below are the implementations in alternative frameworks. Note that
these reflect our best understanding of the frameworks; please let us
know if there are alternate implementations.

### metaflow

@@ -55,12 +58,92 @@ class Flow(FlowSpec)

- The APIs of ```runnable``` and ```metaflow``` are comparable.
- There is a mechanism for functions to accept/return parameters.
- Both support parallel branches and arbitrary nesting of pipelines.

The differences:



##### dependency management:

```runnable``` relies on the activated virtualenv for dependencies, which is natural to python.
Use custom docker images to provide the same environment in cloud-based executions.

```metaflow``` uses decorators (conda, pypi) to specify dependencies. This has the advantage
of abstracting the user from the docker ecosystem.
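
As a rough sketch of the ```metaflow``` style (library versions and names below are placeholders,
not recommendations):

```python
from metaflow import FlowSpec, step, conda

class TrainFlow(FlowSpec):

    # dependencies are declared per step; metaflow builds the environment.
    @conda(libraries={"pandas": "2.1.4"})
    @step
    def start(self):
        import pandas as pd  # available because of the decorator above
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()
```

Such a flow is typically run with ```python train_flow.py --environment=conda run```.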

##### dataflow:

In ```runnable```, data flows between steps via an instruction to ```glob``` files on
local disk and present them in the same structure to downstream steps.

```metaflow``` needs a code-based instruction to do so.
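
A sketch of the ```runnable``` side, reusing the ```Catalog``` instruction from the earlier example
(the glob pattern, file layout and ```generate_reports``` function are hypothetical):

```python
from runnable import Catalog, PythonTask
from somewhere import generate_reports  # hypothetical: writes several csv files under reports/

# every csv written under reports/ is put into the catalog at the end of the step
# and can be requested, with the same layout, by downstream steps.
catalog = Catalog(put=["reports/*.csv"])
report_task = PythonTask(name="generate_reports", function=generate_reports, catalog=catalog)
```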

##### notebooks:

```runnable``` allows notebooks as tasks. Notebooks can take JSON-style inputs and can return
pythonic objects for downstream steps.

```metaflow``` does not support notebooks as tasks.
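
A sketch of a notebook task (the ```NotebookTask``` parameters and the notebook path are
illustrative assumptions):

```python
from runnable import NotebookTask

# the notebook is executed with injected, JSON-style parameters; named outputs
# such as "accuracy" are collected for downstream steps.
train_notebook = NotebookTask(
    name="train",
    notebook="notebooks/train.ipynb",
    returns=["accuracy"],
)
```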

##### infrastructure:

```runnable```, in many ways, is just a transpiler to your chosen infrastructure.

```metaflow``` is a platform with its own specified infrastructure.

##### modular pipelines

In ```runnable```, the individual branches of parallel and map states are
pipelines themselves and can run in isolation. This is not true in ```metaflow```.
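
A sketch of the idea (the ```Parallel``` construct and its ```branches``` argument are illustrative
assumptions, and the training functions are made up):

```python
from runnable import Parallel, Pipeline, PythonTask
from somewhere import train_cnn, train_rf  # hypothetical training functions

# each branch is a full pipeline and can be executed on its own ...
cnn_branch = Pipeline(steps=[PythonTask(name="train_cnn", function=train_cnn)])
rf_branch = Pipeline(steps=[PythonTask(name="train_rf", function=train_rf)])

# ... or embedded as branches of a parallel step in a larger pipeline.
train_models = Parallel(name="train_models", branches={"cnn": cnn_branch, "rf": rf_branch})
```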

##### unit testing pipelines

```runnable``` pipelines are testable using the ```mocked``` executor, where the executables can be mocked or patched. In ```metaflow```, testability depends on how the
python function is wrapped in the pipeline.
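
A sketch of what such a test could look like (the configuration file, its contents and the
```configuration_file``` argument are illustrative assumptions):

```python
from my_project.pipeline import build_pipeline  # hypothetical module that assembles the Pipeline

def test_pipeline_wiring():
    # the same pipeline definition is executed against a "mocked" executor
    # configuration, so the wiring is exercised without running the real tasks.
    pipeline = build_pipeline()
    pipeline.execute(configuration_file="configs/mocked.yaml")
```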


### kedro

The function in the ```kedro``` implementation would roughly be:

Note that any movement of files should happen via the data catalog.

```python
from kedro.pipeline import Pipeline, node, pipeline
from somewhere import func

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=func,
                inputs=["params:x", "y"],
                outputs=["z"],
                name="my_function",
            ),
            ...
        ]
    )
```

##### structure

Kedro needs a structure and configuration to set up a new project and provides
a CLI to get started.

Using ```runnable``` as part of a project requires
adding a pipeline definition file (in python or yaml) and an optional configuration file.

##### dataflow

Kedro requires data to flow through the pipeline via catalog.yaml, which
provides a central place to understand the data.

In ```runnable```, the data is presented to the individual tasks as
requested by the ```catalog``` instruction.

##### notebooks

Kedro supports notebooks for exploration but not as tasks of the pipeline.
