
Commit

docs: updating docs
vijayvammi committed Jun 11, 2024
1 parent 5d12a09 commit 65bff88
Showing 4 changed files with 133 additions and 20 deletions.
Binary file added docs/assets/work_dark.png
Binary file added docs/assets/work_light.png
32 changes: 31 additions & 1 deletion docs/index.md
@@ -79,11 +79,41 @@ The difference between native driver and runnable orchestration:
- [x] The pipeline is `runnable` in any environment.


## why runnable?

There are a lot of orchestration tools. A well-maintained and curated [list is
available here](https://github.com/EthicalML/awesome-production-machine-learning/).

Broadly, they can be classified as ```native``` or ```meta``` orchestrators.

<figure markdown>
![Image title](assets/work_light.png#only-light){ width="600" height="300"}
![Image title](assets/work_dark.png#only-dark){ width="600" height="300"}
</figure>


### __native orchestrators__

- Focus on resource management, job scheduling, robustness and scalability.
- Offer fewer features for domain (data engineering, data science) activities.
- Difficult to run locally.
- Not ideal for quick experimentation or research activities.

### __meta orchestrators__

- An abstraction over native orchestrators.
- Oriented towards domain (data engineering, data science) features.
- Easy to get started and run locally.
- Ideal for quick experimentation or research activities.

```runnable``` is a _meta_ orchestrator with a simple API, geared towards data engineering and data science activities.
It works in conjunction with _native_ orchestrators and is an alternative to [kedro](https://docs.kedro.org/en/stable/index.html)
or [metaflow](https://metaflow.org/).
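
For a flavour of the API, a two-step pipeline could look roughly like the sketch below. The
```PythonTask``` construct mirrors the usage shown in the comparisons; the ```Pipeline``` wiring
and the ```execute()``` call are illustrative assumptions rather than a verbatim API reference.

```python
from runnable import Pipeline, PythonTask

def prepare_data():
    ...  # placeholder: any plain python function

def train_model():
    ...  # placeholder: any plain python function

# wrap plain python functions as tasks ...
prepare = PythonTask(name="prepare_data", function=prepare_data)
train = PythonTask(name="train_model", function=train_model, terminate_with_success=True)

# ... and chain them into a pipeline that runs locally or, with a different
# configuration, on a native orchestrator.
pipeline = Pipeline(steps=[prepare, train])
pipeline.execute()
```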





```runnable``` stands out based on these design principles.

<div class="grid cards" markdown>
121 changes: 102 additions & 19 deletions examples/comparisons/README.md
@@ -1,37 +1,40 @@
In this section, we take the familiar MNIST problem and implement it in different orchestration frameworks.

The [original source code](https://github.com/pytorch/examples/blob/main/mnist/main.py) is shown in [source.py](source.py)

The individual directories contain orchestration-specific implementations.

## Notes

For the purpose of comparisons, consider the following function:

```python
import pandas as pd

def func(x: int, y: pd.DataFrame):
    # Access some data, input.csv.
    # Do something with the inputs.
    # Write a file called output.csv for downstream steps.
    # Return an output.
    z = ...  # a simple datatype or object
    return z
```

It takes

- *inputs* x (an integer) and y (a pandas dataframe or any other object),
- processes input data, with input.csv expected on the local file system,
- writes a file, output.csv, to the local filesystem,
- returns z (a simple datatype or object).

The function is wrapped in runnable as:

```python
from somewhere import func
from runnable import PythonTask, pickled, Catalog

# instruction to get input.csv from catalog at the start of the step.
# and move output.csv to the catalog at the end of the step
catalog = Catalog(get=["input.csv"], put=["output.csv"])

# Call the function, func and expect it to return "z" while moving the files
# It is expected that "x" and "y" are parameters set by some upstream step.
# If the return parameter is an object, use pickled("z")
func_task = PythonTask(name="function", function=func, returns=["z"], catalog=catalog)
```
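
For completeness, a hypothetical upstream step that produces "x" and "y" could look like the
sketch below (```make_inputs``` is made up for illustration; the mix of plain and ```pickled```
returns follows the comment above).

```python
from somewhere import make_inputs  # hypothetical: returns an int and a dataframe

# "x" is a simple datatype, "y" is an object and is therefore pickled for downstream steps.
inputs_task = PythonTask(
    name="make_inputs",
    function=make_inputs,
    returns=["x", pickled("y")],
)
```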

Below are the implementations in alternative frameworks. Note that
these reflect our best understanding of the frameworks; please let us
know if there are alternate implementations.

### metaflow

@@ -55,12 +58,92 @@ class Flow(FlowSpec)

- The APIs of ```runnable``` and ```metaflow``` are comparable.
- There is a mechanism for functions to accept/return parameters.
- Both support parallel branches and arbitrary nesting of pipelines.

The differences:



##### dependency management:

```runnable``` relies on the activated virtualenv for dependencies, which is natural to python.
Use custom docker images to provide the same environment in cloud-based executions.

```metaflow``` uses decorators (conda, pypi) to specify dependencies. This has the advantage
of abstracting the user from the docker ecosystem.
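
As a rough sketch of the ```metaflow``` style (library versions and names below are placeholders,
not recommendations):

```python
from metaflow import FlowSpec, step, conda

class TrainFlow(FlowSpec):

    # dependencies are declared per step; metaflow builds the environment.
    @conda(libraries={"pandas": "2.1.4"})
    @step
    def start(self):
        import pandas as pd  # available because of the decorator above
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainFlow()
```

Such a flow is typically run with ```python train_flow.py --environment=conda run```.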

##### dataflow:

In ```runnable```, data flows between steps via an instruction to ```glob``` files on
local disk and present them in the same structure to downstream steps.

```metaflow``` needs a code-based instruction to do so.
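
A sketch of the ```runnable``` side, reusing the ```Catalog``` instruction from the earlier example
(the glob pattern, file layout and ```generate_reports``` function are hypothetical):

```python
from runnable import Catalog, PythonTask
from somewhere import generate_reports  # hypothetical: writes several csv files under reports/

# every csv written under reports/ is put into the catalog at the end of the step
# and can be requested, with the same layout, by downstream steps.
catalog = Catalog(put=["reports/*.csv"])
report_task = PythonTask(name="generate_reports", function=generate_reports, catalog=catalog)
```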

##### notebooks:

```runnable``` allows notebooks as tasks. Notebooks can take JSON-style inputs and can return
pythonic objects for downstream steps.

```metaflow``` does not support notebooks as tasks.
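
A sketch of a notebook task (the ```NotebookTask``` parameters and the notebook path are
illustrative assumptions):

```python
from runnable import NotebookTask

# the notebook is executed with injected, JSON-style parameters; named outputs
# such as "accuracy" are collected for downstream steps.
train_notebook = NotebookTask(
    name="train",
    notebook="notebooks/train.ipynb",
    returns=["accuracy"],
)
```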

##### infrastructure:

```runnable```, in many ways, is just a transpiler to your chosen infrastructure.

```metaflow``` is a platform with its own specified infrastructure.

##### modular pipelines

In ```runnable```, the individual branches of parallel and map states are
pipelines themselves and can run in isolation. This is not true in ```metaflow```.
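
A sketch of the idea (the ```Parallel``` construct and its ```branches``` argument are illustrative
assumptions, and the training functions are made up):

```python
from runnable import Parallel, Pipeline, PythonTask
from somewhere import train_cnn, train_rf  # hypothetical training functions

# each branch is a full pipeline and can be executed on its own ...
cnn_branch = Pipeline(steps=[PythonTask(name="train_cnn", function=train_cnn)])
rf_branch = Pipeline(steps=[PythonTask(name="train_rf", function=train_rf)])

# ... or embedded as branches of a parallel step in a larger pipeline.
train_models = Parallel(name="train_models", branches={"cnn": cnn_branch, "rf": rf_branch})
```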

##### unit testing pipelines

```runnable``` pipelines are testable using the ```mocked``` executor, where the executables can be mocked or patched. In ```metaflow```, testability depends on how the
python function is wrapped in the pipeline.
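
A sketch of what such a test could look like (the configuration file, its contents and the
```configuration_file``` argument are illustrative assumptions):

```python
from my_project.pipeline import build_pipeline  # hypothetical module that assembles the Pipeline

def test_pipeline_wiring():
    # the same pipeline definition is executed against a "mocked" executor
    # configuration, so the wiring is exercised without running the real tasks.
    pipeline = build_pipeline()
    pipeline.execute(configuration_file="configs/mocked.yaml")
```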


### kedro

The function in the ```kedro``` implementation would roughly be:

Note that any movement of files should happen via the data catalog.

```python
from kedro.pipeline import Pipeline, node, pipeline
from somewhere import func

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(
                func=func,
                inputs=["params:x", "y"],
                outputs=["z"],
                name="my_function",
            ),
            ...
        ]
    )
```

##### structure

Kedro needs a structure and configuration to set up a new project and provides
a CLI to get started.

Using ```runnable``` as part of a project requires
adding a pipeline definition file (in python or yaml) and an optional configuration file.

##### dataflow

Kedro requires data to flow through the pipeline via catalog.yaml, which
provides a central place to understand the data.

In ```runnable```, the data is presented to the individual tasks as
requested by the ```catalog``` instruction.

##### notebooks

Kedro supports notebooks for exploration but not as tasks of the pipeline.
