Update documentation dataset first interface (#921)
Updates the pipeline references in the documentation.
mrchtr authored Apr 5, 2024
1 parent dc7c970 commit a1aa2fe
Showing 18 changed files with 241 additions and 241 deletions.
24 changes: 12 additions & 12 deletions docs/architecture.md
@@ -2,7 +2,7 @@

### Fondant architecture overview

-![data explorer](art/architecture.png)
+![fondant architecture](art/architecture.png)

At a high level, Fondant consists of three main parts:

@@ -13,7 +13,7 @@ At a high level, Fondant consists of three main parts:
specifications mainly include the component image location, arguments, columns it consumes and
produces.
* `manifest.py` Describes dataset content, facilitating reference passing between components.
-It evolves during pipeline execution and aids static evaluation.
+It evolves during dataset materialization and aids static evaluation.
* `schema.py` Defines the Type class, used for dataset data type definition.
* `/schema` Directory Containing JSON schema specifications for the component spec and manifest.

@@ -37,25 +37,25 @@
component type.


-* The `/dataset` directory which contains the modules for implementing a Fondant pipeline.
-* `dataset.py`: Defines the `Dataset` class which is used to define the graph. The
-implemented class is then consumed by the compiler to compile to a specific runner.
-This module also implements the
-`ComponentOp` class which is used to define the component operation in the pipeline graph.
+* The `/dataset` directory which contains the modules for implementing a Fondant dataset.
+* `dataset.py`: Defines the `Dataset` class which is used to define the workflow graph to
+materialize the dataset. The implemented class is then consumed by the compiler to compile
+to a specific workflow runner.
+This module also implements the `ComponentOp` class which is used to define the component
+operation in the workflow graph.
* `compiler.py`: Defines the `Compiler` class which is used to define the compiler that
-compilers the pipeline graph for a specific
-runner.
+compiles the workflow graph for a specific runner.
* `runner.py`: Defines the `Runner` class which is used to define the runner that executes the
-compiled pipeline graph.
+compiled workflow graph.

### Additional modules

Additional modules in Fondant include:

* `cli.py`: Defines the CLI for interacting with Fondant. This includes the `fondant` command line
tool which is used to build components,
-compile and run pipelines and explore datasets.
+compile and run workflows to materialize and explore datasets.
* `explore.py`: Runs the explorer which is a web application that allows the user to explore the
content of a dataset.
* `build.py`: Defines the `build` command which is used to build and publish a component.
-* `testing.py`: Contains common testing utilities for testing components and pipelines.
+* `testing.py`: Contains common testing utilities for testing components and datasets.
Binary file modified docs/art/architecture.png
27 changes: 13 additions & 14 deletions docs/caching.md
@@ -1,20 +1,20 @@
## What is caching?

-Fondant supports caching of pipeline executions. If a certain component and its arguments
+Fondant supports caching of workflow executions. If a certain component and its arguments
are exactly the same as in some previous execution, then its execution can be skipped and the output
dataset of the previous execution can be used instead.

Caching offers the following benefits:
1) **Reduced costs.** Skipping the execution of certain components can help avoid unnecessary costly computations.
-2) **Faster pipeline runs.** Skipping the execution of certain components results in faster pipeline runs.
-3) **Faster pipeline development.** Caching allows you develop and test your pipeline faster.
-4) **Reproducibility.** Caching allows you to reproduce the results of a pipeline run by reusing
-the outputs of a previous pipeline run.
+2) **Faster workflow runs.** Skipping the execution of certain components results in faster workflow execution.
+3) **Faster dataset development.** Caching allows you to develop and test your datasets faster.
+4) **Reproducibility.** Caching allows you to reproduce the results of a run by reusing
+the outputs of a previous run.

!!! note "IMPORTANT"

-The cached runs are tied to the base path which stores the caching key of previous component runs.
-Changing the base path will invalidate the cache of previous executed pipelines.
+The cached runs are tied to the working directory which stores the caching key of previous component runs.
+Changing the working directory will invalidate the cache of previously materialized datasets.

The caching feature is **enabled** by default.

@@ -23,7 +23,7 @@ The caching feature is **enabled** by default.
You can turn off execution caching at component level by setting the following:

```python
-from fondant.pipeline.pipeline import ComponentOp
+from fondant.dataset.dataset import ComponentOp

caption_images_op = ComponentOp(
component_dir="...",
@@ -35,12 +35,12 @@ caption_images_op = ComponentOp(
```
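
The snippet above is truncated in this view. A complete sketch of disabling the cache for a single component might look as follows; the `cache` flag name and the example arguments are assumptions here, not confirmed by the diff:

```python
from fondant.dataset.dataset import ComponentOp

caption_images_op = ComponentOp(
    component_dir="components/caption_images",  # illustrative component directory
    arguments={"batch_size": 8},                # illustrative arguments
    cache=False,                                # assumed flag name for skipping the cache
)
```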

## How caching works
-When Fondant runs a pipeline, it checks to see whether an execution exists in the base path based on
+When Fondant materializes a dataset, it checks to see whether an execution exists in the working directory based on
the cache key of each component.

The cache key is defined as the combination of the following:

-1) The **pipeline step's inputs.** These inputs include the input arguments' value (if any).
+1) The **operation step's inputs.** These inputs include the input arguments' value (if any).

2) **The component's specification.** This specification includes the image tag and the fields
consumed and produced by each component.
@@ -51,11 +51,10 @@ The cache key is defined as the combination of the following:
If there is a matching execution in the base path (checked based on the output manifests),
the outputs of that execution are used and the step computation is skipped.

-Additionally, only the pipelines with the same pipeline name will share the cache. Caching for
+Additionally, only datasets with the same dataset name will share the cache. Caching for
components
with the `latest` image tag is disabled by default. This is because using `latest` image tags can
lead to unpredictable behavior due to
-image updates. Moreover, if one component in the pipeline is not cached then caching will be
-disabled for all
-subsequent components.
+image updates. Moreover, if one component in the dataset is not cached then caching will be
+disabled for all subsequent components.
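
For intuition, a cache key along these lines could be computed roughly as sketched below. This is a minimal illustration only; Fondant's actual hashing scheme, field names, and storage layout may differ:

```python
import hashlib
import json


def cache_key(dataset_name: str, arguments: dict, component_spec: dict) -> str:
    """Combine the inputs that determine whether a component execution can be reused."""
    payload = json.dumps(
        {
            "dataset": dataset_name,           # only runs of the same dataset share the cache
            "arguments": arguments,            # the operation's input argument values
            "component_spec": component_spec,  # includes the image tag and consumed/produced fields
        },
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```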

8 changes: 4 additions & 4 deletions docs/components/component_spec.md
@@ -165,7 +165,7 @@ in the component specification, so we will need to specify the schema of the
fields when defining the components

```python
-dataset = Dataset.read(
+dataset = Dataset.create(
"load_from_csv",
arguments={
"dataset_uri": "path/to/dataset.csv",
@@ -196,7 +196,7 @@ by the next component. We can either load the `image` field:

```python
-dataset = Dataset.read(
+dataset = Dataset.create(
"load_from_csv",
arguments={
"dataset_uri": "path/to/dataset.csv",
@@ -219,7 +219,7 @@ or the `embedding` field:

```python
-dataset = Dataset.read(
+dataset = Dataset.create(
"load_from_csv",
arguments={
"dataset_uri": "path/to/dataset.csv",
@@ -268,7 +268,7 @@ These arguments are passed in when the component is instantiated.
If an argument is not explicitly provided, the default value will be used instead if available.

```python
-dataset = pipeline.read(
+dataset = Dataset.read(
"custom_component",
arguments={
"custom_argument": "foo"
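    },
)  # closing lines assumed; the original snippet is cut off in this diff view
```

Since the snippets in this file are all truncated in this view, here is a consolidated sketch of reading data while declaring the schema of generic fields. The component name, arguments, and the exact `Dataset.create` signature are illustrative and may differ between Fondant versions:

```python
import pyarrow as pa
from fondant.dataset import Dataset

dataset = Dataset.create(
    "load_from_csv",
    arguments={
        "dataset_uri": "path/to/dataset.csv",  # illustrative path
    },
    produces={
        "image": pa.binary(),                  # declare the schema of the produced fields
        "embedding": pa.list_(pa.float32()),
    },
)
```
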
22 changes: 11 additions & 11 deletions docs/components/components.md
@@ -2,7 +2,7 @@ from distributed import Client

# Components

-Fondant makes it easy to build data preparation pipelines leveraging reusable components. Fondant
+Fondant makes it easy to build datasets collaboratively by leveraging reusable components. Fondant
provides a lot of components out of the box
([overview](https://fondant.ai/en/latest/components/hub/)), but you can also define your
own custom components.
@@ -20,9 +20,9 @@ The logic should be implemented as a class, inheriting from one of the base `Com
offered by Fondant.
There are three main types of components:

-- **`LoadComponent`**: Load data into your pipeline from an external data source
-- **`TransformComponent`**: Implement a single transformation step in your pipeline
-- **`WriteComponent`**: Write the results of your pipeline to an external data sink
+- **`LoadComponent`**: Load data and initialise a dataset from an external data source
+- **`TransformComponent`**: Implement a single step that transforms the data in your dataset
+- **`WriteComponent`**: Write your dataset to an external data sink
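
For illustration, a minimal load component along these lines could look like the sketch below. The class name and column are made up; the `DaskLoadComponent` base class and its `load` method mirror the lightweight component example further down, but the exact method signature should be checked against the Fondant version you use:

```python
import dask.dataframe as dd
import pandas as pd

from fondant.component import DaskLoadComponent


class LoadGreetings(DaskLoadComponent):
    """Toy load component that initialises a dataset with a single text column."""

    def load(self) -> dd.DataFrame:
        df = pd.DataFrame({"text": ["hello", "world"]})
        return dd.from_pandas(df, npartitions=1)
```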

The easiest way to implement a `TransformComponent` is to subclass the provided
`PandasTransformComponent`. This component streams your data and offers it in memory-sized
@@ -124,7 +124,7 @@ implements the logic of your component.

```python
from fondant.component import PandasTransformComponent
-from fondant.pipeline import lightweight_component
+from fondant.dataset import lightweight_component
import pandas as pd
import pyarrow as pa

@@ -138,10 +138,10 @@ class AddNumber(PandasTransformComponent):
return dataframe
```

-You can add a custom component to your pipeline by passing in the reference to the component class containing
+You can apply a custom component to your dataset by passing in the reference to the component class containing
your script.

-```python title="pipeline.py"
+```python title="dataset.py"
_ = dataset.apply(
ref=AddNumber,
produces={"x": pa.int32()},
@@ -167,7 +167,7 @@ A typical file structure for a custom component looks like this:
| |- Dockerfile
| |- fondant_component.yaml
| |- requirements.txt
-|- pipeline.py
+|- dataset.py
```

The `Dockerfile` is used to build the code into a docker image, which is then referred to in the
@@ -179,10 +179,10 @@ description: This is a custom component
image: custom_component:latest
```
-You can add a custom component to your pipeline by passing in the path to the directory containing
+You can apply a custom component to your dataset by passing in the path to the directory containing
your `fondant_component.yaml`.

-```python title="pipeline.py"
+```python title="dataset.py"
dataset = dataset.apply(
component_dir="components/custom_component",
@@ -198,7 +198,7 @@ See our [best practices on creating a containerized component](../components/con
### Reusable components

Reusable components are out of the box containerized components from the Fondant Hub that you can easily add
-to your pipeline:
+to your dataset:

```python
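# Illustrative only: the original snippet is cut off in this view. Applying a
# reusable component from the Hub generally looks like this; the component name
# and arguments below are examples, not necessarily the ones in the original doc.
dataset = dataset.apply(
    "caption_images",             # name of a reusable component on the Fondant Hub
    arguments={"batch_size": 8},  # illustrative argument
)
```
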
4 changes: 2 additions & 2 deletions docs/components/containerized_components.md
@@ -1,6 +1,6 @@
# Creating containerized components

-Fondant makes it easy to build data preparation pipelines leveraging reusable components. Fondant
+Fondant makes it easy to build datasets collaboratively by leveraging reusable components. Fondant
provides a lot
of [components out of the box](https://fondant.ai/en/latest/components/hub/), but you can also
define your own containerized components.
@@ -79,6 +79,6 @@ transformers==4.29.2
```

Refer to this [section](publishing_components.md) to find out how to build and publish your components to use them in
-your own pipelines.
+your own dataset workflows.


41 changes: 18 additions & 23 deletions docs/components/lightweight_components.md
@@ -1,17 +1,17 @@
# Creating lightweight components

-Lightweight components are a great way to implement custom data processing steps in your pipeline.
-They are easy to implement and can be reused across different pipelines. If you want to
+Lightweight components are a great way to implement custom data processing steps in your dataset workflows.
+They are easy to implement and can be reused across different datasets. If you want to
build more complex components that require additional dependencies (e.g. GPU support), you can
also build a containerized component. See the [containerized component guide](../components/containerized_components.md) for more info.

To implement a lightweight component, you simply need to create a python script that implements
-the component logic. Here is an example of a pipeline composed of two custom components,
+the component logic. Here is an example of a dataset composed of two custom components,
one that creates a dataset and one that adds a number to a column of the dataset:

-```python title="pipeline.py"
+```python title="dataset.py"
from fondant.component import DaskLoadComponent, PandasTransformComponent
-from fondant.pipeline import lightweight_component
+from fondant.dataset import lightweight_component
import dask.dataframe as dd
import pandas as pd
import pyarrow as pa
@@ -42,31 +42,26 @@ Notice that we use the `@lightweight_component` decorator to define our componen
is used to package the component into a containerized component and can also be used to
define additional functionalities.

-To register those components to a pipeline, we can use the `read` and `apply` method for the
+To register those components to a dataset, we can use the `create` and `apply` methods for the
first and second component respectively:

-```python title="pipeline.py"
-from fondant.pipeline import Pipeline
+```python title="dataset.py"
+from fondant.dataset import Dataset

-pipeline = Pipeline(
-    name="dummy-pipeline",
-    base_path="./data",
-)

-dataset = Dataset.read(
+dataset = Dataset.create(
    ref=CreateData,
+    dataset_name="dummy-pipeline",
)

_ = dataset.apply(
ref=AddNumber,
arguments={"n": 1},
)
```

-Here we are creating a pipeline that reads data from the `CreateData` component and then applies
+Here we are creating a dataset workflow that reads data from the `CreateData` component and then applies
the `AddNumber` component to it. The `produces` argument is used to define the schema of the output
of the component. This is used to validate the output of the component and to define the schema
-of the next component in the pipeline.
+of the next component in the dataset.
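
The class definitions are truncated in this diff view. For reference, a consolidated sketch of what the two lightweight components could look like is shown below; the column name, the `n` argument, and the decorator usage follow the fragments above, but the details are illustrative rather than the exact committed code:

```python
import dask.dataframe as dd
import pandas as pd

from fondant.component import DaskLoadComponent, PandasTransformComponent
from fondant.dataset import lightweight_component


@lightweight_component()
class CreateData(DaskLoadComponent):
    """Toy load component that initialises a dataset with a single integer column."""

    def load(self) -> dd.DataFrame:
        df = pd.DataFrame({"x": [1, 2, 3]})
        return dd.from_pandas(df, npartitions=1)


@lightweight_component()
class AddNumber(PandasTransformComponent):
    """Adds a constant `n` to the `x` column."""

    def __init__(self, n: int, **kwargs):
        self.n = n

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe["x"] = dataframe["x"] + self.n
        return dataframe
```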

Behind the scenes, Fondant will automatically package the component into a containerized component that
uses a base image with the currently installed Fondant and Python version.
@@ -77,15 +72,15 @@ If you want to install additional requirements for your component, you can do so
package to the `extra_requires` argument of the `@lightweight_component` decorator. This will
install the package in the containerized component.

-```python title="pipeline.py"
+```python title="dataset.py"
@lightweight_component(extra_requires=["numpy"])
```

Under the hood, we are injecting the source into a Docker container. If you want to use additional
dependencies, you have to make sure to import the libraries inside a function directly.

For example:
-```python title="pipeline.py"
+```python title="dataset.py"
...
def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
import numpy as np
@@ -100,7 +95,7 @@ If you want to change the base image of the containerized component, you can do
image instead of the default one. Make sure you install Fondant in the base image or list it
in the `extra_requires` argument.

-```python title="pipeline.py"
+```python title="dataset.py"
@lightweight_component(base_image="python:3.10-slim")
```

@@ -111,7 +106,7 @@ of the decorator.
If we take the previous example, we can restrict the columns that are loaded by the `AddNumber` component
by specifying the `x` column in the `consumes` argument:

-```python title="pipeline.py"
+```python title="dataset.py"
@lightweight_component(
consumes={
"x": pa.int32()
@@ -136,7 +131,7 @@ it to containerized component. See the [containerized component guide](../compon

You can also choose to load in dynamic fields by setting the `additionalProperties` argument to `True` in the `consumes` argument.

-This will allow you to define an arbitrary number of columns to be loaded when applying your component to the pipeline.
+This will allow you to define an arbitrary number of columns to be loaded when applying your component to the dataset.

This can be useful in scenarios where we want to dynamically load in fields from a dataset. For example, if we want to aggregate results
from multiple columns, we can define a component that loads in specific columns from the previous component and then aggregates them.
@@ -147,7 +142,7 @@ the `x` and `z` columns into a new column `score`:
```python
import dask.dataframe as dd
from fondant.component import PandasTransformComponent
-from fondant.pipeline import lightweight_component
+from fondant.dataset import lightweight_component

@lightweight_component(
consumes={
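    # Illustrative continuation: the rest of this snippet is cut off in the diff
    # view above. It presumably enables dynamic fields and sums the two columns;
    # `pandas as pd` is assumed to be imported in the truncated part.
    "additionalProperties": True
    },
)
class AggregateFields(PandasTransformComponent):
    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        dataframe["score"] = dataframe["x"] + dataframe["z"]
        return dataframe
```

When applying such a component, the specific columns to load (here `x` and `z`) would presumably be passed through the `consumes` mapping of `dataset.apply`, since `additionalProperties` only declares that the set of consumed columns is open-ended.
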
2 changes: 1 addition & 1 deletion docs/components/publishing_components.md
@@ -31,7 +31,7 @@ component is located.

The tag argument is used to specify the Docker container tag. When specified, the tag in the
referenced component specification yaml will also be
-updated, ensuring that the next pipeline run correctly references the image.
+updated, ensuring that the next dataset workflow run correctly references the image.


!!! note "IMPORTANT"

