diff --git a/README.md b/README.md
index 303701546..8349ab6f3 100644
--- a/README.md
+++ b/README.md
@@ -48,6 +48,10 @@ And create high quality datasets to fine-tune your own foundation models.
+## 💨 Getting Started
+
+Eager to get started? Here's a [step-by-step guide](https://fondant.readthedocs.io/en/latest/getting-started) to get your first pipeline up and running.
+
 ## 🪄 Example pipelines
 
 Curious to see what Fondant can do? Have a look at our example pipelines:
@@ -90,6 +94,7 @@ point to create datasets for training code assistants.
 
+
 ## 🧩 Reusable components
 
 Fondant comes with a library of reusable components, which can jumpstart your pipeline.
diff --git a/components/image_resolution_extraction/fondant_component.yaml b/components/image_resolution_extraction/fondant_component.yaml
index 3ded47331..cc917bad2 100644
--- a/components/image_resolution_extraction/fondant_component.yaml
+++ b/components/image_resolution_extraction/fondant_component.yaml
@@ -11,9 +11,9 @@ consumes:
 produces:
   images:
     fields:
+      data:
+        type: binary
       width:
         type: int16
       height:
-        type: int16
-      data:
-        type: binary
\ No newline at end of file
+        type: int16
\ No newline at end of file
diff --git a/components/image_resolution_extraction/src/main.py b/components/image_resolution_extraction/src/main.py
index 8b0f03314..e1be245c2 100644
--- a/components/image_resolution_extraction/src/main.py
+++ b/components/image_resolution_extraction/src/main.py
@@ -39,8 +39,9 @@ def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
         """
         logger.info("Filtering dataset...")
 
-        dataframe[[("images", "width"), ("images", "height")]] = \
-            dataframe[[("images", "data")]].map(extract_dimensions)
+        dataframe[[("images", "width"), ("images", "height")]] = dataframe[
+            [("images", "data")]
+        ].apply(lambda x: extract_dimensions(x.images.data), axis=1)
 
         return dataframe
diff --git a/docs/getting_started.md b/docs/getting_started.md
new file mode 100644
index 000000000..5ada9c2b2
--- /dev/null
+++ b/docs/getting_started.md
@@ -0,0 +1,310 @@
+
+
+## Setting up the environment
+
+### Installing Fondant
+
+We suggest that you use a [virtual environment](https://docs.python.org/3/library/venv.html) for your project. Fondant supports Python >=3.8.
+
+Fondant can be installed using pip:
+
+```bash
+pip install fondant[pipelines]
+```
+
+You can validate the installation of Fondant by running its root CLI command:
+
+```bash
+fondant --help
+```
+
+## Your first pipeline
+
+Create a `pipeline.py` file in the root of your project and add the following code:
+
+```Python
+from fondant.pipeline import Pipeline, ComponentOp
+
+my_pipeline = Pipeline(
+    pipeline_name='my_pipeline',
+    base_path='/home/username/my_pipeline',  # TODO: update this
+    pipeline_description='This is my pipeline',
+)
+```
+
+This is all you need to initialize a Fondant pipeline:
+
+- **pipeline_name**: A name to reference your pipeline.
+- **base_path**: The base path that Fondant should use to store artifacts and data. This base path can be a local path or a cloud storage path (e.g. s3://my_bucket/artifacts or gs://my_bucket/artifacts).
+- **pipeline_description**: A description of your pipeline.
+
+## Adding components
+
+Now that we have a pipeline, we can add components to it. Components are the building blocks of your pipeline; they are the individual steps that will be executed. There are two main types of components:
+
+- **reusable components**: These are components that have already been created by the community and can easily be used in your pipeline. You can find a list of reusable components [here](https://github.com/ml6team/fondant/tree/main/components). They often have arguments that you can set to configure them for your use case.
+
+- **custom components**: These are the components you create to solve your own use case. A custom component can be created by adding a `fondant_component.yaml`, a `Dockerfile`, and a `main.py` file to a component subdirectory. The `fondant_component.yaml` file contains the specification of your component. You can find more information about it [here](https://github.com/ml6team/fondant/blob/main/docs/component_spec.md).
+
+Let's add a reusable component to our pipeline. We will use the `load_from_hf_hub` component to read data from the Hugging Face Hub. Add the following code to your `pipeline.py` file:
+
+```Python
+load_from_hf_hub = ComponentOp.from_registry(
+    name='load_from_hf_hub',
+    component_spec_path='components/load_from_hf_hub/fondant_component.yml',
+    arguments={
+        'dataset_name': 'huggan/pokemon',
+        'n_rows_to_load': 100,
+        'column_name_mapping': {
+            'image': 'images_data',
+        },
+        'image_column_names': ['image'],
+    }
+)
+
+my_pipeline.add_op(load_from_hf_hub, dependencies=[])
+```
+
+Two things are happening here:
+
+1. We create a `ComponentOp` from the registry. This is a reusable component, so we pass it the arguments needed to configure it:
+
+   - **dataset_name**: The name of the dataset on the Hugging Face Hub; here we load a [dataset with Pokémon images](https://huggingface.co/datasets/huggan/pokemon).
+   - **n_rows_to_load**: The number of rows to load from the dataset. This is useful for testing your pipeline on a small scale.
+   - **column_name_mapping**: A mapping of the columns in the dataset to the columns in Fondant. Here we map the `image` column in the dataset to the `images_data` subset column in Fondant.
+   - **image_column_names**: A list of the original image column names in the dataset. This is used to format the images from the Hugging Face Hub format to a byte string.
+
+2. We add the created `ComponentOp` to the pipeline using the `add_op` method. This component has no dependencies since it is the first component in our pipeline.
+
+Next, create the file `components/load_from_hf_hub/fondant_component.yml` with the following content:
+
+```yaml
+name: Load from hub
+description: Component that loads a dataset from the hub
+image: ghcr.io/ml6team/load_from_hf_hub:latest
+
+produces:
+  images: # subset name
+    fields:
+      data: # field name
+        type: binary # field type
+
+args:
+  dataset_name:
+    description: Name of dataset on the hub
+    type: str
+  column_name_mapping:
+    description: Mapping of the consumed hub dataset to fondant column names
+    type: dict
+  image_column_names:
+    description: Optional argument, a list containing the original image column names in case the
+      dataset on the hub contains them. Used to format the image from HF hub format to a byte string.
+    type: list
+    default: None
+  n_rows_to_load:
+    description: Optional argument that defines the number of rows to load. Useful for testing pipeline runs on a small scale
+    type: int
+    default: None
+```
+
+This is the component spec of the component we have just added to our pipeline. The only thing we have altered is the `produces` section, in which we define which subsets, fields, and types this component produces.
+
+Your project should now look like this:
+
+```
+.
+├── components
+│   └── load_from_hf_hub
+│       └── fondant_component.yml
+└── pipeline.py
+```
+
+We now have a fully functional Fondant pipeline. It does not do much yet, but it is a good starting point to build upon. We can already try running this limited example to validate our setup; the assembled `pipeline.py` is sketched below for reference.
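+
+Putting the pieces together, your complete `pipeline.py` should look roughly like this; it is simply the two snippets from above combined into one file:
+
+```Python
+from fondant.pipeline import Pipeline, ComponentOp
+
+# Initialize the pipeline; point base_path at your own local or cloud storage location.
+my_pipeline = Pipeline(
+    pipeline_name='my_pipeline',
+    base_path='/home/username/my_pipeline',  # TODO: update this
+    pipeline_description='This is my pipeline',
+)
+
+# Configure the reusable load_from_hf_hub component with our dataset arguments.
+load_from_hf_hub = ComponentOp.from_registry(
+    name='load_from_hf_hub',
+    component_spec_path='components/load_from_hf_hub/fondant_component.yml',
+    arguments={
+        'dataset_name': 'huggan/pokemon',
+        'n_rows_to_load': 100,
+        'column_name_mapping': {
+            'image': 'images_data',
+        },
+        'image_column_names': ['image'],
+    }
+)
+
+# Register the component as the first (and so far only) step of the pipeline.
+my_pipeline.add_op(load_from_hf_hub, dependencies=[])
+```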
+
+## Running your pipeline
+
+A Fondant pipeline needs to be compiled before it can be run. This means translating the user-friendly Fondant pipeline definition into something that can be executed by a runner.
+
+There are currently two runners available:
+
+- **Local runner**: Runs the pipeline locally on your machine. This is useful for testing your pipeline. We leverage Docker Compose to compile and run the pipeline locally.
+- **Kubeflow runner**: Runs the pipeline on a Kubeflow cluster. This is useful for running your pipeline in production on the full data.
+
+Fondant has a feature-rich CLI that helps you with these steps. Let's start by running our pipeline with the local runner:
+
+```bash
+fondant run pipeline:my_pipeline --local
+```
+
+We call the Fondant CLI to compile and run our pipeline, passing a reference to our pipeline using the import_string syntax `pipeline:my_pipeline`.
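+
+To make the reference explicit, here is the same command annotated. This reading of the import string is an assumption based on the project layout above, where `pipeline.py` defines the `my_pipeline` variable:
+
+```bash
+# <module>:<attribute>: "pipeline" refers to pipeline.py in the current
+# directory, and "my_pipeline" is the Pipeline instance defined inside it.
+fondant run pipeline:my_pipeline --local
+```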