[DataComp] Add download images component (#348)
This PR splits the datacomp folder into 2 pipelines:

- a simple pipeline, consisting of just 3 components, which serves as a simple baseline and could be used as a first submission;
- a more advanced pipeline, which involves downloading images (using the reusable `download_images` component) and, later on, also text detection and text recognition.

It also improves the `download_images` component to leverage Dask's `map_partitions`.

Co-authored-by: Robbe Sneyders <[email protected]>
1 parent 2160ff3 · commit c4687d9
Showing 10 changed files with 225 additions and 58 deletions.
@@ -0,0 +1,8 @@
# DataComp pipeline

[DataComp](https://www.datacomp.ai/) is a competition organized by the University of Washington and others to come up with the best possible image-text dataset to train a fixed CLIP model. Hence, it's an ideal use case for Fondant, as we can leverage reusable components to filter large, noisy image-text datasets.

Currently, 2 pipelines are implemented:

- a simple pipeline (`simple_pipeline.py`), which loads the DataComp dataset from the hub and applies 2 basic filtering steps (filtering on image resolution and caption complexity; a rough sketch of such a complexity check is shown below). This pipeline serves as a baseline and could be used as a first submission.
- a more complex pipeline (`pipeline.py`), which loads the DataComp dataset from the hub, loads the actual images based on the URLs, and applies text detection and text recognition models to filter the dataset.
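The caption-complexity step of the simple pipeline is configured further down with `spacy_pipeline: en_core_web_sm` and `min_complexity: 1`. As a rough illustration of what such a check can look like, here is a minimal sketch that scores a caption by the depth of its dependency parse (assuming `en_core_web_sm` is installed via `python -m spacy download en_core_web_sm`); the depth-based metric and the `caption_complexity` helper are assumptions for illustration, not the actual logic of the `filter_text_complexity` component.

```python
# Rough, illustrative sketch of a spaCy-based caption-complexity score; the
# actual filter_text_complexity component may compute complexity differently.
import spacy

nlp = spacy.load("en_core_web_sm")  # same spaCy pipeline as in simple_pipeline.py


def caption_complexity(caption: str) -> int:
    """Score a caption by the maximum depth of its dependency parse."""
    doc = nlp(caption)

    def depth(token) -> int:
        steps = 0
        while token.head is not token:  # spaCy roots have themselves as head
            token = token.head
            steps += 1
        return steps

    return max((depth(token) for token in doc), default=0)


if __name__ == "__main__":
    captions = ["dog", "a small dog chasing a red ball in the park"]
    # Keep only captions with complexity >= 1, mirroring min_complexity=1
    print([c for c in captions if caption_complexity(c) >= 1])
```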
examples/pipelines/datacomp/components/download_images/fondant_component.yaml (61 additions & 0 deletions)
@@ -0,0 +1,61 @@
name: Download images
description: Component that downloads images based on URLs
image: ghcr.io/ml6team/download_images:dev

consumes:
  images:
    fields:
      url:
        type: string
      width:
        type: int32
      height:
        type: int32
      face_bboxes:
        type: array
        items:
          type: array
          items:
            type: float32
      sha256:
        type: utf8

produces:
  images:
    fields:
      data:
        type: binary
      width:
        type: int32
      height:
        type: int32

args:
  timeout:
    description: Maximum time (in seconds) to wait when trying to download an image
    type: int
    default: 10
  retries:
    description: Number of times to retry downloading an image if it fails.
    type: int
    default: 0
  image_size:
    description: Size of the images after resizing.
    type: int
    default: 256
  resize_mode:
    description: Resize mode to use. One of "no", "keep_ratio", "center_crop", "border".
    type: str
    default: 'border'
  resize_only_if_bigger:
    description: If True, resize only if image is bigger than image_size.
    type: bool
    default: 'False'
  min_image_size:
    description: Minimum size of the images.
    type: int
    default: 0
  max_aspect_ratio:
    description: Maximum aspect ratio of the images.
    type: float
    default: 'inf'
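The commit message mentions that the `download_images` component was improved to leverage Dask's `map_partitions`. The sketch below illustrates that idea in plain Dask, downloading and resizing images per pandas partition; the `requests`/`PIL`-based `download_and_resize` helper and the `images_url`/`images_data` column names are assumptions for illustration and do not reproduce the actual component implementation, while `timeout`, `retries`, and `image_size` mirror the args in the spec above.

```python
# Illustrative sketch only: per-partition image downloading with Dask's
# map_partitions, not the actual Fondant download_images implementation.
import io
from typing import Optional

import dask.dataframe as dd
import pandas as pd
import requests  # assumed helper library for this sketch
from PIL import Image  # assumed helper library for this sketch


def download_and_resize(url: str, timeout: int = 10, retries: int = 0,
                        image_size: int = 256) -> Optional[bytes]:
    """Download a single image, resize it, and return JPEG bytes (None on failure)."""
    for _ in range(retries + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            image = Image.open(io.BytesIO(response.content)).convert("RGB")
            image = image.resize((image_size, image_size))
            buffer = io.BytesIO()
            image.save(buffer, format="JPEG")
            return buffer.getvalue()
        except Exception:  # any failure simply triggers the next retry
            continue
    return None


def download_partition(partition: pd.DataFrame) -> pd.DataFrame:
    """Download every URL in one pandas partition and drop failed rows."""
    partition = partition.copy()
    partition["images_data"] = partition["images_url"].map(download_and_resize)
    return partition.dropna(subset=["images_data"])


def download_images(dataframe: dd.DataFrame) -> dd.DataFrame:
    """Apply the per-partition download across the full Dask dataframe."""
    meta = dataframe._meta.assign(images_data=pd.Series(dtype="object"))
    return dataframe.map_partitions(download_partition, meta=meta)
```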
@@ -0,0 +1,70 @@
"""Simplified pipeline used to filter the dataset of the Datacomp competition.""" | ||
|
||
import logging | ||
import sys | ||
|
||
sys.path.append("../") | ||
|
||
from pipeline_configs import PipelineConfigs | ||
|
||
from fondant.pipeline import ComponentOp, Pipeline, Client | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
# Initialize pipeline and client | ||
pipeline = Pipeline( | ||
pipeline_name="datacomp-filtering", | ||
pipeline_description="A pipeline for filtering the Datacomp dataset", | ||
base_path=PipelineConfigs.BASE_PATH, | ||
) | ||
client = Client(host=PipelineConfigs.HOST) | ||
|
||
# define ops | ||
load_component_column_mapping = { | ||
"url": "images_url", | ||
"original_width": "images_width", | ||
"original_height": "images_height", | ||
"face_bboxes": "images_face_bboxes", | ||
"sha256": "images_sha256", | ||
"text": "text_data", | ||
"uid": "image_text_uid", | ||
"clip_b32_similarity_score": "image_text_clip_b32_similarity_score", | ||
"clip_l14_similarity_score": "image_text_clip_l14_similarity_score", | ||
} | ||
|
||
load_from_hub_op = ComponentOp( | ||
component_dir="components/load_from_hf_hub", | ||
arguments={ | ||
"dataset_name": "nielsr/datacomp-small-with-embeddings", | ||
"column_name_mapping": load_component_column_mapping, | ||
}, | ||
node_pool_label="node_pool", | ||
node_pool_name="n2-standard-128-pool", | ||
) | ||
filter_image_resolution_op = ComponentOp.from_registry( | ||
name="filter_image_resolution", | ||
arguments={"min_image_dim": 200, "max_aspect_ratio": 3}, | ||
node_pool_label="node_pool", | ||
node_pool_name="n2-standard-128-pool", | ||
) | ||
filter_complexity_op = ComponentOp( | ||
component_dir="components/filter_text_complexity", | ||
arguments={ | ||
"spacy_pipeline": "en_core_web_sm", | ||
"batch_size": 1000, | ||
"min_complexity": 1, | ||
}, | ||
node_pool_label="node_pool", | ||
node_pool_name="n2-standard-128-pool", | ||
output_partition_size="disable", | ||
) | ||
|
||
# add ops to pipeline | ||
pipeline.add_op(load_from_hub_op) | ||
pipeline.add_op(filter_image_resolution_op, dependencies=load_from_hub_op) | ||
pipeline.add_op(filter_complexity_op, dependencies=filter_image_resolution_op) | ||
# TODO add more ops | ||
|
||
|
||
if __name__ == "__main__": | ||
client.compile_and_run(pipeline=pipeline) |
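For context, the more advanced `pipeline.py` described in the README wires the new `download_images` component into a pipeline in the same style as above. The snippet below is only an illustrative sketch of how such an op could be declared and chained; it reuses `ComponentOp`, `pipeline`, and `filter_complexity_op` from the file above, the argument values simply echo the defaults from the component spec, and it is not the actual `pipeline.py` from this commit.

```python
# Illustrative sketch: declaring and wiring the download_images component,
# following the same ComponentOp pattern used in simple_pipeline.py above.
download_images_op = ComponentOp(
    component_dir="components/download_images",
    arguments={
        "timeout": 10,
        "retries": 0,
        "image_size": 256,
        "resize_mode": "border",
    },
    node_pool_label="node_pool",
    node_pool_name="n2-standard-128-pool",
)

# Download images only for the rows that survive the earlier filtering steps.
pipeline.add_op(download_images_op, dependencies=filter_complexity_op)
```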