Validate consumes and infer produces #806

mrchtr · 2024-01-22T16:26:12Z

First draft implementation to resolve #752

I've added a schema inference method, based on pandera, to the PandasTransformComponent class. The classmethod executes the transform and returns the produced schema. It felt like the right place because we can test it independently. In pipeline.py, we call the method if it is implemented.

I still have to implement the same for DaskTransformComponent and fix the test. I wanted to get some feedback on whether this is going in the right direction before I continue.

PhilippeMoussalli · 2024-01-23T09:12:14Z

src/fondant/component/component.py

+
+        input_df = input_schema.example(size=5)
+        output_df = cls(consumes=consumes, produces={}).transform(dataframe=input_df)
+        output_schema = pandera.infer_schema(output_df)


I guess the simulation happens by passing a small subset of the data through the transform function. This can be problematic in two ways:

How well can pandera simulate all the different transforms? Can we try this approach with some of our existing components and see if it matches expectations?

This would require all the dependencies of the transform function to be installed at compile time, which is simply not feasible.

Before going ahead with the implementation, it might be best to check both of those assumptions. Otherwise, we might need to opt for another approach.

mrchtr · 2024-01-30

I've added a try/except block and a warning if a module is not available.
A user would have to provide a produces schema manually if the schema can't be detected automatically due to missing requirements.
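A minimal sketch of the fallback described above, using only the standard library. The helper name `try_infer_produces` and the zero-argument `infer` callable are hypothetical and are not taken from the PR; the sketch only illustrates catching a missing dependency, warning the user, and falling back to a manually provided schema.

```python
import typing as t
import warnings


def try_infer_produces(
    infer: t.Callable[[], t.Dict[str, t.Any]],
) -> t.Optional[t.Dict[str, t.Any]]:
    """Run `infer` and return its schema, or None if a dependency is missing.

    `infer` is a hypothetical callable that executes the transform on example
    data and returns the produced schema.
    """
    try:
        return infer()
    except ModuleNotFoundError as exc:
        # The transform's requirements are not installed at compile time,
        # so the schema cannot be detected automatically.
        warnings.warn(
            f"Could not infer the produces schema: module {exc.name!r} is not "
            "installed. Please provide `produces` manually.",
            stacklevel=2,
        )
        return None
```

Returning `None` rather than raising keeps compilation going, so the caller can decide whether a missing `produces` is an error.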

PhilippeMoussalli

Thanks Matthias! Left a few comments

PhilippeMoussalli · 2024-01-30T14:09:22Z

src/fondant/pipeline/pipeline.py

@@ -220,12 +220,27 @@ def from_ref(cls, ref: t.Any, **kwargs) -> "ComponentOp":
                image = ref.image()
                description = ref.__doc__ or "lightweight component"

+                # Try to determine produce for TransformComponents


Suggested change

-                # Try to determine produce for TransformComponents
+                # Try to determine produce for PandasTransformComponents

PhilippeMoussalli · 2024-01-30T14:13:13Z

src/fondant/pipeline/pipeline.py

+                # Try to determine produce for TransformComponents
+                if hasattr(cls, "resolve_produce"):
+                    try:
+                        produces = cls.resolve_produce(consumes=kwargs["consumes"])


I think we should only attempt to resolve it if no produces argument is passed to the component. It would also be good to log to the user that the schema was inferred, and to show the inferred schema.
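A rough sketch of that suggestion, assuming a hypothetical helper `resolve_produces` (only the `resolve_produce` classmethod name comes from the diff above). It skips inference when `produces` is passed explicitly and logs the inferred schema otherwise.

```python
import logging
import typing as t

logger = logging.getLogger(__name__)


def resolve_produces(
    component_cls: type,
    consumes: t.Optional[t.Dict[str, t.Any]],
    produces: t.Optional[t.Dict[str, t.Any]] = None,
) -> t.Optional[t.Dict[str, t.Any]]:
    """Return the user-provided `produces`, or try to infer it."""
    # An explicit `produces` always wins; no inference is attempted.
    if produces is not None:
        return produces

    # Only components that implement `resolve_produce` can be inferred.
    if not hasattr(component_cls, "resolve_produce"):
        return None

    inferred = component_cls.resolve_produce(consumes=consumes)
    # Tell the user that the schema was inferred, and what it is.
    logger.info("No `produces` provided; inferred schema: %s", inferred)
    return inferred
```

With this shape, `from_ref` would only need a single call, and the inference path stays testable without building a pipeline.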

mrchtr · 2024-02-15T09:40:24Z

Will be revisited during the implementation of #830.

mrchtr added 2 commits January 22, 2024 15:23

Drafting infer produces schema

3d3483d

Drafting infer produces schema

6575423

mrchtr requested review from RobbeSneyders, GeorgesLorre and PhilippeMoussalli January 22, 2024 16:26

PhilippeMoussalli reviewed Jan 23, 2024

mrchtr added 2 commits January 29, 2024 10:09

Merge remote-tracking branch 'origin/main' into feature/validate-cons…

18f357e

…umes-and-infer-produces

Add pyarrow type transformation

29a9b43

mrchtr mentioned this pull request Jan 30, 2024

Validate consumes and infer produces for Lightweight Python components #752

Open

PhilippeMoussalli reviewed Jan 30, 2024

mrchtr mentioned this pull request Feb 15, 2024

Enable eager execution #830

Open

mrchtr closed this Feb 15, 2024
