-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate consumes and infer produces #806
Conversation
src/fondant/component/component.py
Outdated
|
||
input_df = input_schema.example(size=5) | ||
output_df = cls(consumes=consumes, produces={}).transform(dataframe=input_df) | ||
output_schema = pandera.infer_schema(output_df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the simulation is happening by passing a small subset of the data through the transform
function. This can be problematic in two cases:
- How well is pandera able to simulate all the different transforms? Can we try this approach with some of our current existing components and see if it matches the expectations?
- This would require all the dependencies required for the transform function to be installed during compile time which is simply not feasible
Before going ahead with the implementation, might be best to check on both of those assumptions. Otherwise we might just need to opt for another approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a try/catch block and a warning if module is not available.
An user would have to provide a produces schema manually if the schema can't be detected automatically due to missing requirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Matthias! Left a few comments
@@ -220,12 +220,27 @@ def from_ref(cls, ref: t.Any, **kwargs) -> "ComponentOp": | |||
image = ref.image() | |||
description = ref.__doc__ or "lightweight component" | |||
|
|||
# Try to determine produce for TransformComponents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Try to determine produce for TransformComponents | |
# Try to determine produce for PandasTransformComponents |
# Try to determine produce for TransformComponents | ||
if hasattr(cls, "resolve_produce"): | ||
try: | ||
produces = cls.resolve_produce(consumes=kwargs["consumes"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should only attempt to resolve it if no produces
argument is passed to the component. Would also return valid logs to the user that it is inferred and return the inferred schema
Will be revisit during the implementation of #830 |
First draft implementation to resolve #752
I've added the schema infer method using pandera into the PandasTransformComponent class. The classmethod is used to execute the transform and return the produced schema. It felt like the right place because we can test it independently. In the pipeline.py, we can call the method if it is implemented.
I still have to implement the same for DaskTransformComponent and fix the test. I wanted to get some feedback if this goes into the right direction, before I continue.