Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate consumes and infer produces #806

Closed
wants to merge 4 commits into from

Conversation

mrchtr
Copy link
Contributor

@mrchtr mrchtr commented Jan 22, 2024

First draft implementation to resolve #752

I've added the schema infer method using pandera into the PandasTransformComponent class. The classmethod is used to execute the transform and return the produced schema. It felt like the right place because we can test it independently. In the pipeline.py, we can call the method if it is implemented.

I still have to implement the same for DaskTransformComponent and fix the test. I wanted to get some feedback if this goes into the right direction, before I continue.


input_df = input_schema.example(size=5)
output_df = cls(consumes=consumes, produces={}).transform(dataframe=input_df)
output_schema = pandera.infer_schema(output_df)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the simulation is happening by passing a small subset of the data through the transform function. This can be problematic in two cases:

  • How well is pandera able to simulate all the different transforms? Can we try this approach with some of our current existing components and see if it matches the expectations?
  • This would require all the dependencies required for the transform function to be installed during compile time which is simply not feasible

Before going ahead with the implementation, might be best to check on both of those assumptions. Otherwise we might just need to opt for another approach

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a try/catch block and a warning if module is not available.
An user would have to provide a produces schema manually if the schema can't be detected automatically due to missing requirements.

Copy link
Contributor

@PhilippeMoussalli PhilippeMoussalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Matthias! Left a few comments

@@ -220,12 +220,27 @@ def from_ref(cls, ref: t.Any, **kwargs) -> "ComponentOp":
image = ref.image()
description = ref.__doc__ or "lightweight component"

# Try to determine produce for TransformComponents
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Try to determine produce for TransformComponents
# Try to determine produce for PandasTransformComponents

# Try to determine produce for TransformComponents
if hasattr(cls, "resolve_produce"):
try:
produces = cls.resolve_produce(consumes=kwargs["consumes"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should only attempt to resolve it if no produces argument is passed to the component. Would also return valid logs to the user that it is inferred and return the inferred schema

@mrchtr mrchtr mentioned this pull request Feb 15, 2024
@mrchtr
Copy link
Contributor Author

mrchtr commented Feb 15, 2024

Will be revisit during the implementation of #830

@mrchtr mrchtr closed this Feb 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validate consumes and infer produces for Lightweight Python components
2 participants