Raising errors for pyspark dataframe validation #73
Comments
Hi @michal-mmm, I like the idea of adding `lazy=False`. In the future, we should evaluate how to handle validations with `pyspark` DataFrames.
Hi @michal-mmm, could you make the PR with that change? And see what @Galileo-Galilei thinks about it. I am happy to help if you can't.
Hi, sorry for not responding earlier. I think we should go forward. I suggest that we implement, in general, some kwargs to be passed to the `validate` call, e.g. in the catalog:

```yaml
my_dataset:
  type: ...
  filepath: ...
  metadata:
    pandera:
      schema: ...
      validate_kwargs:
        lazy: true
```

and then in the hook:
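A minimal sketch of what the hook side could look like, assuming the hook already has access to the dataset's `metadata` dict; the `validate_kwargs` lookup and unpacking below are illustrative, not a final design:

```python
# Forward any user-supplied validate_kwargs from the catalog entry
# to pandera's validate call (defaulting to no extra kwargs).
pandera_meta = metadata["pandera"]
validate_kwargs = pandera_meta.get("validate_kwargs", {})
data = pandera_meta["schema"].validate(data, **validate_kwargs)
```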
Feel free to open a PR, and possibly suggest a different design.
Closed by #78
Description

By default, `pandera` does not raise errors for `pyspark` DataFrames. Instead, it records validation errors within the `df.pandera.errors` attribute, e.g.:
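For instance, an illustrative sketch based on the `pandera.pyspark` `DataFrameModel` API (the schema and data below are made up for demonstration):

```python
import pandera.pyspark as pa
import pyspark.sql.types as T
from pyspark.sql import SparkSession


class Schema(pa.DataFrameModel):
    # Hypothetical schema: prices must be strictly positive
    price: T.IntegerType() = pa.Field(gt=0)


spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(-1,)], schema="price int")

validated = Schema.validate(df)  # no exception, even though -1 fails the check
print(validated.pandera.errors)  # the errors are collected here instead
```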
As per the `pandera` documentation, this is the intended behavior for the `pyspark.sql` backend.

Context
Currently, it is not possible to have validation failures raised for `pyspark` DataFrames directly; the only workaround is manually inspecting the `df.pandera.errors` attribute.

Possible Implementation
To enforce immediate error raising during validation, one can set `lazy=False` when calling the validation method:

```python
metadata["pandera"]["schema"].validate(data, lazy=False)
```
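As a sketch of the resulting behavior, assuming the failure surfaces as `pandera.errors.SchemaError` (the exact exception type is an assumption here, and `schema`/`df` are placeholders):

```python
from pandera.errors import SchemaError

try:
    schema.validate(df, lazy=False)
except SchemaError as exc:  # assumed exception type; raised on the first failing check
    print(f"Validation failed: {exc}")
```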
This setting might be more suitable for machine learning tasks. Alternatively, validation can be toggled off entirely via the environment variable `PANDERA_VALIDATION_ENABLED` (`export PANDERA_VALIDATION_ENABLED=false`), as mentioned in the docs and #27.