-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compare schema as logically equivalent to workaround disappearing metadata #12631
base: main
Are you sure you want to change the base?
Conversation
I have some suggestion
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a reasonable change, but it does, as you say, potentially cause silent errors rather than explicit ones.
This is a good idea for Influx, but maybe not as good an idea for other users of DataFusion
Something we have done in the past for situations like this was to have a config flag that could control the behavior
For example, this flag
datafusion.optimizer.skip_failed_rules
Controls if a failure in an optimizer rule would reject the entire plan or just skip that particular failed rule
Perhaps we could make a similar flag for schema checks like
datafusion.optimizer.only_compare_logical_types
Or something 🤔
Clicked the wrong button -- I don't think we should merge this on by default
Does |
I agree it should also compare the nullability (but not the metadata) Looks like the current code doesn't look at nullability though: https://docs.rs/datafusion-common/42.0.0/src/datafusion_common/dfschema.rs.html#974 |
I filed #12687 to track the underlying issue here. Let's leave this PR as a draft for now until we have a reproducer for the actual problem |
Which issue does this PR close?
This works around some problems noted in #12560. It is not a full fix, but just unblocks some other work for me.
Rationale for this change
This check that is being done (which I am changing) only exists to make sure the schema are logically equivalent - if they are different in the metadata of the fields that they contain, that doesn't affect how they are used later on. So it's more 'correct' to just compare them logically as opposed to fully. Comparing them fully does have the benefit of catching issues with lossy transformations between schema (which is probably what is happening in my case to lose the metadata that I mentioned in #12560), but I think that's a benefit that we can get again once we've figured out where metadata is disappearing.
Are these changes tested?
Yes
Are there any user-facing changes?
No