-
Notifications
You must be signed in to change notification settings - Fork 794
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(parquet)!: coerce_types flag for date64 #6313
base: master
Are you sure you want to change the base?
feat(parquet)!: coerce_types flag for date64 #6313
Conversation
Flagging @tustvold who filed the original issue #1938 Thank you @dsgibbons |
@tustvold I know this isn't high priority, but just bumping this so I can get some feedback before working on accompanying PRs. |
@dsgibbons do you have a usecase for this feature? There is a workaround which is for the user to explicitly call I wonder if you need this feature yourself or if you are trying to help workdown the arrow-rs backlog |
The latter |
Thank you. In that case I will let @tustvold comment as he filed the original ticket and I have no additional context |
IIRC the request came from people coming from arrow2 which did something similar. I'm afraid I've lost most context on this and don't have time at the moment to review, but I'm sure one of the other maintainers will be able to advise on a path forward |
I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more |
Thanks for taking this on @dsgibbons. I may be confused, but it appears that the approach you use for the non-coerced case is to write DataType::Date64 => {
if coerce_types {
Type::primitive_type_builder(name, PhysicalType::INT32)
.with_logical_type(Some(LogicalType::Date))
.with_repetition(repetition)
.with_id(id)
.build()
} else {
Type::primitive_type_builder(name, PhysicalType::INT64)
.with_repetition(repetition)
.with_id(id)
.build()
}
}, I don't think the write side is too far off. On the read side, I think you'll still have to account for the |
Thank you for taking the time to look at this @etseidl. I'm still new to the project so I have plenty to learn. From #1938:
I think I interpreted this as the Parquet
So if we can't embed the fact that the field refers to a date in the Parquet I thought that all type information was inferred from the Parquet file. I mistakenly removed the On another note, are you OK with the breaking change introduced by: |
👋 Welcome! I'm pretty new here too, and still learning as well 😄.
Yes. The Parquet format allows for key-value pairs in the metadata. What the arrow writers will do for certain types that aren't representable in Parquet (most of them time based), is to embed a base64 encoded arrow schema (which is serialized with flatbuffers) in one of these key/value pairs, with the key being My hope is that the arrow schema will preserve the fact that in arrow it was a Did that make sense? Hopefully someone with more arrow-side experience will correct me if I've messed up some details.
Breaking changes are ok, but with the caveat that they're only allowed in major releases (with the next 54.0.0 release slated for December). IIUC what will happen is once this is PR ready for merge, it will be changed to "Draft" until development on 54.0.0 commences. I'll take another look and see if there's a way to avoid breaking the API. Perhaps new functions that accept the |
Makes sense. I didn't know about |
I finally got a chance to update this. I think this is correct now. Fortunately, this PR is a bit shorter now because some of the previous unit tests no longer apply. I'm not too sure what to do with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partial review. Looking good so far! I still have to look at statistics.
parquet/src/file/properties.rs
Outdated
@@ -279,6 +282,13 @@ impl WriterProperties { | |||
self.statistics_truncate_length | |||
} | |||
|
|||
/// Returns `coerce_types` boolean | |||
/// | |||
/// `true` if type coercion enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to add some words about what type coercion does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done :)
[<$stat_type_prefix Int32StatsIterator>]::new($iterator) | ||
.map(|x| x.map(|x| i64::from(*x) * 24 * 60 * 60 * 1000)), | ||
[<$stat_type_prefix Int64StatsIterator>]::new($iterator).map(|x| x.copied()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Statistics
struct keeps the parquet physical type info, but I'm not sure how to get at it here. 🤔
We have to know whether the physical type is INT32 or INT64, and if INT32 we still have to do the scaling of days to milliseconds (and use the Int32StatsIterator
). Perhaps add physical type to the StatisticsConverter
struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've addressed this now. Could you please take another look? :)
I think this is an api change so marking it thusly |
a1f0ca6
to
169ba01
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking really good @dsgibbons! Thanks! Just one tiny nit in documentation.
cc @alamb for a second opinion
Co-authored-by: Ed Seidl <[email protected]>
Which issue does this PR close?
Relates to #1938, which will be closed by future PRs.
What changes are included in this PR?
This PR introduces the
coerce_types
flag forWriterProperties
. To start, I've only addressed theDate64
case. The desired behaviour forDate64
is captured in #1938. I've added tests to ensure thatDate64
is handled correctly whencoerce_types=false
andcoerce_types=true
. I've also added some testing aroundDate32
to ensure I haven't accidentally broken anything there.I've deliberately avoided the other types mentioned in #1938 because I wanted to make sure we are happy with how
coerce_types
works forDate64
first. Once accepted, I'll raise PRs for the remaining types to finally close out #1938.One thing missing from this PR is validation on
Date64
. The C++ implementation has afull_validation
option that checks that allDate64
lie on a date boundary (i.e., a multiple of 1000 * 60 * 60 * 24). I've started work on adding this in arrow-rs and intend to raise a separate PR for enablingDate64
validation in thearrow-array
crate.Are there any user-facing changes?
Breaking changes:
arrow_to_parquet_schema(schema: &Schema, coerce_types: bool)
- I wasn't sure how else we could propagate thecoerce_types
option down to the parquet schema.arrow_to_parquet_schema_with_root(schema: &Schema, root: &str, coerce_types: bool)
- same as above.coerce_types
flag inWriterProperties
(not a major issue as typically created viaWriterPropertiesBuilder
).User facing changes:
Date64
logical type.