-
Notifications
You must be signed in to change notification settings - Fork 410
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: merge schema #2229
feat: merge schema #2229
Conversation
ACTION NEEDED delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
.into_iter() | ||
.map(|a| match a { | ||
Action::Add(a) => a, | ||
_ => panic!("Expected Add action"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
am I allowed to do that? :)
I'll take a look later this evening! And thanks for picking this up🤗 |
crates/core/src/operations/write.rs
Outdated
pub enum SchemaWriteMode { | ||
/// Use existing schema and fail if it does not match the new schema | ||
None, | ||
/// Overwrite the schema with the new schema | ||
Overwrite, | ||
/// Append the new schema to the existing schema | ||
Merge, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already some schema evolution code inside of the root writer module, would you be up for consolidating some of this code? It's been on my todo list to bring this write operation and the RecordBatch/JsonWriters into alignment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I first looked at the code I asked myself, why there are so many writer implementations? Actually I think there should be only one writer, which is the record_batch writer. The Json Writer should just be producing record batches and use the record batch writer and the writer operation should also use the record batch writer. But maybe I oversee something here and it's probably a bigger undertaking? And also it seems the RecordBatch Writer does not have Overwrite schema yet, right?
A lot simpler would be to use the existing WriteMode enum to not duplicate there. And maybe make a separate file with the schema merge logic which then also handles the partition columns.
What level of consolidating did you had in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After having tried some stuff I came to conclusion that this is a bigger project I'm not sure I'm the right one to do, since I don't really see the use case for RecordBatchWriter and how the connection of writer.rs to DataFusion is an issue. I'd prefer having this merged first and then discuss further code deduplication
@ion-elgreco Thanks for your feedback, I've done those changes. However for me the next step is to elaborate if this approach is the one to take or go for the more radical approach in #2232 where I just completely dropped writer.rs in favor of RecordBatchWriter. What do you think? #2232 is more of an experiment yet, but basically it seems possible |
@aersam maybe take the writer.rs as the default template, extend it with the merge functionality and move that then into RecordBatchWriter? It seems to be the newest writer, so I would assume there are some changes in there that are improvments over the RecordBatchWriter |
Yes, seems like. Better would be to remove RecordBatchWriter, but this seems to be a public Rust API, so would be breaking. Will continue this here |
Good! Just did another push that should also make the pyarrow 8 tests go through, sorry :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some small comments on tests but after that we are good to go
I dont think the failure is related to my changes? |
@aersam it's not, not sure what's going here maybe credentials expired? |
I'd suspect an update of object_store that was breaking in terms of the azure emulator. Suspect commit: apache/arrow-rs@0fd9b8a . I cannot make that test run locally, but I'll try to fix object_store to use 0.9.0 and retry |
Good catch, I was already wondering, how did anything change there. Can you maybe do another small PR that pins it so that it unblocks also other PRs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great addition :) I think many folks will be happy this feature has landed! Approved from my side
Let's have @rtyler also gives his blessing before merging🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy to see this come in. I agree it's a non-trivial effort to consolidate our writer support. The write module here is just growing to be excessive and unmaintainable unfortunately. prepare_predicate_actions()
as an example is now almost 400 lines of deeply important logic smashed together which does concern me, but that's not something new 😄
# Description See #2229 (comment) This fixes object_store to use 0.9.0 for now,
I am rebasing and will squash this for a merge |
Description
It's a draft currently. This replaces the old "overwrite_schema" parameter with a schema_write_mode parameter that basically allows to distinguish between overwrite/merge/none
Related Issue(s)
Fixes #1386
TODO
Documentation