Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: merge schema #2229

Closed
wants to merge 44 commits into from
Closed

feat: merge schema #2229

wants to merge 44 commits into from

Conversation

aersam
Copy link
Contributor

@aersam aersam commented Feb 28, 2024

Description

It's a draft currently. This replaces the old "overwrite_schema" parameter with a schema_write_mode parameter that basically allows to distinguish between overwrite/merge/none

Related Issue(s)

Fixes #1386

TODO

  • Get Partitioning Test through
  • Test Overwrite Schema in Append (not very meaningful, I guess, but why not?)
  • Test Python

Documentation

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Feb 28, 2024
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@aersam aersam marked this pull request as draft February 28, 2024 15:48
@aersam aersam changed the title Append python feat: merge schema Feb 28, 2024
.into_iter()
.map(|a| match a {
Action::Add(a) => a,
_ => panic!("Expected Add action"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am I allowed to do that? :)

@ion-elgreco
Copy link
Collaborator

I'll take a look later this evening! And thanks for picking this up🤗

Comment on lines 93 to 100
pub enum SchemaWriteMode {
/// Use existing schema and fail if it does not match the new schema
None,
/// Overwrite the schema with the new schema
Overwrite,
/// Append the new schema to the existing schema
Merge,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already some schema evolution code inside of the root writer module, would you be up for consolidating some of this code? It's been on my todo list to bring this write operation and the RecordBatch/JsonWriters into alignment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first looked at the code I asked myself, why there are so many writer implementations? Actually I think there should be only one writer, which is the record_batch writer. The Json Writer should just be producing record batches and use the record batch writer and the writer operation should also use the record batch writer. But maybe I oversee something here and it's probably a bigger undertaking? And also it seems the RecordBatch Writer does not have Overwrite schema yet, right?

A lot simpler would be to use the existing WriteMode enum to not duplicate there. And maybe make a separate file with the schema merge logic which then also handles the partition columns.

What level of consolidating did you had in mind?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rtyler I just tried what happens if I delete writer.rs and try to replace it with RecordBatchWriter in this PR: #2232 It seems to be possible. Would you generally prefer that approach?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After having tried some stuff I came to conclusion that this is a bigger project I'm not sure I'm the right one to do, since I don't really see the use case for RecordBatchWriter and how the connection of writer.rs to DataFusion is an issue. I'd prefer having this merged first and then discuss further code deduplication

@aersam
Copy link
Contributor Author

aersam commented Feb 29, 2024

@ion-elgreco Thanks for your feedback, I've done those changes. However for me the next step is to elaborate if this approach is the one to take or go for the more radical approach in #2232 where I just completely dropped writer.rs in favor of RecordBatchWriter. What do you think? #2232 is more of an experiment yet, but basically it seems possible

@ion-elgreco
Copy link
Collaborator

@aersam maybe take the writer.rs as the default template, extend it with the merge functionality and move that then into RecordBatchWriter? It seems to be the newest writer, so I would assume there are some changes in there that are improvments over the RecordBatchWriter

@aersam
Copy link
Contributor Author

aersam commented Mar 1, 2024

@aersam maybe take the writer.rs as the default template, extend it with the merge functionality and move that then into RecordBatchWriter? It seems to be the newest writer, so I would assume there are some changes in there that are improvments over the RecordBatchWriter

Yes, seems like. Better would be to remove RecordBatchWriter, but this seems to be a public Rust API, so would be breaking. Will continue this here

@aersam
Copy link
Contributor Author

aersam commented Mar 4, 2024

@aersam great! I'll do a thorough review later tonight after work

Good! Just did another push that should also make the pyarrow 8 tests go through, sorry :)

docs/usage/writing/index.md Show resolved Hide resolved
crates/core/src/operations/write.rs Outdated Show resolved Hide resolved
crates/core/src/operations/write.rs Outdated Show resolved Hide resolved
crates/core/src/operations/write.rs Show resolved Hide resolved
crates/core/src/operations/write.rs Outdated Show resolved Hide resolved
crates/core/src/operations/cast.rs Show resolved Hide resolved
crates/core/src/operations/cast.rs Show resolved Hide resolved
@aersam aersam requested a review from ion-elgreco March 5, 2024 05:27
Copy link
Collaborator

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some small comments on tests but after that we are good to go

python/tests/test_writer.py Outdated Show resolved Hide resolved
python/tests/test_writer.py Show resolved Hide resolved
@aersam
Copy link
Contributor Author

aersam commented Mar 5, 2024

I dont think the failure is related to my changes?

@ion-elgreco
Copy link
Collaborator

@aersam it's not, not sure what's going here maybe credentials expired?

@aersam
Copy link
Contributor Author

aersam commented Mar 5, 2024

@aersam it's not, not sure what's going here maybe credentials expired?

I'd suspect an update of object_store that was breaking in terms of the azure emulator. Suspect commit: apache/arrow-rs@0fd9b8a .

I cannot make that test run locally, but I'll try to fix object_store to use 0.9.0 and retry

@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Mar 5, 2024

@aersam it's not, not sure what's going here maybe credentials expired?

I'd suspect an update of object_store that was breaking in terms of the azure emulator. Suspect commit: apache/arrow-rs@0fd9b8a .

I cannot make that test run locally, but I'll try to fix object_store to use 0.9.0 and retry

Good catch, I was already wondering, how did anything change there. Can you maybe do another small PR that pins it so that it unblocks also other PRs

Copy link
Collaborator

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great addition :) I think many folks will be happy this feature has landed! Approved from my side

Let's have @rtyler also gives his blessing before merging🙏

Copy link
Member

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy to see this come in. I agree it's a non-trivial effort to consolidate our writer support. The write module here is just growing to be excessive and unmaintainable unfortunately. prepare_predicate_actions() as an example is now almost 400 lines of deeply important logic smashed together which does concern me, but that's not something new 😄

ion-elgreco pushed a commit that referenced this pull request Mar 5, 2024
# Description
See
#2229 (comment)

This fixes object_store to use 0.9.0 for now,
@rtyler
Copy link
Member

rtyler commented Mar 5, 2024

I am rebasing and will squash this for a merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Schema evolution mergeSchema support
3 participants