TableMetadataBuilder #587
base: main
Conversation
Force-pushed from 6afeeb0 to 4193131
ToDos:
Fixes #232
Thanks @c-thiel for this PR, I've skimmed through it and it looks great to me. However, this PR is too huge to review (3k lines); would you mind splitting it into smaller ones? For example, we could add one PR for the methods involved in one …
Thanks for your feedback @liurenjie1024. This isn't really a refactoring of the builder, it's more of a complete rewrite. The old builder allowed creating corrupt metadata in various ways. Splitting it up by … I would currently prefer to keep it as a larger block, mainly because:
We now have a vision of what it could look like in the end. Before putting any more effort in, we should answer the following questions:
Those points might change the overall design quite a bit and might require a re-write of … After we have answered those questions, and we still think splitting makes sense, I can try to find time to build stacked PRs. Maybe just splitting normalization / validation in …
@liurenjie1024 I tried to cut a few things out - but not along the lines of …
After they are all merged, I'll rebase this PR for the actual builder.
Hi @c-thiel, sorry for the late reply.
I went through the new builder, and I think your design is the right direction.
To be honest, I don't quite understand the use case. We can ask for background on this in the dev channel, but I think this is not a blocker for this PR; we can always add this later.
I took a look at the comments on these two PRs: apache/iceberg#6701 apache/iceberg#7445 And I think the reason for the behavior is the …
I agree that this should be required, as I mentioned in #550
That sounds reasonable to me. If one PR per table update is too much of a burden, could we split them by components, for example sort order, partition spec, schema changes?
@liurenjie1024 thanks for the feedback!
The problem with changing it later is that it changes the semantics of the function. Right now we expect source_id to match the … In my opinion ids are much cleaner than names (we might have dropped and re-added a column with the same name in the meantime), so I am OK with going forward. However, moving over to Java semantics will require new endpoints (i.e. … Give me a thumbs up if that's OK for you. I'll also open a discussion in the dev channel to get some more opinions.
I don't think we should add the argument, to be honest. My reasoning is as follows: … Maybe @nastra or @Fokko could add some comments on the intention of that parameter?
I have reviewed most PRs that I am confident can be merged. The only one left is #615, for which I need more input. |
Force-pushed from a3c1c89 to fea1817
Force-pushed from e00450b to e36d834
@Xuanwo, @liurenjie1024 this PR is ready for another round of review. It's now rebased on the 6 PRs we merged during the last few months. The core logic is ~1100 lines of code, including quite a few comments.
@c-thiel I left a few comments and suggestions for improvement, LMKWYT. Apart from that it looks great and good to go 👍
```rust
    return Ok(self);
}

// ToDo Discuss: Java builds a fresh spec here:
```
We just do this once; after that the schema is evolved using the `UpdateSchema`, which is implemented by the `SchemaUpdate`. There you pass in the `last-column-id`, and new fields will increment from that.
That is actually a very important point, and I think we should fix Rust. Currently in Rust we use the `SchemaBuilder` for everything. To get to it, we use an existing `Schema` and call its `into_builder(self)` method.
I believe we should change this to `into_builder(self, last_column_id: Option<i64>)` so that users are forced to think about the `last_column_id`. Most likely they will want to use the `last_column_id` of the metadata they got the `Schema` from.
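A minimal sketch of what that proposed signature change could look like. All types here are simplified stand-ins, not the actual iceberg-rust structs, and ids are shown as `i32` (as in the builder's `last_column_id()` accessor) rather than `i64`:

```rust
// Simplified stand-in for the real Schema: only tracks the highest
// field id it currently contains.
struct Schema {
    highest_field_id: i32,
}

// Simplified stand-in for SchemaBuilder: new fields get ids
// incrementing from `last_column_id`.
struct SchemaBuilder {
    last_column_id: i32,
}

impl Schema {
    // Proposed change: callers must decide which last_column_id to
    // continue from -- usually the one recorded in the TableMetadata
    // this schema came from, which may be higher than any id still
    // present in the schema (e.g. after a field was dropped).
    fn into_builder(self, last_column_id: Option<i32>) -> SchemaBuilder {
        SchemaBuilder {
            last_column_id: last_column_id.unwrap_or(self.highest_field_id),
        }
    }
}

impl SchemaBuilder {
    // Assign the next fresh field id.
    fn next_field_id(&mut self) -> i32 {
        self.last_column_id += 1;
        self.last_column_id
    }
}
```

With `Some(5)` passed in, a schema whose highest live field id is 3 still hands out 6 as the next id, so ids of dropped columns are never reused.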
I'm leaning toward the other way, where the `last-field-id` is tracked by iceberg-rust itself and not exposed to the user. The `last-schema-id` is a monotonically increasing ID that keeps track of which field-IDs are already being used.
If we do it completely internally, then we should probably not allow, or at least discourage, `into_builder` on schema.
I am unsure what an alternative API should look like. @liurenjie1024 some input would be very helpful.
My thoughts:

- Introduce `TableMetadata.update_schema` that returns the current `SchemaBuilder`, however with a fixed `last-column-id` (set to the `TableMetadata` value). We document that `add_schema` should only be used for schemas modified in this way to ensure that `last_column_id` is correct.
- If documentation is not enough, we could introduce a new type `SchemaUpdate` that can only be created by calling `TableMetadata.update_schema`. It would hold the `TableMetadataBuilder` internally and have methods like `add_field` or `remove_field`. Upon build, it would return the `TableMetadataBuilder`.
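The second option could be sketched roughly as follows. These are hypothetical, simplified stand-ins for the real iceberg-rust types; the point is only the ownership flow that makes `last_column_id` impossible to get wrong:

```rust
// Simplified stand-in for TableMetadataBuilder.
struct TableMetadataBuilder {
    last_column_id: i32,
    fields: Vec<String>,
}

// Can only be created via TableMetadataBuilder::update_schema, so it
// always starts from the authoritative last_column_id.
struct SchemaUpdate {
    builder: TableMetadataBuilder,
}

impl TableMetadataBuilder {
    // The only entry point: consumes the builder so no one can mutate
    // the metadata while a schema update is in flight.
    fn update_schema(self) -> SchemaUpdate {
        SchemaUpdate { builder: self }
    }
}

impl SchemaUpdate {
    // Each added field bumps last_column_id internally.
    fn add_field(mut self, name: &str) -> Self {
        self.builder.last_column_id += 1;
        self.builder.fields.push(name.to_string());
        self
    }

    // Upon build, hand the TableMetadataBuilder back.
    fn build(self) -> TableMetadataBuilder {
        self.builder
    }
}
```

Because `SchemaUpdate` owns the builder, users cannot construct an incompatible schema out of band and feed it back in.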
When we create a new table, we also reassign ids with this builder (and also in Lakekeeper).
I think we agree then that we should have a `SchemaUpdate` in iceberg-rust as well. The issue has already been created by Fokko. Because we don't have the `SchemaUpdate` now, we should keep the `into_builder()` method of schemas for now.
REST catalogs will need to use `add_schema` anyway, even if the schema might be incompatible. This is why I need this method to stay public.
Going forward I propose:

- Keep `add_schema` for now, as there is no alternative.
- Implement "Add `SchemaUpdate` logic to Iceberg-Rust" #697, and add corresponding better methods to the `TableMetadataBuilder` to mutate the schema. Discourage the use of `add_schema` in comments, but leave the method public.
What do you think?
- I agree that we should keep `add_schema` for now.
- I agree that we should implement the issue proposed by @Fokko, which will be part of our table transaction API.
- I don't think we should discourage the use of `add_schema`, since they have different responsibilities. `TableMetadataBuilder` does some necessary checks to avoid broken table metadata, but it doesn't guarantee that behavior. We should be careful with our transaction, which should guarantee to produce valid table metadata.

In fact, I'm thinking about making all `TableMetadataBuilder` methods crate-private. Java makes them public since Java's access modifiers are limited and don't support complex path visibility like Rust. But I think these methods should be called by the transaction API or by applying `TableUpdate`. What do you think?
When adding a word of caution to the method, I'm fine with keeping it public for now.
I know that this is quite late in the process here, but I think it might be complementary to the builder we're seeing here. For PyIceberg we've taken a different approach, where we apply the `MetadataUpdate`s (`AddSchema`, `SetCurrentSchema`, etc.) to the metadata, rather than having a builder where you can modify the `TableMetadata`. Code can be found here: https://github.com/apache/iceberg-python/blob/60800d88c7e44fe2ed58a25f1b0fcd5927156adf/pyiceberg/table/update/__init__.py#L215-L494
The `TableUpdate` is currently only part of the REST catalog - not of the core spec.
I think we should keep them public unless we switch to the Python approach.
I added a word of warning that we do not check compatibility. Let me know if this is sufficient. We should extend it with a link to the new method that is implemented in #697.
> The TableUpdate is currently only part of the rest catalog - not of the core spec.

Just to mention that it's part of the core crate in Rust:

```rust
pub enum TableUpdate {
```

> I think we should keep them public unless we go switch to the python approach.

In fact, the original design followed the Java/Python approach, see:

```rust
pub struct Transaction<'a> {
```
Okay, I've added this to the spec a while ago: apache#7445. But I think this was a mistake, and we should not expose this in the public APIs, as it is much better to track it internally. I noticed this while reviewing apache/iceberg-rust#587. Removing this from the APIs in Java, plus the OpenAPI update, makes it much more resilient and doesn't require the clients to compute this value. For example, when there are two conflicting schema changes, the last-column-id must be recomputed correctly when doing the retry operation.
Force-pushed from 226a485 to 164e90c
Thanks @c-thiel for this great PR; generally LGTM! I've left some comments for improvement, but overall it looks great! Last but not least, I would highly suggest splitting this into smaller ones so that we can review it more carefully. As you can see, we have some discussions around it which don't seem to have a clear answer.
```rust
pub fn current_schema(&self) -> &SchemaRef {
    self.metadata.current_schema()
}

/// Get the current last column id
#[inline]
pub fn last_column_id(&self) -> i32 {
    self.metadata.last_column_id
}

/// Get the current last updated timestamp
#[inline]
pub fn last_updated_ms(&self) -> i64 {
    self.metadata.last_updated_ms
}
```
As with `TableMetadataBuilder`, I don't think we should expose these functions as public APIs.
My motivation here was actually `add_schema`.
It's very hard to build a compatible schema if there is no way to determine the current schema. We could argue, however, that users should just track the schema externally.
I am happy with making them private; I just wanted to give the reason why I added them.
If you still think that's better, let me know and I'll remove them.
I think we should make them private. Users are not supposed to call them directly, and are expected to go through the transaction API; see the Java example here: https://github.com/apache/iceberg/blob/9aeea63f8b74a6e0a468baaf7c705c19df0ebfb7/core/src/main/java/org/apache/iceberg/SchemaUpdate.java#L48
I think pyiceberg follows a similar approach.
```rust
    self.metadata.snapshot_log.clear();
}

if self.metadata.refs.remove(ref_name).is_some() || ref_name == MAIN_BRANCH {
```

Suggested change:

```diff
-if self.metadata.refs.remove(ref_name).is_some() || ref_name == MAIN_BRANCH {
+if self.metadata.refs.remove(ref_name).is_some() {
```
Though it's correct, I think we should not add this check?
I think we should keep it, actually.
A few lines above we have `if ref_name == MAIN_BRANCH {}`. Thus, if `ref_name` is main, then we mutate the internal state of the builder - even if the branch does not exist. Because we mutate the state of the builder, we should record the change.
So I think we have two choices:

- Leave the `|| ref_name == MAIN_BRANCH`.
- Only clear the snapshot log & set the last snapshot to None if the main branch actually exists. To do this we modify the if above and apply the change you proposed.
@Fokko I guess it's a nit, but Java has the same problem:
If main does not exist, the internal state of the builder is mutated but the change is not recorded in changes:
https://github.com/apache/iceberg/blob/daa24f9c3a56e18d188097deb2dd79cc991c9a78/core/src/main/java/org/apache/iceberg/TableMetadata.java#L1269-L1271
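The first choice (keeping the `|| ref_name == MAIN_BRANCH` check) can be sketched like this; `Builder`, the refs map, and the change log here are simplified stand-ins for the real builder state:

```rust
use std::collections::HashMap;

const MAIN_BRANCH: &str = "main";

// Simplified stand-in for TableMetadataBuilder state.
struct Builder {
    refs: HashMap<String, u64>, // ref name -> snapshot id
    snapshot_log: Vec<u64>,
    changes: Vec<String>, // recorded TableUpdates, as plain strings here
}

impl Builder {
    fn remove_ref(&mut self, ref_name: &str) {
        if ref_name == MAIN_BRANCH {
            // State is mutated even if "main" was never registered ...
            self.snapshot_log.clear();
        }
        // ... so the change for "main" is recorded unconditionally,
        // matching the mutation above.
        if self.refs.remove(ref_name).is_some() || ref_name == MAIN_BRANCH {
            self.changes.push(format!("remove-ref:{ref_name}"));
        }
    }
}
```

Removing a non-existent non-main ref records nothing, while removing "main" always records a change because the snapshot log was cleared either way.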
I agree that if `ref_name == MAIN_BRANCH`, we have changed it. My assumption is that we always have `MAIN_BRANCH` since it's the default branch? Please correct me if I'm wrong.
Let me check on the Java side, thanks for the pointer 👍
Finished the second round of review. There are some files I haven't checked yet; they need another review.
```rust
pub fn current_schema(&self) -> &SchemaRef {
    self.metadata.current_schema()
}

/// Get the current last column id
#[inline]
pub fn last_column_id(&self) -> i32 {
    self.metadata.last_column_id
}

/// Get the current last updated timestamp
#[inline]
pub fn last_updated_ms(&self) -> i64 {
    self.metadata.last_updated_ms
}
```
I think we should make all three of these methods private.
See my comment here: #587 (comment)
```rust
/// # Errors
/// - The ref is unknown.
/// - Any of the preconditions of `self.add_snapshot` are not met.
pub fn append_snapshot(self, snapshot: Snapshot, ref_name: Option<&str>) -> Result<Self> {
```
- In fact, that method is a wrong design to me. The implementation lacks some necessary checks, and it's inconsistent with other parts; e.g. `TableMetadata` is designed to be immutable, and all modifications should go through `TableMetadataBuilder`. I think we should deprecate that API and remove it in the future.
- Yes, but it's weird when it has a `branch` argument. And the reason I suggest this is that I want to keep the API consistent with the Java implementation.
- Similar to 2, I want to keep the API name consistent with Java. And you are right, the name could be either branch or tag. I'm OK with the argument name, but I prefer to keep the function name consistent with the Java API.
```rust
/// Append a snapshot to the specified branch.
/// If branch is not specified, the snapshot is appended to the main branch.
/// The `ref` must already exist. Retention settings from the `ref` are re-used.
```
I think in Iceberg's terms a tag is a special branch, while `ref` typically means `SnapshotRef`, which includes more things like retention time, etc. I've updated it to "`branch` or `tag`"; what do you think?
@liurenjie1024 for …
Co-authored-by: Renjie Liu <[email protected]>
Slept over it and …
```rust
/// Remove snapshots by its ids from the table metadata.
/// Does nothing if a snapshot id is not present.
/// Keeps as changes only the snapshots that were actually removed.
pub fn remove_snapshots(mut self, snapshot_ids: &[i64]) -> Self {
```
Currently, when removing snapshots, we might have `parent_snapshot_id`s of other snapshots pointing to non-existing snapshots. This follows the behavior in Java.
Is this desirable? Or would it be more correct to set them to `None`, as there is no parent available anymore?
@Fokko
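The alternative raised here (clearing dangling parents instead of leaving them, as Java does) could look roughly like this. `Snapshot` is a simplified stand-in; the real type carries far more state:

```rust
// Simplified stand-in for a snapshot: just its id and optional parent.
struct Snapshot {
    id: i64,
    parent_snapshot_id: Option<i64>,
}

// Remove the given snapshot ids, then clear parent pointers on the
// survivors whose parent was just removed, so no dangling ids remain.
fn remove_snapshots(snapshots: &mut Vec<Snapshot>, ids: &[i64]) {
    snapshots.retain(|s| !ids.contains(&s.id));
    for s in snapshots.iter_mut() {
        if let Some(parent) = s.parent_snapshot_id {
            if ids.contains(&parent) {
                // No parent available anymore.
                s.parent_snapshot_id = None;
            }
        }
    }
}
```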
```rust
/// # Errors
/// - The ref is unknown.
/// - Any of the preconditions of `self.add_snapshot` are not met.
pub fn append_snapshot(self, snapshot: Snapshot, ref_name: Option<&str>) -> Result<Self> {
```
Sorry for the confusion; I meant deprecating the `append_snapshot` method in `TableMetadata` and planning to remove it in a future release. I think adding a `#[deprecated]` annotation would be enough, since cargo will warn about this. Also, I think we should remove this method from `TableMetadataBuilder`?
```rust
#[test]
fn test_check_last_assigned_partition_id() {
    let metadata = metadata();

    println!("{:?}", metadata.last_partition_id);
```
Remove this debug statement?
```rust
pub fn accessor_by_field_id(&self, field_id: i32) -> Option<Arc<StructAccessor>> {
    self.field_id_to_accessor.get(&field_id).cloned()
}

/// Check if this schema is identical to another schema semantically - excluding schema id.
pub(crate) fn is_same_schema(&self, other: &SchemaRef) -> bool {
```
Why not implement the `PartialEq` / `Eq` traits?
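One possible reason a dedicated method was chosen, per the doc comment above: a derived `PartialEq` would compare the schema id too, while the semantic check deliberately ignores it. A minimal sketch with a simplified stand-in `Schema`:

```rust
// Simplified stand-in: the real Schema holds fields, aliases, etc.
#[derive(PartialEq)]
struct Schema {
    schema_id: i32,
    fields: Vec<String>,
}

impl Schema {
    // Semantic equality: compare structure only, ignore schema_id.
    fn is_same_schema(&self, other: &Schema) -> bool {
        self.fields == other.fields
    }
}
```

Two schemas with identical fields but different ids are "the same schema" semantically, yet unequal under the derived trait, so the two notions cannot share one `PartialEq` impl.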
```rust
    }
}

impl From<TableMetadataBuildResult> for TableMetadata {
```
Why do we need this? The `metadata` field is already exposed as a public field.
This PR is now ready for first reviews.
Some remarks:

- For `add_sort_order` and `add_partition_spec`, the Java code re-builds the added sort order against the current schema by matching column names. This implementation currently does not do this. Adding this feature would require `PartitionSpec` (bound) to store the schema it was bound against (probably a good idea anyway) and splitting `SortOrder` into bound and unbound, where the bound `SortOrder` also stores the schema it was bound against. Instead, this implementation assumes that provided sort orders and partition specs are valid for the current schema. Compatibility with the current schema is tested.
- The `add_schema` method does not require a `new_last_column_id` argument. In Java there is a todo to achieve the same. I put my reasoning in a comment in the code; feel free to comment on it.
- `new()` behaviour now re-assigns field ids to start from 0. Some tests started from 1 before. Re-assigning the ids, just like in Java, ensures that fresh metadata always has fresh and correct ids, even if they are created manually or re-used from another metadata.