Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support create partition table for non REST catalog #577

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

FANNG1
Copy link
Contributor

@FANNG1 FANNG1 commented Aug 23, 2024

fixes: #578

@FANNG1 FANNG1 marked this pull request as draft August 23, 2024 15:20
@FANNG1 FANNG1 marked this pull request as ready for review August 24, 2024 09:00
@FANNG1 FANNG1 changed the title [SIP] support create partition table for non REST catalog feat: support create partition table for non REST catalog Aug 24, 2024
@liurenjie1024
Copy link
Contributor

Hi, @FANNG1 Thanks for your contribution. The reason why we use UnboundPartitionSpec rather PartitionSpec is to simplify the usage of this method. PartitionSpec is bound to a schema, and spec id, field ids are supposed to be meaningful. These ids are not supposed to be passed by user, but by the catalog implementation.

@FANNG1
Copy link
Contributor Author

FANNG1 commented Aug 30, 2024

Hi, @FANNG1 Thanks for your contribution. The reason why we use UnboundPartitionSpec rather PartitionSpec is to simplify the usage of this method. PartitionSpec is bound to a schema, and spec id, field ids are supposed to be meaningful. These ids are not supposed to be passed by user, but by the catalog implementation.

I agree that PartitionSpec is bound to a schema which is not user friendly, but when building UnboundPartitionSpec, we need source column id not name, it seems not friendly too? How about Using a separate TableRequestBuilder to build the TableCreation, in which PartitionSpec is generated by a more user friendly interfaces, WDYT?

pub struct TableCreation {
    /// The name of the table.
    pub name: String,
    /// The location of the table.
    #[builder(default, setter(strip_option))]
    pub location: Option<String>,
    /// The schema of the table.
    pub schema: Schema,
    /// The partition spec of the table, could be None.
    #[builder(default, setter(strip_option, into))]
    pub partition_spec: Option<PartitionSpec>,
    /// The sort order of the table.
    #[builder(default, setter(strip_option))]
    pub sort_order: Option<SortOrder>,
    /// The properties of the table.
    #[builder(default)]
    pub properties: HashMap<String, String>,
}

@liurenjie1024
Copy link
Contributor

The reason we use PartitionSpec was that some fields, such as spec id, partition field id should not be passed by user when creating a table. But I think you are right when a user wants to create partition spec of a table, it would be more nature to build against a schema. This maybe confusing to user because their spec id and partition field id will be lost, it seems that there is no type safe approach to ensure, and we need to add some doc to explain it.

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @FANNG1 for this pr. However I think there are some prepartion work before we can actually finished this pr. If we can narrow down the goal of this pr to change type for UnboundPartitonSpec to PartitionSpec only, we can merge this.

"Can't create table with partition spec now",
))
}
Some(p) => HashMap::from([(p.spec_id(), Arc::new(p))]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect, we can't simply add a spec. We should use AddPartitionSpec transaction api.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I couldn't understand AddPartitionSpec transaction api, could you provide more context?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can refer to iceberg-python's implementation The whole table creation process is treated as a transaction, similar to transactions.

}

#[tokio::test]
async fn test_partition_table() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests dir is integration tests for iceberg crate. I think this is create partitioned table should apply to all catalogs, rather memory catalog only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean we should test again for all catalogs, like Hive, REST, JDBC?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it would be easier if we could finish #519 first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions:

  1. Please don't remove ANCHOR comments, those are for websites.
  2. It would be better to add another create partitioned table example rather modifying current one.

@FANNG1
Copy link
Contributor Author

FANNG1 commented Sep 2, 2024

Thanks @FANNG1 for this pr. However I think there are some prepartion work before we can actually finished this pr. If we can narrow down the goal of this pr to change type for UnboundPartitonSpec to PartitionSpec only, we can merge this.

@liurenjie1024 , thanks for your review, let's keep consistent with what means change type for UnboundPartitonSpec to PartitionSpec only. The PR consistent serval parts:

  1. Change UnboundPartitonSpec to PartitionSpec in TableCreation
  2. Use PartitionSpec to build the table metadata
  3. Integration tests about creating and showing partition table
  4. Examples about create partition table and read data from table.

What you mean is just keeping the first part?

@liurenjie1024
Copy link
Contributor

Thanks @FANNG1 for this pr. However I think there are some prepartion work before we can actually finished this pr. If we can narrow down the goal of this pr to change type for UnboundPartitonSpec to PartitionSpec only, we can merge this.

@liurenjie1024 , thanks for your review, let's keep consistent with what means change type for UnboundPartitonSpec to PartitionSpec only. The PR consistent serval parts:

  1. Change UnboundPartitonSpec to PartitionSpec in TableCreation
  2. Use PartitionSpec to build the table metadata
  3. Integration tests about creating and showing partition table
  4. Examples about create partition table and read data from table.

What you mean is just keeping the first part?

Yes, I think keeping only part 1 is a good start. Add PartitionSpec is sth how similar to a transaction with two actions:

  1. Add partition spec
  2. Set default partition spec.

I think it's better to implement them in transaction api.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support create partition table for non REST catalog
2 participants