RewriteDataFiles: Support custom partition spec during compaction #11459

Open
1 of 3 tasks
rdsarvar opened this issue Nov 4, 2024 · 1 comment
Labels
improvement PR that improves existing functionality

Comments

@rdsarvar

rdsarvar commented Nov 4, 2024

Feature Request / Improvement

Context

Similar to how we can provide an explicit sort ordering OR rely on the existing sorting of the table, I would like to propose that compaction support explicit partition specifications.

Currently, compaction lets you specify the ID of the partition spec you want to use through the options map. This is useful for writing compacted files with a different partitioning, but it comes with the caveat that the partition spec must already have been applied to the table AND you must manually find its spec ID and pass it in.

The meat of the request is:

  • Support a partition specification being configured during compaction operations (not just the ID)
    • If the partition spec provided does not exist then add it as a non-default partition spec to the table
    • If the partition spec provided exists already, use it
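The two bullets above amount to a "look up, else register" step. A minimal sketch of that decision follows, using plain Python stand-ins rather than Iceberg classes: `TableSpecs`, `PartitionFields`, and `resolve_output_spec` are hypothetical names chosen for illustration, and the real implementation may assign spec IDs differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical model: a spec is a tuple of (source column, transform) pairs.
PartitionFields = Tuple[Tuple[str, str], ...]


@dataclass
class TableSpecs:
    """Illustrative stand-in for the specs tracked in table metadata."""
    specs: Dict[int, PartitionFields] = field(default_factory=dict)
    default_spec_id: int = 0


def resolve_output_spec(table: TableSpecs, requested: PartitionFields) -> int:
    """Return the ID of `requested`, registering it as non-default if absent."""
    # Case 1: the spec already exists on the table, so reuse its ID.
    for spec_id, existing in table.specs.items():
        if existing == requested:
            return spec_id
    # Case 2: the spec is new. Add it under a fresh ID and leave
    # default_spec_id untouched, so regular writes keep the current spec.
    new_id = max(table.specs, default=-1) + 1
    table.specs[new_id] = requested
    return new_id


active = (("event_ts", "hour"), ("tenant", "identity"))
archival = (("event_ts", "month"),)

table = TableSpecs(specs={0: active}, default_spec_id=0)
archival_id = resolve_output_spec(table, archival)
```

Calling `resolve_output_spec` a second time with the same spec returns the same ID without adding another entry, which is the behavior a repeated compaction job would rely on.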

Motivation

In some cases folks want to be able to support partition tiering for long term storage. As an example:

  • The last 60 days of data use partition spec X (let's say ~1000 partitions per day)
  • Data older than 60 days uses partition spec Y (let's say ~2 partitions per day)

This lets us accept larger metadata for recent partitions while shrinking the metadata for long-term storage, so that we can keep a single table over the long term instead of having to use multiple tables with a view sitting on top.

Technically this is already possible: first initialize the table with your 'archival' partition spec so that it gets an ID, then swap to your 'active' partition spec. The user can then grab the archival spec's ID and pass it in through the options, but this is an inconvenient process compared with just providing the spec and having Iceberg decide what actions to take.
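For comparison, the manual workaround can be sketched as a lookup that fails when the spec was never applied to the table. This is an illustration only: `output_spec_options` is a hypothetical helper, the specs are modeled as a plain dict, and the `output-spec-id` key is my understanding of the option name this issue refers to, so check it against the Iceberg version in use.

```python
from typing import Dict, Tuple

# Hypothetical model: a spec is a tuple of (source column, transform) pairs.
PartitionFields = Tuple[Tuple[str, str], ...]


def output_spec_options(
    specs: Dict[int, PartitionFields], wanted: PartitionFields
) -> Dict[str, str]:
    """Build the compaction options map for a spec that must already exist."""
    for spec_id, fields in specs.items():
        if fields == wanted:
            # Assumed option key; option values are passed as strings.
            return {"output-spec-id": str(spec_id)}
    # Today's limitation: a spec that was never applied has no ID to pass.
    raise LookupError("partition spec has never been applied to this table")


archival = (("event_ts", "month"),)
active = (("event_ts", "hour"), ("tenant", "identity"))

# The table was created with the archival spec (ID 0), then switched to active.
known_specs = {0: archival, 1: active}
options = output_spec_options(known_specs, archival)
```

The `LookupError` branch is the gap the request targets: with the proposed behavior, an unknown spec would be registered instead of rejected.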

Query engine

Spark

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time
@rdsarvar rdsarvar added the improvement PR that improves existing functionality label Nov 4, 2024
@rdsarvar
Author

rdsarvar commented Nov 4, 2024

I've provided a draft PR with a sample solution here: #11368, but I'm open to feedback (or to throwing out that PR) if there's a cleaner way to implement this.

Note: It's nowhere near mergeable, but it provides the gist of one implementation.

@rdsarvar rdsarvar changed the title RewriteDataFiles: Support declarative partition spec during compaction RewriteDataFiles: Support custom partition spec during compaction Nov 4, 2024