
Support partitioning spec during data file rewrites in Spark. #11368

Draft · wants to merge 1 commit into base: main

Conversation


@rdsarvar rdsarvar commented Oct 21, 2024

Description

Currently, data file rewrites support specifying the output spec ID to be used. This change adds the ability to provide a partition spec itself and have it added as a non-default spec if it does not already exist on the table.

Edit: I've added a GitHub issue to track this potential improvement here: #11459

Benefits

These changes would make it simpler to tier partition granularity by time range. As an example: say your table is heavily used and queries mostly target the most recent data, but you still want people to be able to query back in time. You could achieve additional performance improvements by applying more granular partitions in the base table and then running a compaction job in tiers (see the sketch after this list):

  1. Short term compaction (reuses the table's default spec - high granularity; remove as many small files as possible)
  2. Long term compaction (uses a specified partition spec that is not the default - lower granularity; cuts down the metadata stored for the table)
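
For illustration, a minimal sketch of what the two tiers could look like with the rewrite action's existing output spec ID option. The timestamp column `event_ts`, the cutoff date, and the coarser spec ID `2` are all assumptions, not part of this PR:

```java
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.spark.actions.SparkActions;

// 'spark' (SparkSession) and 'table' (Iceberg Table) are assumed in scope.

// Tier 1: short term compaction over recent data, using the table's
// default (fine-grained) spec.
SparkActions.get(spark)
    .rewriteDataFiles(table)
    .filter(Expressions.greaterThanOrEqual("event_ts", "2024-10-01T00:00:00"))
    .execute();

// Tier 2: long term compaction, rewriting older data into a coarser,
// non-default spec (assumed here to already exist with spec ID 2).
SparkActions.get(spark)
    .rewriteDataFiles(table)
    .filter(Expressions.lessThan("event_ts", "2024-10-01T00:00:00"))
    .option(RewriteDataFiles.OUTPUT_SPEC_ID, "2")
    .execute();
```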

Notes for Reviewers

Note: this is definitely not complete and I am open to all feedback, whether some of this functionality already exists elsewhere or it should be done differently.

The part I'm mostly iffy on is modifying BaseUpdatePartitionSpec.java via table.updateSpec() instead of having something like table.addSpec(partitionSpec).addNonDefaultSpec().commit().

Integer partitionSpecId =
    checkAndPreparePartitionSpec(
        table, partitionedByString, createPartitionIfNotExists, options);
options.put(OUTPUT_SPEC_ID, partitionSpecId.toString());
@rdsarvar (Author):
nitpick: I guess I could set this inside of the checkAndPreparePartitionSpec method

 * @param newSpec partition spec to override what the builder uses during commit
 * @return this for method chaining
 */
default UpdatePartitionSpec useSpec(PartitionSpec newSpec) {
@rdsarvar (Author):
This feels... hacky. Question for folks reading this: would it be overkill to support an AddPartitionSpec operation instead of relying on the UpdatePartitionSpec?
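
For reference, the flow this hook would enable looks roughly like the following. Note that useSpec is the method proposed in this PR, not an existing API, and buildSpecFromOptions is a hypothetical helper:

```java
// Hypothetical usage of the proposed useSpec hook: swap the builder's
// state for a fully-specified spec, then commit it as a non-default spec.
PartitionSpec newSpec = buildSpecFromOptions(table.schema()); // hypothetical helper
table.updateSpec()
    .useSpec(newSpec)
    .addNonDefaultSpec()
    .commit();
```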

Contributor:
I agree, this feels hacky. I'm also not convinced it goes through the right validations. We probably want to walk/diff the specs and make the necessary updates.

@rdsarvar (Author):
> We probably want to walk/diff the specs and make the necessary updates.

Are you thinking something simple like:

  • Iterate through existing partition spec and removeField all fields
  • Iterate through new spec and addField all fields

Or were you thinking something like:

  • Run 'old DIFF new' to find the fields to removeField against
  • Iterate through the 'new' spec and check if the current spec has the field, if not then add it

Though this approach wouldn't guarantee the partition ordering of terms, right? If I were the end user, I'd expect the added spec to match the ordering I provided exactly (see the sketch after this comment).

^ This last point is mostly why I was thinking a 'replace' functionality would make a bit more sense than an 'update', but I don't think I'm ramped up enough yet on the repo and its historical decisions 😄

What are your thoughts on it?
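
For concreteness, a rough sketch of the second (diff-based) option using only the existing UpdatePartitionSpec API. It matches fields by partition field name and handles identity transforms only, so it illustrates the flow rather than being a complete implementation:

```java
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.iceberg.PartitionField;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Table;
import org.apache.iceberg.UpdatePartitionSpec;

// Sketch: evolve the table's current spec toward targetSpec by diffing
// partition field names. Rebuilding non-identity transforms as Terms for
// addField is elided, and the resulting field ordering is NOT guaranteed
// to match targetSpec, which is exactly the concern raised above.
static void evolveSpecTowards(Table table, PartitionSpec targetSpec) {
  Set<String> currentNames =
      table.spec().fields().stream().map(PartitionField::name).collect(Collectors.toSet());
  Set<String> targetNames =
      targetSpec.fields().stream().map(PartitionField::name).collect(Collectors.toSet());

  UpdatePartitionSpec update = table.updateSpec();

  // Remove fields that exist in the current spec but not in the target.
  for (PartitionField field : table.spec().fields()) {
    if (!targetNames.contains(field.name())) {
      update.removeField(field.name());
    }
  }

  // Add target fields missing from the current spec (identity only here).
  for (PartitionField field : targetSpec.fields()) {
    if (!currentNames.contains(field.name())) {
      update.addField(targetSpec.schema().findColumnName(field.sourceId()));
    }
  }

  update.commit();
}
```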

@amogh-jahagirdar (Contributor) left a comment:

Thanks @rdsarvar, the part I'm a bit confused about is why we need a new useSpec API. I think the use case you described could be solved by adding a new spec without setting it as the default (which we recently added support for). Then the compaction could be performed using the spec ID of that added spec.
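
A hedged sketch of this suggested flow. The column and field names, the cutoff date, and the "highest spec ID is the newest" lookup are all assumptions:

```java
import java.util.Collections;
import org.apache.iceberg.actions.RewriteDataFiles;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.spark.actions.SparkActions;

// 'spark' (SparkSession) and 'table' (Iceberg Table) are assumed in scope.

// 1) Add a coarser monthly spec WITHOUT making it the default
//    (assumes the current spec has a daily field named "event_ts_day").
table.updateSpec()
    .removeField("event_ts_day")
    .addField("event_ts_month", Expressions.month("event_ts"))
    .addNonDefaultSpec()
    .commit();

// 2) Assume the highest spec ID belongs to the spec just added.
int coarseSpecId = Collections.max(table.specs().keySet());

// 3) Compact older data into that spec by ID.
SparkActions.get(spark)
    .rewriteDataFiles(table)
    .filter(Expressions.lessThan("event_ts", "2024-01-01T00:00:00"))
    .option(RewriteDataFiles.OUTPUT_SPEC_ID, String.valueOf(coarseSpecId))
    .execute();
```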


rdsarvar commented Nov 5, 2024

> Thanks @rdsarvar, the part I'm a bit confused about is why we need a new useSpec API. I think the use case you described could be solved by adding a new spec without setting it as the default (which we recently added support for). Then the compaction could be performed using the spec ID of that added spec.

I was looking at the public APIs, and there doesn't seem to be an available method that would allow me to parse and directly add a new partition spec. Am I missing an API that would simplify this? If I were to use the updateSpec functionality with addNonDefaultSpec, then I'd need to iteratively removeField + addField over the partition spec provided in the action.

My thinking was that, between transactions and the methods provided in the Table interface, we'd still need some form of addSpec functionality, since there's only updateSpec right now.

Let me know if that makes sense or if I'm missing something 😄
