
Partition Evolution #245

Merged: 1 commit merged into apache:main on Feb 28, 2024

Conversation

amogh-jahagirdar (Contributor) commented Dec 29, 2023:

Fixes #193

amogh-jahagirdar (Contributor, Author):

There's still a lot of cleanup required, I need to add unit tests, and I'm still working through some bugs. But I'm putting up this draft since the core pieces are here.

(Several review threads on pyiceberg/partitioning.py and pyiceberg/table/__init__.py were marked outdated or resolved.)

Code under review:

    if not source_name:
        raise ValueError(f"Could not find column with id {field.source_id}")

    transform = field.transform

Contributor:

Nice to have: Do we also want to have a single dispatch to map these types?
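
As illustration, a minimal sketch of the single-dispatch idea using functools.singledispatch (transform_to_repr is a hypothetical name, not part of the PR; it assumes the transform classes from pyiceberg.transforms):

    from functools import singledispatch

    from pyiceberg.transforms import BucketTransform, IdentityTransform, Transform, TruncateTransform


    @singledispatch
    def transform_to_repr(transform: Transform) -> str:
        # Fallback for transform types without a registered handler.
        raise ValueError(f"Unsupported transform: {transform}")


    @transform_to_repr.register
    def _(transform: IdentityTransform) -> str:
        return "identity"


    @transform_to_repr.register
    def _(transform: BucketTransform) -> str:
        return f"bucket[{transform.num_buckets}]"


    @transform_to_repr.register
    def _(transform: TruncateTransform) -> str:
        return f"truncate[{transform.width}]"


    print(transform_to_repr(BucketTransform(num_buckets=16)))  # bucket[16]

Each new transform type then only needs one extra register call instead of another branch in an if/elif chain.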

Contributor:

Can you create an issue for this?

(More review threads on pyiceberg/table/__init__.py were marked outdated or resolved.)

Comment on lines 922 to 929:

    def last_partition_id(self) -> Optional[int]:
        """Return the highest assigned partition field ID across all specs for the table, or None if the table is unpartitioned and there are no specs."""
        if len(self.specs()) == 1 and self.spec().is_unpartitioned():
            return None
        return self.metadata.last_partition_id

amogh-jahagirdar (Contributor, Author):

@Fokko @HonahX I added this API to the Table since we'll need it in the implementation and we don't want to directly access TableMetadata. Let me know what you think.

Contributor:

I think we probably should update

    @property
    def last_assigned_field_id(self) -> int:
        if self.fields:
            return max(pf.field_id for pf in self.fields)
        return PARTITION_FIELD_ID_START

to return PARTITION_FIELD_ID_START - 1 for an unpartitioned spec. Then we can return the last_partition_id from the metadata directly, because the metadata should have last_partition_id=999 for an unpartitioned table.

The Java implementation uses PARTITION_FIELD_ID_START - 1 for an unpartitioned spec:
https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L344-L345
https://github.com/apache/iceberg/blob/9921937d8285dec9a19fd16b0cd82d451a8aca9e/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L319-L321

I checked locally that unpartitioned tables created by spark-iceberg-runtime have last_partition_id=999, while those created by pyiceberg have last_partition_id=1000.
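
A self-contained sketch of the suggested behavior (SpecSketch is a stand-in for illustration, not the real PartitionSpec class; PARTITION_FIELD_ID_START = 1000 as in pyiceberg.partitioning):

    from dataclasses import dataclass, field
    from typing import List

    PARTITION_FIELD_ID_START = 1000


    @dataclass
    class SpecSketch:
        field_ids: List[int] = field(default_factory=list)

        @property
        def last_assigned_field_id(self) -> int:
            if self.field_ids:
                return max(self.field_ids)
            # One less than the first assignable ID, so the next field gets 1000.
            return PARTITION_FIELD_ID_START - 1


    assert SpecSketch().last_assigned_field_id == 999
    assert SpecSketch([1000, 1001]).last_assigned_field_id == 1001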

amogh-jahagirdar (Contributor, Author) commented Feb 5, 2024:

I was thinking about that, but wasn't sure about breaking API behavior, which is why I added a new API. If we have the flexibility here to change the API, we should do that. I think we can, because arguably it's incorrect to return 1000 for an unpartitioned table, so it's really a fix.

Test code under review:

        update.remove_field("day_ts").remove_field("bucketed_id")
    with table_v2.update_spec() as update:
        update.add_field("str", TruncateTransform(2), "truncated_str")
    _validate_new_partition_fields(table_v2, 1002, 2, PartitionField(3, 1002, TruncateTransform(2), "truncated_str"))

amogh-jahagirdar (Contributor, Author) commented Feb 2, 2024:

This test shows why assigning new field IDs based on the last field ID across all specs is important to avoid collisions. In this case, if we just used the last spec, then after remove_field drops the original partition fields, the latest spec would be the unpartitioned spec. When we then add the new truncated_str partition field, we would assign it field ID 1000, which is not what we want (it would collide with the original field ID 1000 of the bucket transform on id).
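
To make the collision concrete, a small sketch contrasting the two assignment strategies (all names hypothetical; each spec is represented just by its list of field IDs):

    PARTITION_FIELD_ID_START = 1000

    # Field IDs per spec after the evolution above: spec 0 had two fields
    # (1000 and 1001); spec 1 is the unpartitioned spec left by remove_field.
    all_specs = [[1000, 1001], []]


    def next_id_from_latest_spec(specs):
        latest = specs[-1]
        return (max(latest) if latest else PARTITION_FIELD_ID_START - 1) + 1


    def next_id_across_all_specs(specs):
        all_ids = [fid for spec in specs for fid in spec]
        return (max(all_ids) if all_ids else PARTITION_FIELD_ID_START - 1) + 1


    print(next_id_from_latest_spec(all_specs))   # 1000, collides with the old bucket field
    print(next_id_across_all_specs(all_specs))   # 1002, matches the test expectation above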

amogh-jahagirdar force-pushed the partition-evolution branch 2 times, most recently from 12fd0f0 to 351ec9b, on February 2, 2024 at 14:46.
amogh-jahagirdar force-pushed the partition-evolution branch 2 times, most recently from 615bd93 to 7fbcc22, on February 2, 2024 at 15:43.

amogh-jahagirdar force-pushed the partition-evolution branch 2 times, most recently from 89e3bcb to 3e7e183, on February 6, 2024 at 06:07.
@@ -308,7 +308,8 @@ def construct_partition_specs(cls, data: Dict[str, Any]) -> Dict[str, Any]:
             data[PARTITION_SPECS] = [{"field-id": 0, "fields": ()}]

         data[LAST_PARTITION_ID] = max(
-            [field.get(FIELD_ID) for spec in data[PARTITION_SPECS] for field in spec[FIELDS]], default=PARTITION_FIELD_ID_START
+            [field.get(FIELD_ID) for spec in data[PARTITION_SPECS] for field in spec[FIELDS]],
+            default=PARTITION_FIELD_ID_START - 1,

amogh-jahagirdar (Contributor, Author):

This needs to be updated so that when there are no partition fields, we return 999. It's insufficient to just update the PartitionSpec#last_assigned_field_id method. I do believe this is spec compliant, since the spec doesn't explicitly say what values these IDs should be, and it is also what Spark does when one creates an unpartitioned table. The spec does say that in v1, IDs were assigned starting at 1000, which is still followed, so I think we're covered. @Fokko @HonahX
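
The change hinges on Python's max(iterable, default=...): for an unpartitioned table the comprehension over partition fields is empty, so the default is returned. A quick sketch with the same constant:

    PARTITION_FIELD_ID_START = 1000

    # Unpartitioned table: no fields in any spec, so the default applies.
    assert max([], default=PARTITION_FIELD_ID_START - 1) == 999

    # Partitioned table: the highest existing field ID wins.
    assert max([1000, 1001], default=PARTITION_FIELD_ID_START - 1) == 1001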

Contributor:

Ah, I see. So the next one will be 1000. It feels a bit like a workaround; let me check.

@@ -277,7 +277,7 @@ def test_create_table(table_schema_simple: Schema, hive_database: HiveDatabase,
         )
     ],
     current_schema_id=0,
-    last_partition_id=1000,
+    last_partition_id=999,

Contributor:

I think this one is a bit odd since it is a V2 table. I had to dig into the code a bit myself as well. I noticed that last_partition_id is optional in the metadata. What do you think of the following solution: amogh-jahagirdar#1

Contributor:

I checked a V2 unpartitioned table created by spark-iceberg-runtime, and the last_partition_id stored in the metadata is 999 (screenshot of the metadata omitted). Therefore I suggested updating last_partition_id() in pyiceberg to align with the Java implementation.

In general, I think 999 is spec compliant since it is for the UnpartitionedSpec, where there is no existing partition field. It implies that 1000 will be the ID for the first valid partition field, which aligns with the spec. Does this sound reasonable? I'd appreciate your thoughts on this!

amogh-jahagirdar (Contributor, Author) commented Feb 26, 2024:

@Fokko I took a look and integrated the changes, and after going back and forth I'd like to keep the changes as is. My rationale is that even after those changes, more workarounds are needed to make sure new IDs start at 1000 (taking the changes directly, field IDs for non-REST catalogs would start at 1001 without further changes).

Now, technically it does not seem to be a hard requirement that IDs start at 1000 for v2 tables. Even for v1, starting at 1000 does not seem to be a requirement, rather just how the Java library was implemented. The spec says:

"In v1, partition field IDs were not tracked, but were assigned sequentially starting at 1000 in the reference implementation."

I read this as "this was how we originally implemented it in Java, but it is not really required so long as IDs are unique and increasing for new fields."

All that said, I'd advocate for following the practice of starting at 1000 for both v1 and v2, because it's the established model and avoids confusion.

On returning 999 for the unpartitioned spec:

As @HonahX alluded to, I think that makes for a logical API: since we want the first field to get ID 1000, the unpartitioned spec (no fields) should report a last assigned field ID one less than that. I don't think we want to return 1000 in that case. This is also what Spark sets for the unpartitioned spec, and it is spec compliant (since the spec doesn't mandate any particular IDs).

I could see the argument for returning None from last_assigned_partition_id, but that just shifts the responsibility to other locations to make sure IDs are set correctly according to our scheme, which ends up being messier IMO than defining the API to return 999 when the only spec is the unpartitioned spec (the value that would be set anyway).

I also think it makes sense to set last-assigned-partition-id for both V1 and V2. Even though it's optional for V1, we set the other optional fields for v1 metadata, so it seems odd to make this particular case an exception.

Contributor:

Sounds reasonable to me, and as long as it is in line with the spec, I'm okay with it 👍


Test code under review:

    @pytest.mark.integration
    @pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])

amogh-jahagirdar (Contributor, Author):

I've parameterized all the tests to run against both the Hive and REST catalogs.
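
A sketch of what one such parameterized test can look like (the test body, table name, and assertion are illustrative, not from the PR; catalog_hive and catalog_rest are the suite's fixtures, provided via the pytest-lazy-fixture plugin):

    import pytest

    from pyiceberg.catalog import Catalog
    from pyiceberg.transforms import BucketTransform


    @pytest.mark.integration
    @pytest.mark.parametrize('catalog', [pytest.lazy_fixture('catalog_hive'), pytest.lazy_fixture('catalog_rest')])
    def test_add_bucket_field(catalog: Catalog) -> None:
        # Assumes a pre-created table; the name is hypothetical.
        table = catalog.load_table("default.test_table")
        with table.update_spec() as update:
            update.add_field("id", BucketTransform(num_buckets=16), "bucketed_id")
        assert table.spec().fields[-1].name == "bucketed_id"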

Fokko merged commit 6a34421 into apache:main on Feb 28, 2024. 6 checks passed.

Fokko (Contributor) commented Feb 28, 2024:

Thanks @amogh-jahagirdar for working on this, and sorry for the long wait for the review. Thanks @HonahX for the review 🙌

himadripal pushed a commit to himadripal/iceberg-python referencing this pull request on Mar 1, 2024.

Closes: Add partition evolution (#193). 3 participants.