Construction of filenames for partitioned writes #453
Conversation
@jqin61 Nice! Thanks for working on this. It is getting late here, but this is on my list for tomorrow 👍
Force-pushed from 6e16690 to 7fcf75a
@dataclass(frozen=True)
class PartitionKey:
    raw_partition_field_values: List[PartitionFieldValue]
Spark builds a row accessor that takes an arrow table row and converts it to key values. The accessor seems a little unnecessary here since a partition field cannot be nested or a map/list, so this class just uses a naive list of field-value pairs. Willing to change it if this is inappropriate.
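For illustration, a minimal sketch of the naive field-value-pair approach; the shape of PartitionFieldValue is assumed from the diff above, not copied from the PR code:

from dataclasses import dataclass
from typing import Any, List

from pyiceberg.partitioning import PartitionField


@dataclass(frozen=True)
class PartitionFieldValue:
    field: PartitionField  # the partition spec field this raw value belongs to
    value: Any             # the raw, untransformed value taken from the arrow row


@dataclass(frozen=True)
class PartitionKey:
    raw_partition_field_values: List[PartitionFieldValue]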
[False],
Record(boolean_field=False),
"boolean_field=False",
# pyiceberg writes False while spark writes false, so justification (comparing the expected value with spark behavior) would fail.
Skip justification (set spark_create_table_sql_for_justification and spark_data_insert_sql_for_justification to None) since it would fail: spark writes the hive partition path as 'false' while pyiceberg writes it as 'False'. Shall we align with spark?
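A small plain-Python illustration of where the capitalization difference comes from:

# Python's str() capitalizes booleans, which leaks into the partition path.
assert str(False) == "False"          # what pyiceberg currently writes
assert str(False).lower() == "false"  # what spark writes in the hive partition path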
"float_field=3.14", | ||
# spark writes differently as pyiceberg, Record[float_field=3.140000104904175], path:float_field=3.14 (Record has difference) | ||
# so justification (compare expected value with spark behavior) would fail. | ||
None, |
For a partitioned column with a float/double value of 3.14, spark-iceberg records the partition in the manifest entry as Record[float_field=3.140000104904175], while pyiceberg records it as Record[float_field=3.14].
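The extra digits are what you get from single-precision rounding; a standard-library snippet illustrating the assumption that spark is round-tripping the value through a 32-bit float:

import struct

# Round-tripping 3.14 through a 32-bit float yields the value spark records in the manifest.
as_float32 = struct.unpack("f", struct.pack("f", 3.14))[0]
print(as_float32)  # 3.140000104904175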
[PartitionField(source_id=11, field_id=1001, transform=IdentityTransform(), name="binary_field")],
[b'example'],
Record(binary_field=b'example'),
"binary_field=ZXhhbXBsZQ%3D%3D",
spark-iceberg replaces '=' with '%3D', ':' with '%3A' (and applies other URL replacements) in the hive partition path. To conform to this, the PR currently applies urllib.parse.quote() after to_human_string().
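For example, the expected value above can be reproduced with the standard library; whether to_human_string produces exactly this base64 form is inferred from the test data above:

import base64
from urllib.parse import quote

human = base64.b64encode(b"example").decode()  # 'ZXhhbXBsZQ==', the human-readable form for binary
print(quote(human))                            # 'ZXhhbXBsZQ%3D%3D' -- '=' escaped as %3D, matching spark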
"timestamp_field=2023-01-01T12%3A00%3A00", | ||
# spark writes differently as pyiceberg, Record[timestamp_field=1672574400000000] path:timestamp_field=2023-01-01T12%3A00Z (the Z is the difference) | ||
# so justification (compare expected value with spark behavior) would fail. | ||
None, |
spark-iceberg writes the hive partition path for a timestamp so that it ends with 'Z', while pyiceberg currently writes it without the 'Z'.
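A quick illustration of the difference, using the isoformat-based formatting quoted later in this thread:

from datetime import datetime, timedelta

EPOCH_TIMESTAMP = datetime(1970, 1, 1)
micros = 1_672_574_400_000_000  # 2023-01-01T12:00:00 UTC, matching the Record value above

# Current pyiceberg-style formatting: no trailing 'Z'
print((EPOCH_TIMESTAMP + timedelta(microseconds=micros)).isoformat())        # 2023-01-01T12:00:00

# spark-iceberg ends the path value with 'Z' (shown here before URL escaping)
print((EPOCH_TIMESTAMP + timedelta(microseconds=micros)).isoformat() + "Z")  # 2023-01-01T12:00:00Z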
spark_path_for_justification = (
    snapshot.manifests(iceberg_table.io)[0].fetch_manifest_entry(iceberg_table.io)[0].data_file.file_path
)
assert spark_partition_for_justification == expected_partition_record
This is to justify that the expected path and expected partition come from existing spark behaviors.
First of all, thanks for working on this. Secondly, I appreciate the nicely constructed PR description with the nice summary of questions:
For boolean type partitions, spark writes the hive partition part of the path as "field=false/true" while Pyiceberg (from the current underlying utilities) writes it as "field=False/True". This difference comes from Python booleans being capitalized.
I think it is better to lowercase the true/false in this instance. Another good source of information on how to handle these things is in the Literal class, where we ignore the casing when converting a string to a boolean.
Spark writes the path conforming to URL format, meaning that in the value part after 'field=', any '=' is replaced by "%3D", ':' by "%3A", etc. Shall we apply urllib.parse.quote to conform to spark behavior?
I think that's a good idea! 👍 I'm not sure what the complete background is, but I don't think we want to pass everything unescaped into the path:
python3
Python 3.11.7 (main, Dec 4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib import parse
>>> parse.quote("😊")
'%F0%9F%98%8A'
For timestamp(tz) type, spark writes the hive partition part of the path as "2023-01-01T12%3A00%3A01Z", with %3A representing the ':', the timestamp ends with Z while existing Pyiceberg utilities use
(EPOCH_TIMESTAMP + timedelta(microseconds=timestamp_micros)).isoformat()
to write, which does not have 'Z'
I would expect the timestamptz to be stored with a Z, and timestamp without a Z. Without the Z would mean local time, see https://en.wikipedia.org/wiki/ISO_8601.
I did a quick check, and this seems to be the case here as well. We must follow this on the PyIceberg side as well.
For float and double
A partitioned float field with a value of 3.14 would end up in the manifest entry in the manifest file as Record[double_field=3.140000104904175]. So far for Pyiceberg, we are doing it as Record[double_field=3.14], which I think is better.
I agree that 3.14 is better. For the evaluator itself, we should have proper tests around that, but that's outside of the scope of this PR.
@jqin61 I wanted to do a second round, but I think you forgot to push? :)
Hi Fokko, sorry for the delayed push of the fixes. It took a little time to think through how to use the literal function. I made the changes according to your last comment except for the literal one. I think the literal function and Literal class might not solve the issue I am encountering in the PartitionKey class - I need some utilities to convert a python datetime to micros and a python date to days. The literal() function can only take in python primitive types, and it seems wrong to extend it to take datetime/date and return TimestampLiteral and DateLiteral.

Also, thanks for pointing out that timestamp in iceberg corresponds to timestamp_ntz in spark. I added tests for it and discovered some new discrepancies. I am not sure whether the path generation behavior is part of iceberg's spec and whether we should work towards making it exactly the same between spark-iceberg and Pyiceberg - it seems that as long as the partition in the manifest entry is correct, query planning could leverage the partition info to do pruning.
pyiceberg/partitioning.py (outdated)
@singledispatch
def _to_iceberg_type(type: IcebergType, value: Any) -> Any:
I think the name of this function, _to_iceberg_type, and the variable iceberg_typed_value are causing us a bit of confusion. It looks like what we are trying to do is convert a date or datetime value to its respective epoch value (days from epoch, or microseconds from epoch), so that it can be used as an integer in this line:
transformed_value = partition_field.transform.transform(iceberg_type)(iceberg_typed_value)
Should we call this variable epoch (instead of iceberg_typed_value) and change this function name to _to_epoch? We can keep the conversion functions as we currently have them.
For types other than date and datetime, will it seem weird to call _to_epoch on them?
The int is the internal value that we use to store a datetime/date/time; another one is the uuid, where we accept a UUID and convert it to a string (I believe, I can check with Trino quickly).
Let me add uuid to the function dispatching. Also let me rename the function to _to_iceberg_internal_representation()?
I made the changes in commit 1a48d83
Some final small comments, but apart from that it looks good 👍
pyiceberg/partitioning.py (outdated)
field_strs = []
value_strs = []
for pos, value in enumerate(data.record_fields()):
    partition_field = self.fields[pos]  # partition field
Suggested change:
-    partition_field = self.fields[pos]  # partition field
+    partition_field = self.fields[pos]
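For context, a rough sketch of how such a loop could assemble the hive-style path segment; the real partition_to_path takes the partition Record and the schema (as discussed further down), so the signature and helper here are illustrative only:

from typing import List
from urllib.parse import quote


def partition_to_path(field_names: List[str], human_values: List[str]) -> str:
    # Pair each partition field name with its human-readable value and
    # URL-escape both sides so '=', ':' etc. match spark's path encoding.
    field_strs = [quote(name, safe="") for name in field_names]
    value_strs = [quote(value, safe="") for value in human_values]
    return "/".join(f"{f}={v}" for f, v in zip(field_strs, value_strs))


# e.g. partition_to_path(["binary_field"], ["ZXhhbXBsZQ=="]) -> 'binary_field=ZXhhbXBsZQ%3D%3D'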
pyiceberg/partitioning.py (outdated)
@singledispatch
def _to_iceberg_internal_representation(type: IcebergType, value: Any) -> Any:
To avoid confusion later on, can we change this name to _to_partition_representation? The internal representation of a UUID is bytes and not str.
definitely
I checked this line after a spark write into iceberg for a table partitioned on a uuid-type column:
snapshot.manifests(iceberg_table.io)[0].fetch_manifest_entry(iceberg_table.io)[0].data_file.partition
and got
spark_partition_for_justification=Record[uuid_field='f47ac10b-58cc-4372-a567-0e02b2c3d479']
so it looks like it is a string representation in data_file.partition?
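To make the dispatching concrete, a minimal sketch of what such a conversion function could look like. This is a sketch only: it assumes dispatch on the IcebergType instance as in the diff above, follows the UUID-to-str observation just made, and treats naive timestamps as UTC for simplicity.

from datetime import date, datetime, timezone
from functools import singledispatch
from typing import Any
from uuid import UUID

from pyiceberg.types import DateType, IcebergType, TimestampType, TimestamptzType, UUIDType

EPOCH_DATETIME = datetime(1970, 1, 1, tzinfo=timezone.utc)
EPOCH_DATE = date(1970, 1, 1)


@singledispatch
def _to_partition_representation(iceberg_type: IcebergType, value: Any) -> Any:
    # Primitives (int, str, bool, Decimal, ...) already match their partition representation.
    return value


@_to_partition_representation.register(TimestampType)
@_to_partition_representation.register(TimestamptzType)
def _(iceberg_type: IcebergType, value: datetime) -> int:
    # Timestamps are carried as microseconds from epoch before the transform is applied.
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)  # simplification for this sketch
    delta = value - EPOCH_DATETIME
    return delta.days * 86_400_000_000 + delta.seconds * 1_000_000 + delta.microseconds


@_to_partition_representation.register(DateType)
def _(iceberg_type: IcebergType, value: date) -> int:
    # Dates are carried as days from epoch.
    return (value - EPOCH_DATE).days


@_to_partition_representation.register(UUIDType)
def _(iceberg_type: IcebergType, value: UUID) -> str:
    # Matches the string form observed in data_file.partition for the spark-written table.
    return str(value)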
iceberg_typed_value = _to_iceberg_internal_representation(iceberg_type, raw_partition_field_value.value)
transformed_value = partition_field.transform.transform(iceberg_type)(iceberg_typed_value)
iceberg_typed_key_values[partition_field.name] = transformed_value
return Record(**iceberg_typed_key_values)
We're now getting into the realm of premature optimization, but ideally you don't need to set the names of the keys. The concept of a Record is that it only contains the data. Just below, in self.partition_spec.partition_to_path(self.partition, self.schema), you can see that you pass in both the partition and the schema itself. The positions in the schema should match the data.
Hi @Fokko, thanks for the guidance! My intention in adding the keys is that this PartitionKey.partition is not only used for generating the file path but also used to initialize DataFile.partition in io.pyarrow.write_file(). As the integration test shows,
snapshot.manifests(iceberg_table.io)[0].fetch_manifest_entry(iceberg_table.io)[0].data_file.partition
prints
Record(timestamp_field=1672574401000000)
so I assume this data_file.partition is a Record with keys.
Let me know what you think about it, thank you!
Force-pushed from 1a48d83 to 1034823
rebased; removed the comment; renamed the ambiguous function name
Let's move this forward, thanks for working on this 👍
* PartitionKey Class And Tests
* fix linting; add decimal input transform test
* fix bool to path lower case; fix timestamptz tests; other pr comments
* clean up
* add uuid partition type
* clean up; rename ambiguous function name
Scope
Add a PartitionKey class which computes the transformed partition Record from the raw partition field values and generates the hive-style partition path used when constructing data file names.
In terms of how PartitionKey is used, please check the PR for [partitioned write support](#353). I separated this PR out of the partitioned write to make the latter more manageable, but am willing to combine the two if suggested.
Tests
Object Under Test:
To compare against spark, the expected path and expected partition are justified against two counterpart spark SQL statements that create the partitioned table and insert the data.
With such justifications, we found these discrepancies between the underlying utility functions in Pyiceberg and the existing spark behavior:
For boolean type partitions, spark writes the hive partition part of the path as "field=false/true" while Pyiceberg (from the current underlying utilities) writes it as "field=False/True".
Spark writes the path conforming to URL format, meaning that in the value part after 'field=', any '=' is replaced by "%3D", ':' by "%3A", etc.
For timestamp(tz) type, spark writes the hive partition part of the path ending with 'Z', while the existing Pyiceberg utilities use
(EPOCH_TIMESTAMP + timedelta(microseconds=timestamp_micros)).isoformat()
to write, which does not have 'Z'.
For float and double, a partitioned float field with a value of 3.14 would end up in the manifest entry in the manifest file as Record[double_field=3.140000104904175]. So far for Pyiceberg, we are doing it as Record[double_field=3.14], which I think is better.
For these discrepancies, should we conform to spark's behaviors?