Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Add Variant data type #11324

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion api/src/main/java/org/apache/iceberg/Schema.java
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,7 @@ public class Schema implements Serializable {
private static final int DEFAULT_SCHEMA_ID = 0;
private static final int DEFAULT_VALUES_MIN_FORMAT_VERSION = 3;
private static final Map<Type.TypeID, Integer> MIN_FORMAT_VERSIONS =
ImmutableMap.of(Type.TypeID.TIMESTAMP_NANO, 3);
ImmutableMap.of(Type.TypeID.TIMESTAMP_NANO, 3, Type.TypeID.VARIANT, 3);

private final StructType struct;
private final int schemaId;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -534,7 +534,8 @@ private static String sanitize(Type type, Object value, long now, int today) {
case DECIMAL:
case FIXED:
case BINARY:
// for boolean, uuid, decimal, fixed, and binary, match the string result
case VARIANT:
// for boolean, uuid, decimal, fixed, variant, and binary, match the string result
return sanitizeSimpleString(value.toString());
}
throw new UnsupportedOperationException(
Expand Down Expand Up @@ -562,7 +563,7 @@ private static String sanitize(Literal<?> literal, long now, int today) {
} else if (literal instanceof Literals.DoubleLiteral) {
return sanitizeNumber(((Literals.DoubleLiteral) literal).value(), "float");
} else {
// for uuid, decimal, fixed, and binary, match the string result
// for uuid, decimal, fixed, variant, and binary, match the string result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is okay but we may be missing information by not sanitizing based on the variant's type (i.e. date) and it would be nice to have some idea of the structure in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting something like

{
 foo:
   bar: 3 
   baz: 
      bozz: "flew"
 } 

to

{ 
  (hash-foo):
     (hash-bar) : (1 digit number)
     (hash-baz) :
        (hash-bozz) : (hash-xaxa)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be a nice feature but probably ok for it's own issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for helping me understand the concept. I have filed a followup issue #11479 and will work on separately.

return sanitizeSimpleString(literal.value().toString());
}
}
Expand Down
2 changes: 2 additions & 0 deletions api/src/main/java/org/apache/iceberg/transforms/Identity.java
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,8 @@ class Identity<T> implements Transform<T, T> {
*/
@Deprecated
public static <I> Identity<I> get(Type type) {
Preconditions.checkArgument(!type.isVariantType(), "Unsupported type for identity: %s", type);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we need to fix "canTransform" as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm I see that's already covered since "variant" is not considered a primitive

Copy link
Member

@RussellSpitzer RussellSpitzer Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preconditions.checkArgument(type.typeId() != Types.VariantType);

So we can avoid adding isVariantType to the interface

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since Variant is not a primitive type, canTransform() will be false.


return new Identity<>(type);
}

Expand Down
7 changes: 6 additions & 1 deletion api/src/main/java/org/apache/iceberg/types/Type.java
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,8 @@ enum TypeID {
DECIMAL(BigDecimal.class),
STRUCT(StructLike.class),
LIST(List.class),
MAP(Map.class);
MAP(Map.class),
VARIANT(Object.class);

private final Class<?> javaClass;

Expand Down Expand Up @@ -92,6 +93,10 @@ default boolean isListType() {
return false;
}

default boolean isVariantType() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used in a deprecated method, do we have any other reason to add this? I think it probably doesn't need to be apart of the type interface. We could always just check if the type is VARIANT

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. This is more a helper function instead of checking if the type is VARIANT. Actually it will be used in some other places later.

return false;
}

default boolean isMapType() {
return false;
}
Expand Down
1 change: 1 addition & 0 deletions api/src/main/java/org/apache/iceberg/types/TypeUtil.java
Original file line number Diff line number Diff line change
Expand Up @@ -534,6 +534,7 @@ private static int estimateSize(Type type) {
case FIXED:
return ((Types.FixedType) type).length();
case BINARY:
case VARIANT:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for this size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't have the accurate size for Variant similar to Binary. So I use the same value as Binary. I'm wondering how we come up with 80 for Binary.

return 80;
case DECIMAL:
// 12 (header) + (12 + 12 + 4) (BigInteger) + 4 (scale) = 44 bytes
Expand Down
29 changes: 29 additions & 0 deletions api/src/main/java/org/apache/iceberg/types/Types.java
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,7 @@ private Types() {}
.put(StringType.get().toString(), StringType.get())
.put(UUIDType.get().toString(), UUIDType.get())
.put(BinaryType.get().toString(), BinaryType.get())
.put(VariantType.get().toString(), VariantType.get())
.buildOrThrow();

private static final Pattern FIXED = Pattern.compile("fixed\\[\\s*(\\d+)\\s*\\]");
Expand Down Expand Up @@ -412,6 +413,34 @@ public String toString() {
}
}

public static class VariantType extends PrimitiveType {
private static final VariantType INSTANCE = new VariantType();

public static VariantType get() {
return INSTANCE;
}

@Override
public boolean isPrimitiveType() {
return false;
}

@Override
public boolean isVariantType() {
return true;
}

@Override
public TypeID typeId() {
return TypeID.VARIANT;
}

@Override
public String toString() {
return "variant";
}
}

public static class DecimalType extends PrimitiveType {
public static DecimalType of(int precision, int scale) {
return new DecimalType(precision, scale);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@
import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;

import org.apache.iceberg.exceptions.ValidationException;
import org.apache.iceberg.transforms.Transforms;
import org.apache.iceberg.types.Types;
import org.apache.iceberg.types.Types.NestedField;
Expand All @@ -34,7 +35,8 @@ public class TestPartitionSpecValidation {
NestedField.required(3, "another_ts", Types.TimestampType.withZone()),
NestedField.required(4, "d", Types.TimestampType.withZone()),
NestedField.required(5, "another_d", Types.TimestampType.withZone()),
NestedField.required(6, "s", Types.StringType.get()));
NestedField.required(6, "s", Types.StringType.get()),
NestedField.required(7, "v", Types.VariantType.get()));

@Test
public void testMultipleTimestampPartitions() {
Expand Down Expand Up @@ -312,4 +314,15 @@ public void testAddPartitionFieldsWithAndWithoutFieldIds() {
assertThat(spec.fields().get(2).fieldId()).isEqualTo(1006);
assertThat(spec.lastAssignedFieldId()).isEqualTo(1006);
}

@Test
public void testVariantUnsupported() {
assertThatThrownBy(
() ->
PartitionSpec.builderFor(SCHEMA)
.add(7, 1005, "variant_partition1", Transforms.bucket(5))
.build())
.isInstanceOf(ValidationException.class)
.hasMessage("Cannot partition by non-primitive source field: variant");
}
}
14 changes: 14 additions & 0 deletions api/src/test/java/org/apache/iceberg/transforms/TestBucketing.java
Original file line number Diff line number Diff line change
Expand Up @@ -417,6 +417,20 @@ public void testVerifiedIllegalNumBuckets() {
.hasMessage("Invalid number of buckets: 0 (must be > 0)");
}

@Test
public void testVariantUnsupported() {
assertThatThrownBy(() -> Transforms.bucket(Types.VariantType.get(), 3))
.isInstanceOf(IllegalArgumentException.class)
.hasMessage("Cannot bucket by type: variant");

Transform<Object, Integer> bucket = Transforms.bucket(3);
assertThatThrownBy(() -> bucket.bind(Types.VariantType.get()))
.isInstanceOf(IllegalArgumentException.class)
.hasMessage("Cannot bucket by type: variant");

assertThat(bucket.canTransform(Types.VariantType.get())).isFalse();
}

private byte[] randomBytes(int length) {
byte[] bytes = new byte[length];
testRandom.nextBytes(bytes);
Expand Down
18 changes: 18 additions & 0 deletions api/src/test/java/org/apache/iceberg/transforms/TestIdentity.java
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
package org.apache.iceberg.transforms;

import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;

import java.math.BigDecimal;
import java.nio.ByteBuffer;
Expand Down Expand Up @@ -155,4 +156,21 @@ public void testBigDecimalToHumanString() {
.as("Should not modify Strings")
.isEqualTo(decimalString);
}

@Test
public void testVariantUnsupported() {
assertThatThrownBy(() -> Transforms.identity().bind(Types.VariantType.get()))
.isInstanceOf(IllegalArgumentException.class)
.hasMessage("Cannot bind to unsupported type: variant");

assertThatThrownBy(() -> Transforms.fromString(Types.VariantType.get(), "identity"))
.isInstanceOf(IllegalArgumentException.class)
.hasMessage("Unsupported type for identity: variant");

assertThatThrownBy(() -> Transforms.identity(Types.VariantType.get()))
.isInstanceOf(IllegalArgumentException.class)
.hasMessage("Unsupported type for identity: variant");

assertThat(Transforms.identity().canTransform(Types.VariantType.get())).isFalse();
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -45,7 +45,8 @@ public void testIdentityTypes() throws Exception {
Types.TimestampNanoType.withZone(),
Types.StringType.get(),
Types.UUIDType.get(),
Types.BinaryType.get()
Types.BinaryType.get(),
Types.VariantType.get()
};

for (Type type : identityPrimitives) {
Expand Down
2 changes: 2 additions & 0 deletions api/src/test/java/org/apache/iceberg/types/TestTypes.java
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,8 @@ public void fromPrimitiveString() {

assertThat(Types.fromPrimitiveString("Decimal(2,3)")).isEqualTo(Types.DecimalType.of(2, 3));

assertThat(Types.fromPrimitiveString("Variant")).isEqualTo(Types.VariantType.get());

assertThatExceptionOfType(IllegalArgumentException.class)
.isThrownBy(() -> Types.fromPrimitiveString("Unknown"))
.withMessageContaining("Unknown");
Expand Down
16 changes: 16 additions & 0 deletions core/src/test/java/org/apache/iceberg/TestSortOrder.java
Original file line number Diff line number Diff line change
Expand Up @@ -337,6 +337,22 @@ public void testSortedColumnNames() {
assertThat(sortedCols).containsExactly("s.id", "data");
}

@TestTemplate
public void testVariantUnsupported() {
Schema v3Schema =
new Schema(
Types.NestedField.required(3, "id", Types.LongType.get()),
Types.NestedField.required(4, "data", Types.StringType.get()),
Types.NestedField.required(
5,
"struct",
Types.StructType.of(Types.NestedField.optional(6, "v", Types.VariantType.get()))));

assertThatThrownBy(() -> SortOrder.builderFor(v3Schema).withOrderId(10).asc("struct.v").build())
.isInstanceOf(IllegalArgumentException.class)
.hasMessage("Unsupported type for identity: variant");
}

@TestTemplate
public void testPreservingOrderSortedColumnNames() {
SortOrder order =
Expand Down
38 changes: 38 additions & 0 deletions core/src/test/java/org/apache/iceberg/TestTableMetadata.java
Original file line number Diff line number Diff line change
Expand Up @@ -1687,6 +1687,44 @@ public void testV3TimestampNanoTypeSupport() {
3);
}

@Test
public void testV3VariantTypeSupport() {
Copy link
Member

@RussellSpitzer RussellSpitzer Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is copying a lot of tests in this class but we should start future proofing a bit more imho. We also have tests around this sort of thing in TestSchema.java. I think it is probably ok to just keep all of our schema validation tests there, but it wouldn't hurt to have some redundancy here as well.

Refactoring the whole suite can come in another pr but I think we should build a templated test that's something like

@ParameterizedTest
@ValueSource(types = {Types.TimetstampNanos, Types.Variant, ....})
testTypeSupport(Type type) {
  Schema schemaWithType =  new Schema(
          Types.NestedField.required(1, "id", Types.LongType.get()),
          Types.NestedField.optional(2, type.name, type),
          Types.NestedField.optional(3, "arr", Types.ListType.ofRequired(4, type)),
          Types.NestedField.required(5, "struct", 
            Types.StructType.of(
                  Types.NestedField.optional(6, "inner_" + type.name, type),
                  Types.NestedField.required(7, "data", Types.StringType.get()))),
          Types.NestedField.optional(8, "struct_arr",
              Types.StructType.of(
                  Types.NestedField.optional(9, "ts", type))));
     
    //Psuedo code here 
    from 0 -> MIN_FORMAT_VERSION.get(type)
       fail to make metadata
       
    from MIN_FORMAT_VERSION.get(type) -> SUPPORTED_TABLE_VERSION
       succeed
}

The most important part about this is that we wouldn't have to continually update tests every time a new valid metadata version is added. It also would be much easier to test type compatibility. (I'm thinking that Geo is going to need the exact same thing soon)

For this PR I think it is enough to write a parameterized version for just Variant, then we could raise another PR to add in nanos and remove the redundant tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not this is part of my goal to remove all tests that have V3 or V2 in their title.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also todo add Variant to Schema.MIN_FORMAT_VERSION

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @RussellSpitzer for already creating a PR to address that. I will include that when it's merged.

Schema v3Schema =
new Schema(
Types.NestedField.required(3, "id", Types.LongType.get()),
Types.NestedField.required(4, "data", Types.StringType.get()),
Types.NestedField.required(
5,
"struct",
Types.StructType.of(Types.NestedField.optional(6, "v", Types.VariantType.get()))));

for (int unsupportedFormatVersion : ImmutableList.of(1, 2)) {
assertThatThrownBy(
() ->
TableMetadata.newTableMetadata(
v3Schema,
PartitionSpec.unpartitioned(),
SortOrder.unsorted(),
TEST_LOCATION,
ImmutableMap.of(),
unsupportedFormatVersion))
.isInstanceOf(IllegalStateException.class)
.hasMessage(
"Invalid schema for v%s:\n"
+ "- Invalid type for struct.v: variant is not supported until v3",
unsupportedFormatVersion);
}

// should be allowed in v3
TableMetadata.newTableMetadata(
v3Schema,
PartitionSpec.unpartitioned(),
SortOrder.unsorted(),
TEST_LOCATION,
ImmutableMap.of(),
3);
}

@Test
public void onlyMetadataLocationIsUpdatedWithoutTimestampAndMetadataLogEntry() {
String uuid = "386b9f01-002b-4d8c-b77f-42c3fd3b7c9b";
Expand Down