API: Add Variant data type #11324

Open
wants to merge 4 commits into
base: main

Contributor

@aihuaxu aihuaxu commented Oct 15, 2024

Add the Variant data type to the API module. For now we add only the minimal required type to the API; once the full implementation is done, we will promote the stable methods to the API through the VariantLike interface.

  • Add Variant as a new data type

Fix #11178.

@github-actions github-actions bot added the API label Oct 15, 2024
@aihuaxu aihuaxu marked this pull request as ready for review October 15, 2024 18:34
@@ -534,6 +534,7 @@ private static int estimateSize(Type type) {
case FIXED:
return ((Types.FixedType) type).length();
case BINARY:
case VARIANT:
Contributor

What is the rationale for this size?

Contributor Author

As with Binary, we can't know the exact size of a Variant, so I used the same value as Binary. I'm wondering how we came up with 80 for Binary.
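The reasoning above can be sketched as a switch in which fixed-width types report an exact size and variable-length types share one constant guess. This is only an illustration: `SizeEstimateSketch`, its `TypeId` enum, and the two-argument `estimateSize` are hypothetical names, not Iceberg's classes. The value 80 is the Binary estimate mentioned in the thread; the other sizes are assumptions.

```java
// Hypothetical stand-in for the size-estimation switch discussed above; not Iceberg's code.
final class SizeEstimateSketch {
  // Assumed type IDs, for illustration only.
  enum TypeId { INTEGER, LONG, FIXED, BINARY, VARIANT }

  // Constant guess for variable-length values; 80 is the Binary value mentioned in the thread.
  static final int VARIABLE_LENGTH_GUESS = 80;

  static int estimateSize(TypeId id, int fixedLength) {
    switch (id) {
      case INTEGER:
        return 4;
      case LONG:
        return 8;
      case FIXED:
        // Exact: the declared length of the fixed type.
        return fixedLength;
      case BINARY:
      case VARIANT:
        // No exact size is known up front, so VARIANT falls through to BINARY's estimate.
        return VARIABLE_LENGTH_GUESS;
      default:
        throw new IllegalArgumentException("Unknown type: " + id);
    }
  }

  public static void main(String[] args) {
    System.out.println(estimateSize(TypeId.VARIANT, 0));
  }
}
```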

@@ -562,7 +563,7 @@ private static String sanitize(Literal<?> literal, long now, int today) {
} else if (literal instanceof Literals.DoubleLiteral) {
return sanitizeNumber(((Literals.DoubleLiteral) literal).value(), "float");
} else {
- // for uuid, decimal, fixed, and binary, match the string result
+ // for uuid, decimal, fixed, variant, and binary, match the string result
Contributor

This is okay, but we may be losing information by not sanitizing based on the variant's type (e.g. date), and it would be nice to have some idea of the structure in the future.

Member

Are you suggesting something like

{
 foo:
   bar: 3 
   baz: 
      bozz: "flew"
 } 

to

{ 
  (hash-foo):
     (hash-bar) : (1 digit number)
     (hash-baz) :
        (hash-bozz) : (hash-xaxa)
}

Member

I think that would be a nice feature, but it's probably OK to leave it for its own issue.

Contributor Author

Thanks for helping me understand the concept. I have filed a follow-up issue, #11479, and will work on it separately.
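The structure-preserving sanitization discussed in this thread could look roughly like the sketch below. It is not Iceberg's `ExpressionUtil` and not the design of the follow-up issue: `VariantSanitizeSketch`, `hash`, and `sanitize` are hypothetical names, the variant is modeled as nested `Map` values, and the placeholder formats are assumptions chosen to resemble the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of structure-preserving sanitization; not Iceberg's ExpressionUtil.
final class VariantSanitizeSketch {
  // Replaces a string with a short hash placeholder so the original text is not exposed.
  static String hash(String value) {
    return String.format("(hash-%08x)", value.hashCode());
  }

  // Recursively sanitizes nested maps: keys are hashed, integers keep only their digit
  // count, and every other leaf is hashed by its string form.
  // Limitation: two keys with the same hashCode would collapse into one entry.
  static Object sanitize(Object value) {
    if (value instanceof Map) {
      Map<String, Object> result = new LinkedHashMap<>();
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
        result.put(hash(String.valueOf(entry.getKey())), sanitize(entry.getValue()));
      }
      return result;
    } else if (value instanceof Integer || value instanceof Long) {
      // Count digits only, ignoring a leading minus sign.
      int digits = String.valueOf(((Number) value).longValue()).replace("-", "").length();
      return "(" + digits + "-digit-int)";
    } else {
      return hash(String.valueOf(value));
    }
  }

  public static void main(String[] args) {
    Map<String, Object> baz = new LinkedHashMap<>();
    baz.put("bozz", "flew");
    Map<String, Object> foo = new LinkedHashMap<>();
    foo.put("bar", 3);
    foo.put("baz", baz);
    Map<String, Object> root = new LinkedHashMap<>();
    root.put("foo", foo);
    // Prints the same nesting as the input, with hashed keys and placeholder values.
    System.out.println(sanitize(root));
  }
}
```

The nesting depth and field count survive, which gives "some idea of the structure" without revealing field names or values.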

… full implementation. Block Transforms, SortOrder
@@ -38,6 +38,8 @@ class Identity<T> implements Transform<T, T> {
*/
@Deprecated
public static <I> Identity<I> get(Type type) {
Preconditions.checkArgument(!type.isVariantType(), "Unsupported type for identity: %s", type);
Member

don't we need to fix "canTransform" as well?

Member

Nvm I see that's already covered since "variant" is not considered a primitive

Member

@RussellSpitzer RussellSpitzer Nov 1, 2024

Preconditions.checkArgument(type.typeId() != Type.TypeID.VARIANT);

So we can avoid adding isVariantType to the interface

Contributor Author

Since Variant is not a primitive type, canTransform() will be false.
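The two checks discussed in this thread can be modeled in miniature as follows. This is a sketch under assumptions, not Iceberg's `Identity` class: `IdentityGuardSketch`, its `TypeId` enum, and `identityFor` are hypothetical, and the set of primitive types is reduced to two for brevity.

```java
// Hypothetical miniature of the transform checks discussed above; not Iceberg's classes.
final class IdentityGuardSketch {
  // Assumed type IDs, for illustration only.
  enum TypeId { LONG, STRING, STRUCT, VARIANT }

  // Assumption for this sketch: only LONG and STRING count as primitive.
  static boolean isPrimitive(TypeId id) {
    return id == TypeId.LONG || id == TypeId.STRING;
  }

  // Models the point that identity applies only to primitive types,
  // so VARIANT is rejected here without a dedicated check.
  static boolean canTransform(TypeId id) {
    return isPrimitive(id);
  }

  // Models the explicit guard on the deprecated factory, written as a type ID comparison.
  static String identityFor(TypeId id) {
    if (id == TypeId.VARIANT) {
      throw new IllegalArgumentException("Unsupported type for identity: " + id);
    }
    return "identity(" + id + ")";
  }

  public static void main(String[] args) {
    System.out.println(canTransform(TypeId.VARIANT));
    System.out.println(identityFor(TypeId.LONG));
  }
}
```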

@@ -92,6 +93,10 @@ default boolean isListType() {
return false;
}

default boolean isVariantType() {
Member

This is only used in a deprecated method. Do we have any other reason to add it? I think it probably doesn't need to be a part of the type interface. We could always just check whether the type ID is VARIANT.

Contributor Author

Yeah. This is more of a helper function, an alternative to checking whether the type ID is VARIANT. It will also be used in some other places later.
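The trade-off in this thread, a default helper on the interface versus a plain type ID comparison, can be illustrated with the sketch below. The `Type` interface here is a hypothetical two-method miniature, not Iceberg's `Type`, and `isVariantById` is an illustrative name for the reviewer's alternative.

```java
// Hypothetical miniature of a type interface with a default helper; not Iceberg's Type.
final class TypeHelperSketch {
  // Assumed type IDs, for illustration only.
  enum TypeId { LONG, VARIANT }

  interface Type {
    TypeId typeId();

    // Default helper: false unless an implementation overrides it.
    default boolean isVariantType() {
      return false;
    }
  }

  static final class LongType implements Type {
    @Override
    public TypeId typeId() {
      return TypeId.LONG;
    }
  }

  static final class VariantType implements Type {
    @Override
    public TypeId typeId() {
      return TypeId.VARIANT;
    }

    @Override
    public boolean isVariantType() {
      return true;
    }
  }

  // The alternative raised in review: compare the type ID, with no change to the interface.
  static boolean isVariantById(Type type) {
    return type.typeId() == TypeId.VARIANT;
  }

  public static void main(String[] args) {
    Type variant = new VariantType();
    System.out.println(variant.isVariantType() == isVariantById(variant));
  }
}
```

Both forms give the same answer; the difference is whether the public interface grows by one method.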

@@ -1687,6 +1687,44 @@ public void testV3TimestampNanoTypeSupport() {
3);
}

@Test
public void testV3VariantTypeSupport() {
Member

@RussellSpitzer RussellSpitzer Nov 1, 2024

I know this is copying a lot of tests in this class but we should start future proofing a bit more imho. We also have tests around this sort of thing in TestSchema.java. I think it is probably ok to just keep all of our schema validation tests there, but it wouldn't hurt to have some redundancy here as well.

Refactoring the whole suite can come in another pr but I think we should build a templated test that's something like

@ParameterizedTest
@ValueSource(types = {Types.TimestampNanos, Types.Variant, ....})
void testTypeSupport(Type type) {
  Schema schemaWithType = new Schema(
          Types.NestedField.required(1, "id", Types.LongType.get()),
          Types.NestedField.optional(2, type.name, type),
          Types.NestedField.optional(3, "arr", Types.ListType.ofRequired(4, type)),
          Types.NestedField.required(5, "struct",
            Types.StructType.of(
                  Types.NestedField.optional(6, "inner_" + type.name, type),
                  Types.NestedField.required(7, "data", Types.StringType.get()))),
          Types.NestedField.optional(8, "struct_arr",
              Types.StructType.of(
                  Types.NestedField.optional(9, "ts", type))));

    // Pseudocode from here on:
    // from 0 -> MIN_FORMAT_VERSION.get(type)
    //    fail to make metadata
    //
    // from MIN_FORMAT_VERSION.get(type) -> SUPPORTED_TABLE_VERSION
    //    succeed
}

The most important part about this is that we wouldn't have to continually update tests every time a new valid metadata version is added. It also would be much easier to test type compatibility. (I'm thinking that Geo is going to need the exact same thing soon)

For this PR I think it is enough to write a parameterized version for just Variant, then we could raise another PR to add in nanos and remove the redundant tests

Member

Note that this is part of my goal to remove all tests that have V3 or V2 in their title.

Member

Also, TODO: add Variant to Schema.MIN_FORMAT_VERSION

Contributor Author

Thanks @RussellSpitzer for already creating a PR to address that. I will include that when it's merged.
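The version-driven test proposed in this thread can be modeled without any Iceberg or JUnit dependency, as in the sketch below. Everything here is a hypothetical stand-in: `checkCompatibility` plays the role of metadata creation, `SUPPORTED_TABLE_VERSION` is assumed to be 3, and the minimum version of 3 for nanosecond timestamps and Variant is inferred from the V3 test names in this PR rather than read from Iceberg's actual `Schema.MIN_FORMAT_VERSION`.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, dependency-free model of the version-driven test described above.
final class MinFormatVersionSketch {
  // Assumed type IDs, for illustration only.
  enum TypeId { LONG, TIMESTAMP_NANO, VARIANT }

  // Assumed highest supported table format version.
  static final int SUPPORTED_TABLE_VERSION = 3;

  // Assumed minimum versions; types that are absent are treated as valid from v1.
  static final Map<TypeId, Integer> MIN_FORMAT_VERSION = new EnumMap<>(TypeId.class);

  static {
    MIN_FORMAT_VERSION.put(TypeId.TIMESTAMP_NANO, 3);
    MIN_FORMAT_VERSION.put(TypeId.VARIANT, 3);
  }

  static int minVersion(TypeId id) {
    return MIN_FORMAT_VERSION.getOrDefault(id, 1);
  }

  // Stand-in for metadata creation: rejects a type used below its minimum version.
  static void checkCompatibility(TypeId id, int formatVersion) {
    if (formatVersion < minVersion(id)) {
      throw new IllegalStateException(id + " requires format version " + minVersion(id));
    }
  }

  static boolean rejects(TypeId id, int formatVersion) {
    try {
      checkCompatibility(id, formatVersion);
      return false;
    } catch (IllegalStateException e) {
      return true;
    }
  }

  // The templated check: versions below the minimum must fail, all others must succeed.
  static boolean versionsBehaveAsExpected(TypeId id) {
    for (int version = 1; version <= SUPPORTED_TABLE_VERSION; version++) {
      boolean shouldReject = version < minVersion(id);
      if (rejects(id, version) != shouldReject) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    for (TypeId id : TypeId.values()) {
      System.out.println(id + ": " + versionsBehaveAsExpected(id));
    }
  }
}
```

Because the loop bounds come from the map and the supported-version constant, adding a new format version or a new type changes data rather than test code, which is the benefit the reviewer describes.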

Successfully merging this pull request may close these issues.

Proposal: add Variant type to iceberg
4 participants