Harmonize representations for field names to `Vec<String>` #591

c-thiel · 2024-08-29T07:14:02Z

Currently due to the way name-to-id is build, we cannot have points in columnames if it collides with a struct.

The following schema fails to build:

        let schema = Schema::builder()
            .with_schema_id(1)
            .with_fields(vec![
                NestedField::required(1, "foo", Primitive(PrimitiveType::String)).into(),
                NestedField::required(2, "foo.bar", Primitive(PrimitiveType::Int)).into(),
                NestedField::required(
                    3,
                    "foo",
                    Type::Struct(StructType::new(vec![NestedField::required(
                        4,
                        "bar",
                        Primitive(PrimitiveType::Int),
                    )
                    .into()])),
                )
                .into(),
            ])
            .build()
            .unwrap();

By prohibiting this we follow the Java implementation.
There is nothing in the iceberg spec that prohibits these names, so I think we should allow them.
For column names as a user I expect the same behaviors as for namespaces or databases with points - it needs to be escaped. As escaping depends on the query engine, and iceberg-rust as well as iceberg java has no SQL-parser, those libraries should not take away the option from the engine or make that decision in their stead.

I propose to change the representation of a colname to Vec<String> instead of just "String with points". It would also make accessor compatible between schemas - even if we decide to stick keep this artificial restriction.

The text was updated successfully, but these errors were encountered:

c-thiel · 2024-08-29T07:14:39Z

CC @Fokko , @nastra , @Xuanwo

liurenjie1024 · 2024-08-29T08:56:45Z

Allowing such a case would also require a change on the table scan api, e.g. we need a way to allow user to tell use what foo.bar actually means.

liurenjie1024 · 2024-08-29T10:07:18Z

Also cc @rdblue

yukkit · 2024-09-07T09:14:39Z

FWIW. Can we introduce a structure similar to TableIdent to represent field paths, named 'FieldPath'?

We could modify the elements of selected TableScan::column_names from string to FieldPath (and provide some utilities to simplify the construction of FieldPath).
Change the key in name_to_id to FieldPath.
Perhaps the logic for looking up fields via FieldPath would also need to be modified.

rdblue · 2024-09-16T18:25:36Z

This is an implementation detail that is not part of the spec. But I don't think it is worth bothering to enable both ["foo.bar"] and ["foo", "bar"] identifiers in the same schema. That's confusing for users, who will almost certainly be confused by tables with such odd structures.

The reference implementation has been this way for about 7 years and no one has every complained or, to my knowledge, hit a problem with this in practice. I highly recommend focusing time and effort on other improvements.

c-thiel · 2024-11-13T16:23:20Z

This is not so much about this specific use case - which I also don't care about much either, but about having two different representations for the same entity. Took me a while to formulate it so clearly.
There are many lines of code in different files just to work around this problem.

Let's assume for example we want to add a field to a schema, then the point representation in the PartitionSpec is not compatible with the one in the Schema. Java would just throw an exception in this case. Because we have two distinct representations for the same entity, and those representations are not bijective, we need extra code to handle conversion.

https://github.com/apache/iceberg/blob/e06b069529be3d3d389b156646e751de3753feb0/core/src/main/java/org/apache/iceberg/SchemaUpdate.java#L97-L103
also
https://github.com/apache/iceberg/blob/e06b069529be3d3d389b156646e751de3753feb0/core/src/main/java/org/apache/iceberg/SchemaUpdate.java#L112
several lines of docs here:
https://github.com/apache/iceberg/blob/e06b069529be3d3d389b156646e751de3753feb0/api/src/main/java/org/apache/iceberg/UpdateSchema.java#L54-L58

We are even inclined to document at some places that a point is nothing special here:

 * <p>The given name is used to name the new column and names containing "." are not handled
 * differently.

All of this could be solved by using a globaly unique identifier for a field - which for me is very clearly Vec<String> and just handle conversion to a string only representation at the last moment.

I stumbled across this again when implementing SchemaUpdate. So lets forget about the use-case, but maybe still consider harmonizing identifiers for a field.

apache deleted a comment Aug 29, 2024

c-thiel changed the title ~~Schema: Allow field name foo.bar even if struct foo->bar is present~~ Harmonize representations for field names to Vec<String> Nov 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Harmonize representations for field names to `Vec<String>` #591

Harmonize representations for field names to `Vec<String>` #591

c-thiel commented Aug 29, 2024

c-thiel commented Aug 29, 2024

liurenjie1024 commented Aug 29, 2024

liurenjie1024 commented Aug 29, 2024

yukkit commented Sep 7, 2024

rdblue commented Sep 16, 2024

c-thiel commented Nov 13, 2024

Harmonize representations for field names to Vec<String> #591

Harmonize representations for field names to Vec<String> #591

Comments

c-thiel commented Aug 29, 2024

c-thiel commented Aug 29, 2024

liurenjie1024 commented Aug 29, 2024

liurenjie1024 commented Aug 29, 2024

yukkit commented Sep 7, 2024

rdblue commented Sep 16, 2024

c-thiel commented Nov 13, 2024

Harmonize representations for field names to `Vec<String>` #591

Harmonize representations for field names to `Vec<String>` #591