Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harmonize representations for field names to Vec<String> #591

Open
c-thiel opened this issue Aug 29, 2024 · 6 comments
Open

Harmonize representations for field names to Vec<String> #591

c-thiel opened this issue Aug 29, 2024 · 6 comments

Comments

@c-thiel
Copy link
Contributor

c-thiel commented Aug 29, 2024

Currently due to the way name-to-id is build, we cannot have points in columnames if it collides with a struct.

The following schema fails to build:

        let schema = Schema::builder()
            .with_schema_id(1)
            .with_fields(vec![
                NestedField::required(1, "foo", Primitive(PrimitiveType::String)).into(),
                NestedField::required(2, "foo.bar", Primitive(PrimitiveType::Int)).into(),
                NestedField::required(
                    3,
                    "foo",
                    Type::Struct(StructType::new(vec![NestedField::required(
                        4,
                        "bar",
                        Primitive(PrimitiveType::Int),
                    )
                    .into()])),
                )
                .into(),
            ])
            .build()
            .unwrap();

By prohibiting this we follow the Java implementation.
There is nothing in the iceberg spec that prohibits these names, so I think we should allow them.
For column names as a user I expect the same behaviors as for namespaces or databases with points - it needs to be escaped. As escaping depends on the query engine, and iceberg-rust as well as iceberg java has no SQL-parser, those libraries should not take away the option from the engine or make that decision in their stead.

I propose to change the representation of a colname to Vec<String> instead of just "String with points". It would also make accessor compatible between schemas - even if we decide to stick keep this artificial restriction.

@c-thiel
Copy link
Contributor Author

c-thiel commented Aug 29, 2024

CC @Fokko , @nastra , @Xuanwo

@apache apache deleted a comment Aug 29, 2024
@liurenjie1024
Copy link
Contributor

Allowing such a case would also require a change on the table scan api, e.g. we need a way to allow user to tell use what foo.bar actually means.

@liurenjie1024
Copy link
Contributor

Also cc @rdblue

@yukkit
Copy link
Contributor

yukkit commented Sep 7, 2024

FWIW. Can we introduce a structure similar to TableIdent to represent field paths, named 'FieldPath'?

  1. We could modify the elements of selected TableScan::column_names from string to FieldPath (and provide some utilities to simplify the construction of FieldPath).
  2. Change the key in name_to_id to FieldPath.
  3. Perhaps the logic for looking up fields via FieldPath would also need to be modified.

@rdblue
Copy link

rdblue commented Sep 16, 2024

This is an implementation detail that is not part of the spec. But I don't think it is worth bothering to enable both ["foo.bar"] and ["foo", "bar"] identifiers in the same schema. That's confusing for users, who will almost certainly be confused by tables with such odd structures.

The reference implementation has been this way for about 7 years and no one has every complained or, to my knowledge, hit a problem with this in practice. I highly recommend focusing time and effort on other improvements.

@c-thiel
Copy link
Contributor Author

c-thiel commented Nov 13, 2024

This is not so much about this specific use case - which I also don't care about much either, but about having two different representations for the same entity. Took me a while to formulate it so clearly.
There are many lines of code in different files just to work around this problem.

Let's assume for example we want to add a field to a schema, then the point representation in the PartitionSpec is not compatible with the one in the Schema. Java would just throw an exception in this case. Because we have two distinct representations for the same entity, and those representations are not bijective, we need extra code to handle conversion.

https://github.com/apache/iceberg/blob/e06b069529be3d3d389b156646e751de3753feb0/core/src/main/java/org/apache/iceberg/SchemaUpdate.java#L97-L103
also
https://github.com/apache/iceberg/blob/e06b069529be3d3d389b156646e751de3753feb0/core/src/main/java/org/apache/iceberg/SchemaUpdate.java#L112
several lines of docs here:
https://github.com/apache/iceberg/blob/e06b069529be3d3d389b156646e751de3753feb0/api/src/main/java/org/apache/iceberg/UpdateSchema.java#L54-L58

We are even inclined to document at some places that a point is nothing special here:

 * <p>The given name is used to name the new column and names containing "." are not handled
 * differently.

All of this could be solved by using a globaly unique identifier for a field - which for me is very clearly Vec<String> and just handle conversion to a string only representation at the last moment.

I stumbled across this again when implementing SchemaUpdate. So lets forget about the use-case, but maybe still consider harmonizing identifiers for a field.

@c-thiel c-thiel changed the title Schema: Allow field name foo.bar even if struct foo->bar is present Harmonize representations for field names to Vec<String> Nov 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants