Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates catalog names for case-insensitive lookup #1510

Merged
merged 1 commit into from
Jul 16, 2024

Conversation

RCHowell
Copy link
Member

Relevant Issues

#1496

Description

Adds session, path, and identifiers to the catalog package with support for case-insensitive identifiers.

License Information

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@RCHowell RCHowell requested a review from johnedquinn July 11, 2024 23:01
@RCHowell RCHowell mentioned this pull request Jul 15, 2024
17 tasks
Comment on lines +38 to +39
if (ignoreCase && !(this.identifier.matches(other.identifier))) {
return false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of ignoreCase here? If I write (pseudo-code):

Id("geh").matches(Id("geH"), ignoreCase = true)

Assume both of the above are delimited.

The above won't actually ignore the case. The Part.matches requires that one of them be non-delimited.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beyond this, what is the purpose of matches? From what I'm gathering, Name is the resolved unix-style path of a DB object, whereas Identifier is unresolved. Why would you want to compare two identifiers? In the current implementation, we have BindingPath.matches(ConnectorPath) (AKA unresolved.matches(resolved))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct that the above won't actually ignore the case (because these are delimited), but consider if one or more of the identifiers is "regular" but the planner is in case-sensitive mode.

The ignoreCase = false is an explicit way to ignore regular vs. delimited and just compare the text.

On the second point, what is the purpose of matches? I don't actually believe we need to compare two identifiers, but I do know the catalog resolution implementation will be different than bindingpath.matches(connectorpath) of today. I can remove the match logic later.

/**
* Catalogs is used to provide the default catalog and possibly others by name.
*/
public interface Catalogs {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be public?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the idea here is to abstract the Map<String, ConnectorMetadata> from the session so that a customer may implement their own catalog provider. I understand that you can have a default impl (which is really just a map) that you construct with the planner builder, but consider something like Catalogs backed by a DynamoDB implementation.

Copy link
Contributor

@yliuuuu yliuuuu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would need to spend more time on the base branch v1-metadata, to understand more on how planner / spi uses those interface.

Here is my understanding on the subject of identifier case sensitivity.

Consider the following PartiQL Query, planned in CASE-SENSITIVE Mode for reduced complicity

SELECT tbl.a FROM TBL as tbl

We should have a normalization pass to produce an equivalent query:

SELECT VALUE {'c' : "tbl"."a"} FROM "TBL" as "tbl"

After the normalization, during the planning stage, our planner should be request metadata on "TBL". This is: a case-sensitive identifier with symbol "TBL".

In the specific connector, the connector implementer should be able to make a conscious choice on whether to honor the case sensitivity flag requested by PartiQL.

In other words: the connector implementor may choose to honor the case sensitivity rules of partiql, and only search for exact match of "TBL", or it could possible do matching with case ignored or even folding down the case.

In which case the connector implementer is responsible for configuring PartiQL and implement the connector-specific lookup rule.

In the above example, the retrieval of from source metadata (metadata of "TBL") is the only time we need to call the connector, post from source, we bind the "TBL" with "tbl" and everything in within PartiQL.

Assume the DDL on for relation "TBL" is:

CREATE TABLE "TBL" (
    a INT2
)

Case One: "FooConnector", which connects to a relational database "Foo" that lowercase the case insensitive identifier.

retrieve of "TBL" -> BAG<STRUCT<a: INT2>>

Projection: "tbl"."a" -> INT2

Case Two: "BarConnector", which connects to a relational database "Bar" that uppercase the case insensitive identifier.

retrieve of "TBL" -> BAG<STRUCT<A: INT2>>

Projection: "tbl"."a" -> Unresolved.

In which case: this is a bad configuration in regards to PartiQL mode.

Case three: "BazConnector", which connects to a relational database "Baz" that dishonor delimited identifier and lower case everything:

In which case, the implementation of Baz Connector, upon receiving request for "TBL", should consider searching for lowered case "tbl"

@RCHowell
Copy link
Member Author

@yliuuuu please consider this outside the context of connector/spi as that is merely a pattern for implementing a catalog.

After the normalization, during the planning stage, our planner should be request metadata on "TBL". This is: a case-sensitive identifier with symbol "TBL".
In the specific connector, the connector implementer should be able to make a conscious choice on whether to honor the case sensitivity flag requested by PartiQL.

I argue they MUST honor the sensitivity flag if we want to have consistent semantics.

In other words: the connector implementor may choose to honor the case sensitivity rules of partiql, and only search for exact match of "TBL", or it could possible do matching with case ignored or even folding down the case.

The implementer must honor the case configuration or else it could lead to inconsistent behavior.

In which case the connector implementer is responsible for configuring PartiQL and implement the connector-specific lookup rule.

The customer is indeed responsible for lookup, this is precisely their implementation of either

- getTable(name: Name) // get by exact/resolved name
- getTable(identifier: Identifier) // lookup based on identifier (MAYBE case-insensitive)

In the above example, the retrieval of from source metadata (metadata of "TBL") is the only time we need to call the connector, post from source, we bind the "TBL" with "tbl" and everything in within PartiQL.

Assume the DDL on for relation "TBL" is:

CREATE TABLE "TBL" (a INT2)

Case One: "FooConnector", which connects to a relational database "Foo" that lowercase the case insensitive identifier.
retrieve of "TBL" -> BAG<STRUCT<a: INT2>>
Projection: "tbl"."a" -> INT2

I think you are misunderstanding what the identifier is. If you look more closely, you will see that an identifier is composed of parts which are either regular or delimited. We send the identifier to a catalog implementation to get any associated metadata.

It looks more like this,

SELECT tbl.a FROM TBL as tbl

val tbl = env.getTable(Identifier.regular("TBL")) // we have a REGULAR identifier (not-delimited)
val schema = tbl.getSchema()

println(schema)

// << { a: int2 } >> 

Case Two: "BarConnector", which connects to a relational database "Bar" that uppercase the case insensitive identifier.
retrieve of "TBL" -> BAG<STRUCT<A: INT2>>
Projection: "tbl"."a" -> Unresolved.

Again we only forward that the identifier is REGULAR (not delimited) — it is the responsibility of the catalog implementor to handle the regular identifier resolution based upon their system.

In which case: this is a bad configuration in regards to PartiQL mode.
Case three: "BazConnector", which connects to a relational database "Baz" that dishonor delimited identifier and lower case everything:
In which case, the implementation of Baz Connector, upon receiving request for "TBL", should consider searching for lowered case "tbl"

This is correct! but only when the identifier is REGULAR. If the identifier is DELIMITED you cannot change its case. However, if the identifier is REGULAR then the catalog is responsible for its matching ie case normalization or insensitive matching.

@yliuuuu
Copy link
Contributor

yliuuuu commented Jul 15, 2024

My argument is that PartiQL only have case insensitive identifier in "Case-Insensitive" mode.

The PartiQL Semantics for

SELECT tbl.a FROM TBL as "tbl"

is mode dependent.

Assuming case preservation on:

-- Folding up
-- asking for regular("TBL", regular = false)
SELECT "TBL"."A" FROM "TBL" as "tbl"

-- Folding down
-- asking for ("tbl", regular = false)
SELECT "tbl"."a" FROM "tbl" as "tbl"

-- Case Sensitive
-- asking for ("TBL", regular = false)
SELECT "tbl"."a" FROM "TBL" as "tbl"

-- Case Insensitive
-- asking for ("TBL", regular= true)
SELECT tbl.a FROM TBL as "tbl"

I argue they MUST honor the sensitivity flag if we want to have consistent semantics.

I think the invocation to connector is a data/metadata retrieval only. It does not effect PartiQL Semantic.

Ideally with the different modes their should be not need to do the custom normalization logic unless they are in "case-insensitive" mode, but we have no control over how connectors implement the retrieval function.

@RCHowell
Copy link
Member Author

I think the invocation to connector is a data/metadata retrieval only. It does not effect PartiQL Semantic.

It does because it determines what "TBL" actually refers to!

My argument is that PartiQL only have case insensitive identifier in "Case-Insensitive" mode.

By normalization I mean we literally rewrite the AST based upon the mode.

-- input
select a from tbl as t

-- normalized in case-sensitive mode
select "a" from "tbl" as t

Put quotes on everything, then we send ("tbl", regular = false) to the catalog APIs — but you are right that we have no control over whether or not a customer respects the regular flag.

@RCHowell RCHowell closed this Jul 16, 2024
@RCHowell RCHowell reopened this Jul 16, 2024
@RCHowell RCHowell merged commit 1c3085b into v1-metadata Jul 16, 2024
8 of 20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants