[v1] Remove PartiQLValue from AST; refactor AST literals #1650

alancai98 · 2024-11-19T01:48:44Z

Relevant Issues

Closes [V1] Remove PartiQLValue from the AST #1589

Description

Removes PartiQLValue from AST
Refactors AST literals to use their own representation

Other Information

Updated Unreleased Section in CHANGELOG: [NO]
- No on v1 branch.
Any backward-incompatible changes? [YES]
- Yes but on v1 branch
Any new external dependencies? [NO]
Do your changes comply with the Contributing Guidelines
and Code Style Guidelines? [YES]

License Information

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

github-actions · 2024-11-19T01:56:57Z

CROSS-ENGINE-REPORT ❌

	BASE (LEGACY-V0.14.8)	TARGET (EVAL-503D653)	+/-
% Passing	89.67%	94.39%	4.72% ✅
Passing	5287	5565	278 ✅
Failing	609	50	-559 ✅
Ignored	0	281	281 🔶
Total Tests	5896	5896	0 ✅

Testing Details

Base Commit: v0.14.8
Base Engine: LEGACY
Target Commit: 503d653
Target Engine: EVAL

Result Details

❌ REGRESSION DETECTED. See Now Failing/Ignored Tests. ❌
Passing in both: 2643
Failing in both: 17
Ignored in both: 0
PASSING in BASE but now FAILING in TARGET: 3
PASSING in BASE but now IGNORED in TARGET: 108
FAILING in BASE but now PASSING in TARGET: 180
IGNORED in BASE but now PASSING in TARGET: 0

Now FAILING Tests ❌

The following 3 test(s) were previously PASSING in BASE but are now FAILING in TARGET:

undefinedUnqualifiedVariableWithUndefinedVariableBehaviorMissing, compileOption: PERMISSIVE
undefinedUnqualifiedVariableIsNullExprWithUndefinedVariableBehaviorMissing, compileOption: PERMISSIVE
undefinedUnqualifiedVariableIsMissingExprWithUndefinedVariableBehaviorMissing, compileOption: PERMISSIVE

Now IGNORED Tests ❌

The complete list can be found in GitHub CI summary, either from Step Summary or in the Artifact.

Now Passing Tests

180 test(s) were previously failing in BASE (LEGACY-V0.14.8) but now pass in TARGET (EVAL-503D653). Before merging, confirm they are intended to pass.

The complete list can be found in GitHub CI summary, either from Step Summary or in the Artifact.

CROSS-COMMIT-REPORT ✅

	BASE (EVAL-F5C6EFF)	TARGET (EVAL-503D653)	+/-
% Passing	94.39%	94.39%	0.00% ✅
Passing	5565	5565	0 ✅
Failing	50	50	0 ✅
Ignored	281	281	0 ✅
Total Tests	5896	5896	0 ✅

Testing Details

Base Commit: f5c6eff
Base Engine: EVAL
Target Commit: 503d653
Target Engine: EVAL

Result Details

Passing in both: 5565
Failing in both: 50
Ignored in both: 281
PASSING in BASE but now FAILING in TARGET: 0
PASSING in BASE but now IGNORED in TARGET: 0
FAILING in BASE but now PASSING in TARGET: 0
IGNORED in BASE but now PASSING in TARGET: 0

partiql-ast/src/main/java/org/partiql/ast/literal/LiteralDecimal.java

alancai98 · 2024-11-19T01:54:44Z

partiql-ast/src/main/java/org/partiql/ast/Explain.java

    @NotNull
-    public final Map<String, PartiQLValue> options;
+    public final Map<String, Literal> options;


(self-review) model the EXPLAIN option map values w/ a Literal directly. Could also opt for ExprLit but wasn't sure

alancai98 · 2024-11-19T01:55:34Z

partiql-ast/build.gradle.kts

-    api(project(":partiql-types"))
-    // TODO REMOVE ME ONCE PartiQLValue IS REMOVED
-    // THE AST NEEDS ITS OWN "VALUE" REPRESENTATION
-    api(project(":partiql-spi"))


(self-review) partiql-ast now will not depend on other packages. Had to add the spi dependency to partiql-parser following this change

alancai98 · 2024-11-19T01:56:41Z

partiql-ast/src/main/java/org/partiql/ast/sql/SqlDialect.kt

-            val newArgs = listOf(exprLit(symbolValue(dtField))) + node.args.drop(1)
+            val dtField = ((node.args[0] as ExprLit).lit as LiteralString).value
+            // Represent as an `ExprVarRef` to mimic a literal symbol.
+            // TODO consider some other representation for unquoted strings


(self-review) I couldn't find a better way to represent the previous capability of printing PartiQLValue symbols (i.e. text without any quoting). For now just represent with a var ref.

alancai98 · 2024-11-19T01:57:29Z

partiql-ast/src/main/java/org/partiql/ast/sql/SqlDialect.kt

+            DataType.TIME, DataType.TIMESTAMP -> tail concat type(node.name(), node.precision, gap = true)
+            DataType.TIME_WITH_TIME_ZONE -> tail concat type("TIME", node.precision, gap = true) concat(" WITH TIME ZONE")
+            DataType.TIMESTAMP_WITH_TIME_ZONE -> tail concat type("TIMESTAMP", node.precision, gap = true) concat(" WITH TIME ZONE")


(self-review) was a bug w/ prior pretty-printing. Time and timestamp precision should follow the time/timestamp keyword.
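A minimal Java sketch of the ordering described above, not the actual SqlDialect code: `TimeTypePrinter` and `render` are hypothetical names, and the exact spacing around the precision is an assumption. The only point is that the precision attaches to the TIME/TIMESTAMP keyword, ahead of WITH TIME ZONE.

```java
final class TimeTypePrinter {
    // Hypothetical helper: renders e.g. "TIME (6) WITH TIME ZONE".
    // A null precision means no precision was specified.
    static String render(String keyword, Integer precision, boolean withTimeZone) {
        StringBuilder sb = new StringBuilder(keyword);
        if (precision != null) {
            // The precision follows the keyword, not the WITH TIME ZONE suffix.
            sb.append(" (").append(precision).append(")");
        }
        if (withTimeZone) {
            sb.append(" WITH TIME ZONE");
        }
        return sb.toString();
    }
}
```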

alancai98 · 2024-11-19T02:01:54Z

partiql-planner/src/main/kotlin/org/partiql/planner/internal/transforms/RexConverter.kt

+                is LiteralInt -> PType.integer()
+                is LiteralLong -> PType.bigint()
+                is LiteralDouble -> PType.real()
+                is LiteralTypedString -> {


(self-review) move the datetime literal validation to the ast->plan conversion.
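For illustration only, here is roughly what a validation step at AST->plan conversion time could look like. This is a sketch under assumptions: `DateTimeLiteralCheck` and `isValid` are hypothetical names, and using java.time's ISO parsers is a stand-in for whatever rules RexConverter actually applies.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.format.DateTimeParseException;

final class DateTimeLiteralCheck {
    // Hypothetical check: the AST keeps the raw string, and the conversion
    // step decides whether it is a well-formed DATE or TIME.
    static boolean isValid(String typeKeyword, String text) {
        try {
            switch (typeKeyword) {
                case "DATE":
                    LocalDate.parse(text); // ISO-8601 date, e.g. 2024-11-19
                    return true;
                case "TIME":
                    LocalTime.parse(text); // ISO-8601 time, e.g. 12:34:56
                    return true;
                default:
                    return false; // other types are out of scope for this sketch
            }
        } catch (DateTimeParseException e) {
            return false;
        }
    }
}
```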

alancai98 · 2024-11-19T02:04:13Z

Marking as a draft to look at the conformance test failures.

RCHowell · 2024-11-19T21:02:26Z

partiql-ast/src/main/java/org/partiql/ast/Explain.java

    @NotNull
-    public final Map<String, PartiQLValue> options;
+    public final Map<String, Literal> options;


Literal makes sense

RCHowell · 2024-11-19T21:05:53Z

partiql-ast/src/main/java/org/partiql/ast/literal/LiteralInt.java

+public class LiteralInt extends Literal {
+    public int value;
+
+    public LiteralInt(int value) {
+        super(String.format("%d", value));
+        this.value = value;
+    }
+}


I don't believe there should be both a LiteralInt and LiteralLong, there should be one "integral literal" which may just be called LiteralInteger.

To be honest, it would be nice to just have LiteralNumber which has the necessary APIs / details to model floating point, decimals, and integers. But also there are places in the BNF where we might want to enforce exact numeric or integral.

Yeah -- the literals, especially the numerics, could be modeled in a lot of different ways. I played around w/ some different approaches but wasn't yet sure which stage (i.e. parser, ast, planner) should perform the validation and conversion of the literal text.

I'll sync offline w/ you and @johnedquinn on where the text -> value conversion should take place and where to perform any validation.

partiql-planner/src/main/kotlin/org/partiql/planner/internal/transforms/RexConverter.kt

alancai98 · 2024-11-21T01:17:22Z

partiql-ast/src/main/java/org/partiql/ast/literal/LiteralApprox.java

@@ -0,0 +1,52 @@
+package org.partiql.ast.literal;


I discussed a different modeling with @RCHowell using just one class rather than three for the numerics (LiteralApprox, LiteralExact, LiteralInt) that used a enum/kind to distinguish between the three numeric types.

public class LiteralNumber extends Literal {
    @Nullable
    private final Long p1;
    @Nullable
    private final BigDecimal p2;
    @NotNull
    public final ParseContext kind;

    private LiteralNumber(@Nullable Long p1, @Nullable BigDecimal p2, @NotNull ParseContext kind) {
        this.p1 = p1;
        this.p2 = p2;
        this.kind = kind;
    }

    // Factory methods
    public static LiteralNumber integer(long value) {
        return new LiteralNumber(value, null, ParseContext.INTEGER);
    }

    public static LiteralNumber integer(int value) {
        return new LiteralNumber((long) value, null, ParseContext.INTEGER);
    }

    public static LiteralNumber exact(BigDecimal value) {
        return new LiteralNumber(null, value, ParseContext.EXACT);
    }

    public static LiteralNumber approx(BigDecimal value, long exponent) {
        return new LiteralNumber(exponent, value, ParseContext.APPROX);
    }

    public static LiteralNumber approx(BigDecimal value) {
        return new LiteralNumber(null, value, ParseContext.APPROX);
    }

    public static LiteralNumber approx(float value) {
        return approx(BigDecimal.valueOf(value));
    }

    public static LiteralNumber approx(double value) {
        return approx(BigDecimal.valueOf(value));
    }

    // Getting the value out
    @NotNull
    public BigDecimal getDecimal() {
        // p2 is only populated for non-integer kinds; EXACT is the decimal case
        if (kind == ParseContext.EXACT) {
            return p2;
        } else {
            throw new IllegalStateException("Unknown context: " + kind);
        }
    }

    public long getInteger() {
        if (kind == ParseContext.INTEGER) {
            return p1;
        } else {
            throw new IllegalStateException("Unknown context: " + kind);
        }
    }

    // similarly for getDouble

    // TODO if we keep this representation, change to extend `AstEnum`
    public enum ParseContext {
        INTEGER, EXACT, APPROX
    }

    @NotNull
    @Override
    public String getText() {
        switch (kind) {
            // Since they're nullable, can be slightly annoying to extract the value but it's internal code
            case INTEGER:
                assert p1 != null;
                return p1.toString();
            case EXACT:
                assert p2 != null;
                return p2.toString();
            case APPROX:
                assert p1 != null;
                assert p2 != null;
                return p2 + "e" + p1;
            default:
                throw new IllegalStateException("Unknown context: " + kind);
        }
    }
}

and if we wanted to internalize the enum, could add some additional methods

public boolean isInteger() {
    return kind == ParseContext.INTEGER;
}

public boolean isExact() {
    return kind == ParseContext.EXACT;
}

public boolean isApprox() {
    return kind == ParseContext.APPROX;
}

imo, the customer code using these different modelings is pretty similar. But I feel like the internal implementation code is simpler if we break apart the LiteralNumeric class into separate classes since we don't have to do the null checks or casing on the enum/kind.

Using the single class with an enum

val v = when (lit) {
    is LiteralNumber -> {
        val kind = lit.kind
        when (kind) {
            LiteralNumber.ParseContext.EXACT -> {
                lit.decimal
            }
            LiteralNumber.ParseContext.INTEGER -> {
                lit.integer
            }
            LiteralNumber.ParseContext.APPROX -> {
                lit.double
            }
            else -> error("Unexpected numeric literal: $lit")
        }
    }
    else -> error("Unexpected literal: $lit")
}

Not using the enum directly but using helper methods

val v2 = when (lit) {
    is LiteralNumber -> {
        if (lit.isExact) {
            lit.decimal
        } else if (lit.isInteger) {
            lit.integer
        } else if (lit.isApprox) {
            lit.double
        } else {
            error("Unexpected numeric literal: $lit")
        }
    }
    else -> error("Unexpected literal: $lit")
}

Using current PR's three classes

val v3 = when (lit) {
    is LiteralInteger -> lit.integer
    is LiteralExact -> lit.decimal
    is LiteralApprox -> lit.double
    else -> error("Unexpected literal: $lit")
}

RCHowell · 2024-11-21T21:37:22Z

partiql-ast/src/main/java/org/partiql/ast/literal/LiteralApprox.java

@@ -0,0 +1,52 @@
+package org.partiql.ast.literal;


Sent a comment elsewhere, but we don't have to do the null checks. This is internal code which will throw a NPE if we have a bug whereas now it throws an assertion exception, again if we have a bug. The checks don't gain us anything because an exception is thrown either way.

RCHowell · 2024-11-21T21:42:20Z

partiql-ast/src/main/java/org/partiql/ast/literal/LiteralExact.java

+    @NotNull
+    public BigDecimal getDecimal() {
+        return value;
+    }


What if we want to get the exact numeric as an int or long? Also float or double?

What is missing here is an interface LiteralNumber which has all of the toInt(), toFloat(), toLong() etc. methods. Because all of these methods should be shared for the literal number, I don't see that the additional classes actually gain you much. Also now the factory methods are spread out. Why factory methods if we already have the classes split AND constructors? If you want to split classes, use a common interface and a factory method. Then when you step back you will see that the split classes gain you little at the expense of complexity elsewhere (visitors/rewriters/printers/etc.).

Discussed offline. I had not considered those interface methods in the customer code I was testing out. Agree with your comment that the typical OO approach (i.e. effective java item 23 to prefer class hierarchies to tagged classes) may not be what we want here. Something more akin to PType, Datum, and AnyElement could offer a more cohesive interface.

As mentioned by @RCHowell, Calcite follows the tagged class approach using its SqlTypeName property -- https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/sql/SqlLiteral.java. I'll try out a similar approach for literal modeling.
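A rough Java sketch of what a single tagged numeric literal with shared conversion methods might look like. None of this is the PR's code or Calcite's API: `TaggedLiteral`, `Kind`, and the `toX()` accessors are hypothetical names chosen to mirror the toInt()/toLong() methods mentioned in the thread, and storing every kind as a BigDecimal is an assumption made to keep the sketch short.

```java
import java.math.BigDecimal;

final class TaggedLiteral {
    // The tag distinguishes how the literal was written, not how it is stored.
    enum Kind { INTEGER, EXACT, APPROX }

    final Kind kind;
    private final BigDecimal value;

    private TaggedLiteral(Kind kind, BigDecimal value) {
        this.kind = kind;
        this.value = value;
    }

    // All factory methods live in one place.
    static TaggedLiteral integer(long v) {
        return new TaggedLiteral(Kind.INTEGER, BigDecimal.valueOf(v));
    }

    static TaggedLiteral exact(BigDecimal v) {
        return new TaggedLiteral(Kind.EXACT, v);
    }

    static TaggedLiteral approx(double v) {
        return new TaggedLiteral(Kind.APPROX, BigDecimal.valueOf(v));
    }

    // Shared accessors: callers ask for the Java type they need, whatever the tag.
    // The *Exact variants throw ArithmeticException on overflow or a fractional part.
    int toInt() { return value.intValueExact(); }
    long toLong() { return value.longValueExact(); }
    double toDouble() { return value.doubleValue(); }
    BigDecimal toDecimal() { return value; }
}
```

With this shape, visitors and printers handle one node type and only branch on `kind` where the distinction matters.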

johnedquinn · 2024-11-22T20:13:36Z

partiql-ast/src/main/java/org/partiql/ast/literal/LiteralApprox.java

+    private final BigDecimal mantissa;
+
+    private final int exponent;


Why wouldn't these be public?

TBH I'm not quite sure what needs to be public atm. We could always add helper functions in the future.
When looking at an approximate numeric literal, are there use cases where an API user would actually need just the mantissa or just the exponent? The use cases we currently have use both the arguments (e.g. pretty-printing, AST->plan conversion).
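To make the point concrete, a small Java sketch in which both current use cases (text for pretty-printing, a value for AST->plan conversion) consume the mantissa and exponent together. `ApproxLiteralSketch` and its methods are hypothetical, not the PR's LiteralApprox.

```java
import java.math.BigDecimal;

final class ApproxLiteralSketch {
    // Kept private: in this sketch no caller needs either part on its own.
    private final BigDecimal mantissa;
    private final int exponent;

    ApproxLiteralSketch(BigDecimal mantissa, int exponent) {
        this.mantissa = mantissa;
        this.exponent = exponent;
    }

    // Pretty-printing uses both parts.
    String getText() {
        return mantissa + "E" + exponent;
    }

    // Conversion to a value uses both parts as well.
    double toDouble() {
        return mantissa.scaleByPowerOfTen(exponent).doubleValue();
    }
}
```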

johnedquinn · 2024-11-22T20:31:27Z

partiql-ast/src/main/java/org/partiql/ast/literal/LiteralApprox.java

+ * TODO docs
+ */
+@EqualsAndHashCode(callSuper = false)
+public class LiteralApprox extends Literal {


With this, I'm struggling to understand why the LiteralInteger class is necessary, especially since the EBNF for approximate numeric is:

<approximate numeric literal> ::= <mantissa> E <exponent>
<mantissa> ::= <exact numeric literal>

With the modeling here, you're implicitly saying that BigDecimal is sufficient for the mantissa (an exact numeric literal, which encompasses the LiteralInteger and LiteralExact variants). Which makes me wonder why LiteralInteger is distinct from LiteralExact.

I had originally thought this as well. There are some edge cases here. For example, 123 and 123. cannot both be represented by a BigDecimal alone since BigDecimal doesn't preserve whether the 123 mantissa is specified with or without a decimal point.

We could provide some boolean field to distinguish between the two (e.g. hasPeriod); however I came to realize that approximate numerics have this same issue and may need that field as well. Example: 123e0 vs 123.e0. To exactly preserve the text, we would even need to distinguish whether the approximate numeric's exponent had an implicit plus (123e0), explicit plus (123e+0), or explicit minus (123e-0).

This led me to think perhaps we should just be passing around strings for the AST representation of numerics, leaving the parser to perform simple validation and type tagging.
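The loss of source text can be seen directly with java.math.BigDecimal. The sketch below uses only the standard String constructor; `NumericTextDemo` and `sameValue` are illustrative names.

```java
import java.math.BigDecimal;

final class NumericTextDemo {
    // True when two source spellings parse to BigDecimals that are equal
    // in both value and scale, i.e. indistinguishable after parsing.
    static boolean sameValue(String a, String b) {
        return new BigDecimal(a).equals(new BigDecimal(b));
    }

    public static void main(String[] args) {
        // Trailing period is accepted but not preserved.
        System.out.println(sameValue("123", "123."));
        // Implicit plus, explicit plus, and explicit minus on a zero exponent.
        System.out.println(sameValue("123e0", "123e+0"));
        System.out.println(sameValue("123e0", "123e-0"));
        // The original spelling cannot be recovered from the parsed value.
        System.out.println(new BigDecimal("123.e0"));
    }
}
```

Every comparison above holds, so any representation built on BigDecimal alone would need extra fields (or the raw string) to round-trip the literal text.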

johnedquinn · 2024-11-22T20:32:25Z

partiql-ast/src/main/java/org/partiql/ast/literal/LiteralExact.java

+    }
+
+    @NotNull
+    public static LiteralExact litExact(BigInteger value) {


Along with my other comment in LiteralApprox, I think it could be useful to have constructors for long and int here.

We could always add them later. With the current design, I didn't actually need a factory method or constructor to create an exact numeric literal using a long or int.

johnedquinn · 2024-11-22T20:42:36Z

partiql-ast/src/main/java/org/partiql/ast/literal/LiteralTypedString.java

+    public static LiteralTypedString litTypedString(@NotNull DataType type, @NotNull String value) {
+        return new LiteralTypedString(type, value);
+    }


To make sure people don't put in bad types, I'd consider making this a final class and then exchanging this method for litTime(int precision, @NotNull String value), litTimeZ(int precision, @NotNull String value), and so on and so forth.

Or, just making them their own classes (LiteralTime, LiteralTimestamp, LiteralDate). I imagine it would have the same benefits that you described in https://github.com/partiql/partiql-lang-kotlin/pull/1650/files#r1851189538, and it would more closely match the EBNF and be more straightforward.

To make sure people don't put in bad types, I'd consider making this a final class and then exchanging this method for litTime(int precision, @NotNull String value), litTimeZ(int precision, @NotNull String value), and so on and so forth.

I like the idea to create date/time/timestamp-specific factory methods. Perhaps I could add those after I get the general design approved first? We could always add these in a subsequent PR or release.

Or, just making them their own classes (LiteralTime, LiteralTimestamp, LiteralDate). I imagine it would have the same benefits that you described in #1650 (files), and it would more closely match the EBNF and be more straightforward.

For now, I prefer modeling these <data type> '<string repr in quotes>' literals under one class that reuses the data type enum. The parser does perform some validation to confirm that the strings match the EBNF rules.
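One way the two positions could be combined is a single final class with a private constructor and type-specific factory methods. The Java sketch below is an assumption-laden illustration: `TypedStringLiteral`, its nested `Type` enum, and the `litDate`/`litTime`/`litTimestamp` factories are hypothetical stand-ins for the PR's LiteralTypedString and DataType, and precision is left out for brevity.

```java
final class TypedStringLiteral {
    // Stand-in for the AST's data type enum, restricted to what this sketch needs.
    enum Type { DATE, TIME, TIMESTAMP }

    final Type type;
    final String value;

    // Private constructor: callers cannot pair a string with an arbitrary type.
    private TypedStringLiteral(Type type, String value) {
        this.type = type;
        this.value = value;
    }

    static TypedStringLiteral litDate(String value) {
        return new TypedStringLiteral(Type.DATE, value);
    }

    static TypedStringLiteral litTime(String value) {
        return new TypedStringLiteral(Type.TIME, value);
    }

    static TypedStringLiteral litTimestamp(String value) {
        return new TypedStringLiteral(Type.TIMESTAMP, value);
    }

    // Renders the <data type> '<string>' form.
    String getText() {
        return type.name() + " '" + value + "'";
    }
}
```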

johnedquinn · 2024-11-22T20:56:36Z

partiql-parser/src/main/antlr/PartiQLTokens.g4

+LITERAL_DECIMAL
+    : DIGIT+ '.' DIGIT*
+    | '.' DIGIT+
+    ;
+
+LITERAL_FLOAT
+    : DIGIT+ ('.' DIGIT*)? 'E' [+-]? DIGIT+
+    | '.' DIGIT+ 'E' [+-]? DIGIT+


If you want to fix the G4 to match the EBNF now regarding +/-, you can. Though, non-blocking. These could be parser rules rather than lexing rules. That way, you can simplify how you write this (and the PartiQLParserDefault).

Suggested change

-LITERAL_DECIMAL
-    : DIGIT+ '.' DIGIT*
-    | '.' DIGIT+
-    ;
-
-LITERAL_FLOAT
-    : DIGIT+ ('.' DIGIT*)? 'E' [+-]? DIGIT+
-    | '.' DIGIT+ 'E' [+-]? DIGIT+
+literalNumeric
+    : SIGN? (literalExactNumeric | literalApproxNumeric)
+    ;
+
+literalExactNumeric
+    : INT
+    | lhs=INT PERIOD rhs=INT?
+    | PERIOD INT
+    ;
+
+literalApproxNumeric
+    : mantissa=literalExactNumeric 'E' exponent=signedInt
+    ;

I played around with some different ANTLR parsing modelings and they led to other issues (listed below). I think parsing numerics as unsigned is functionally fine. We could always change the modeling in the future or add some AstVisitor pass to unify unary +/- with AST numeric literals.

Moving literal parsing into the parser rather than the lexer -- parser rules cannot use the single-quoted 'E', so E would need to be defined in the tokens.

Including the sign in the literal tokens may bind too eagerly -- e.g. 1 - 23 would tokenize -23 together.
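The eager-binding problem can be reproduced outside ANTLR with a toy regex tokenizer. This Java sketch is only an approximation of lexer behavior: `SignedTokenDemo` and `tokenize` are hypothetical, and the input is written without spaces (1-23) because this toy tokenizer never joins a sign and digits across whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SignedTokenDemo {
    // Tokenizes integers and +/- operators. When signInNumber is true,
    // the number token itself may start with an optional sign.
    static List<String> tokenize(String input, boolean signInNumber) {
        String number = signInNumber ? "[+-]?\\d+" : "\\d+";
        Pattern pattern = Pattern.compile(number + "|[+-]");
        Matcher matcher = pattern.matcher(input);
        List<String> tokens = new ArrayList<>();
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Signed number token: the minus is swallowed into the literal.
        System.out.println(tokenize("1-23", true));
        // Unsigned number token: the minus stays a separate operator.
        System.out.println(tokenize("1-23", false));
    }
}
```

With the signed token the subtraction disappears from the token stream, which is why keeping numeric tokens unsigned and handling unary +/- later is the safer split.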

alancai98 · 2024-12-06T04:43:32Z