If a precision timestamp literal is encountered there is no way to know the precision #611

westonpace · 2024-03-19T19:17:12Z

Currently precision timestamp literals are defined as:

      // If the precision is 6 or less then this is the microseconds since the UNIX epoch
      // If the precision is more than 6 then this is the nanoseconds since the UNIX epoch
      uint64 precision_timestamp = 34;
      uint64 precision_timestamp_tz = 35;

However, there is no way to know if the precision is 6 or less because the literal doesn't have to be associated with a type. For example, consider:

SELECT * FROM my_table WHERE my_ts_col < timestamp (3) '2004-10-19 10:23:54.123'

This will turn into a plan with a literal value in the "less than" expression:

"scalarFunction":{
  "functionReference":3,
  "outputType":{..},
  "arguments":[
    { ... },
   ...
   "literal":{ "precision_timestamp": 1098181434123000 }

However, there is no way of knowing what the precision of that timestamp is. I could infer it by tracing back and realizing that the other value in the expression has precision X but:

That's a lot of work, and I don't have to do this for other literals
They don't technically have to be the same, depending on what range of functions the producer / consumer supports

The text was updated successfully, but these errors were encountered:

EpsilonPrime · 2024-03-21T02:03:52Z

This behavior is identical to what happens with fixed_char, var_char, and decimal literals. The way this is addressed in Isthmus is to immediately cast the literal to the appropriate precision. For instance, a fixed character would be represented with the string "fruit" and then later is cast to the actual desired with such as fixed_char[20].

westonpace · 2024-03-21T02:54:09Z

First, I'm not sure that's true with varchar or decimal, the precision is part of the literal:

    message VarChar {
      string value = 1;
      uint32 length = 2;
    }

    message Decimal {
      // little-endian twos-complement integer representation of complete value
      // (ignoring precision) Always 16 bytes in length
      bytes value = 1;
      // The maximum number of digits allowed in the value.
      // the maximum precision is 38.
      int32 precision = 2;
      // declared scale of decimal literal
      int32 scale = 3;
    }

With fixed_char / fixed_binary I can at least form some kind of "minimum precision literal". E.g. given the literal "foo" I can turn it into a fixed_char<3>.

I can't do that with timestamp. Given the literal 1710989532 how am I supposed to know if this is:

Timestamp(0) - Thursday, March 21, 2024 2:52:12 AM
Timestamp(3) - Tuesday, January 20, 1970 7:16:29.532 PM
Timestamp(6) - Thursday, January 1, 1970 12:28:30.989532 AM
Timestamp(9) - Thursday, January 1, 1970 12:00:01.710989532 AM

All of those are valid timestamps.

EpsilonPrime · 2024-03-21T04:54:31Z

The core issue is that there isn't a fixed issue of the literal (it's either microseconds or nanoseconds but not both). I'd argue that the definition should just be nanoseconds since epoch.

Having the precision in the type would certainly make things easier for consumers (by greatly reducing the number of casts where literals are present).

The precision of varchar and decimal might be related to an older specification version, I've got examples of explicit casting to add precision. It also might be that the casting is there because the SQL parser is recognizing string an integer values and needed to cast them after the fact.

westonpace · 2024-03-21T11:27:34Z

Ah, good point, there are really only two possibilities, not four.

Yes, if it was just nanoseconds then that would be easier. That was part of the original proposal. Unfortunately, the range of 64-bit nanosecond timestamps is only about ~550 years. At the time we declared 10k years was the range of our timestamp data type. I believe we ended up dropping any declaration of range as part of the PR.

Still, nanoseconds-only would be rather limiting (we support 10k years on date, interval_year and interval_day).

However, if we want to do nanoseconds-only, then we could make it a 128-bit integer.

It also might be that the casting is there because the SQL parser is recognizing string an integer values and needed to cast them after the fact.

That seems very possible. In SQL there are only a few literal types like strings (e.g. 'foo') and numbers (e.g. 11). Even timestamps are just "strings" in the beginning. Late inference is very much a necessity. The fact that Substrait doesn't require this is definitely convenient for consumers.

Although, some of that late inference is still often needed anyways. SQL parsers will often turn a numeric literal either into the smallest available integer type or just always use u64. Either way a consumer will probably have to apply some kind of cast to apply the literal in an operation so that it can fit the literal to the data type.

westonpace · 2024-03-21T11:29:09Z

However, if we want to do nanoseconds-only, then we could make it a 128-bit integer.

That being said, I would still rather just declare the precision as part of the message. This is more consistent with our current definition of decimal / varchar and then I don't have to worry about dealing with bigint.

EpsilonPrime · 2024-03-21T22:46:44Z

I'm onboard with fully specifying the literal. We should also include the missing timezone part in the tz variant as well.

westonpace · 2024-03-22T15:05:05Z

We should also include the missing timezone part in the tz variant as well.

I'm not sure what you mean by this. The timestamp_tz data type does not store a timezone-per-cell. What is the missing timezone part?

### Rationale for this change See #40695 ### What changes are included in this PR? This PR does a few things: * Substrait is upgraded to the latest version * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611). * Support is added for the following arrow-specific types: * fp16 * date_millis * time_seconds * time_millis * time_nanos * large_string * large_binary When adding support for the new timestamp types I also relaxed the restrictions on the time zone column. Substrait puts time zone information in the function and not the type. In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`. However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone. With this latest change, we treat any non-empty timezone as a timezone_tz type. In addition, I have enabled conversions from "encoded types" to their unencoded representation. E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`. At a logical expression / plan perspective these encodings are irrelevant. If anything, they may belong in a more physical plan representation. Should a need for them arise we can dig into it more later. However, I believe it is better to err on the side of generating "something" rather than failing in these cases. I don't consider this last change critical and can back it out if need be. ### Are these changes tested? Yes, I added new unit tests ### Are there any user-facing changes? Yes, via the Substrait conversion. These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported". * GitHub Issue: #40695 Lead-authored-by: Weston Pace <[email protected]> Co-authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Weston Pace <[email protected]>

### Rationale for this change See apache#40695 ### What changes are included in this PR? This PR does a few things: * Substrait is upgraded to the latest version * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611). * Support is added for the following arrow-specific types: * fp16 * date_millis * time_seconds * time_millis * time_nanos * large_string * large_binary When adding support for the new timestamp types I also relaxed the restrictions on the time zone column. Substrait puts time zone information in the function and not the type. In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`. However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone. With this latest change, we treat any non-empty timezone as a timezone_tz type. In addition, I have enabled conversions from "encoded types" to their unencoded representation. E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`. At a logical expression / plan perspective these encodings are irrelevant. If anything, they may belong in a more physical plan representation. Should a need for them arise we can dig into it more later. However, I believe it is better to err on the side of generating "something" rather than failing in these cases. I don't consider this last change critical and can back it out if need be. ### Are these changes tested? Yes, I added new unit tests ### Are there any user-facing changes? Yes, via the Substrait conversion. These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported". * GitHub Issue: apache#40695 Lead-authored-by: Weston Pace <[email protected]> Co-authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Weston Pace <[email protected]>

Blizzara · 2024-07-12T13:59:33Z

Done by #659

westonpace mentioned this issue Mar 20, 2024

GH-40695 [C++] Expand Substrait type support apache/arrow#40696

Merged

westonpace mentioned this issue Mar 22, 2024

[C++] Add support for precision timestamp literals apache/arrow#40741

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

If a precision timestamp literal is encountered there is no way to know the precision #611

If a precision timestamp literal is encountered there is no way to know the precision #611

westonpace commented Mar 19, 2024

EpsilonPrime commented Mar 21, 2024

westonpace commented Mar 21, 2024

EpsilonPrime commented Mar 21, 2024

westonpace commented Mar 21, 2024

westonpace commented Mar 21, 2024

EpsilonPrime commented Mar 21, 2024

westonpace commented Mar 22, 2024

Blizzara commented Jul 12, 2024 •

edited

Loading

If a precision timestamp literal is encountered there is no way to know the precision #611

If a precision timestamp literal is encountered there is no way to know the precision #611

Comments

westonpace commented Mar 19, 2024

EpsilonPrime commented Mar 21, 2024

westonpace commented Mar 21, 2024

EpsilonPrime commented Mar 21, 2024

westonpace commented Mar 21, 2024

westonpace commented Mar 21, 2024

EpsilonPrime commented Mar 21, 2024

westonpace commented Mar 22, 2024

Blizzara commented Jul 12, 2024 • edited Loading

Blizzara commented Jul 12, 2024 •

edited

Loading