Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ion String to Date type conversion is not supported #100

Open
cheqianh opened this issue Oct 4, 2022 · 2 comments
Open

Ion String to Date type conversion is not supported #100

cheqianh opened this issue Oct 4, 2022 · 2 comments

Comments

@cheqianh
Copy link
Contributor

cheqianh commented Oct 4, 2022

Given an Ion file

{next_date: "2022-09-06"}

and a schema

CREATE EXTERNAL TABLE test(
  `next_date` date
)
...

ion-hive-serde will throw an error com.amazon.ion.impl.lite.IonStringLite cannot be cast to com.amazon.ion.IonTimestamp because an Ion string (within double quote above) cannot be cast to Hive date type in ion-hive-serde today.

More specifically, we rely on object inspectors to convert Ion timestamp to Hive date but there is no Ion string to Hive date conversion support in ion-hive-serde.

Should we support conversion from Ion string to the Hive temporal types?

@cheqianh cheqianh changed the title Ion String to Date type is not supported Ion String to Date type conversion is not supported Oct 4, 2022
@adellahlou
Copy link

I'll share my thoughts. In my opinion, Ion defines a type spec to support different use cases (for example it supports decimal with arbitrary precision). Engines may not be able to fully support all of Ion's type spec, but can make a best effort to adhere as much as possible. "Best effort" is ambiguous and will look different if you leave it up to engines or end users to work around the limitations of the Ion SerDe. This hurts the portability of Ion across different engines.

If the serdes does not define these conversions, then end users will either make their own parsing functions/libraries (devaluing this library) or rely on engines casting or functions (devaluing Ion's type spec, but adhering to the engines Type Spec). This means customers may get different values depending on the engine (example Presto vs Spark casts) or different values depending on their function implementation.

If you define these transformations inside of the SerDes, then the Ion Hive SerDe becomes the de facto standard for "best effort support" of Ion types and customers may see more consistency across engines. While it's true that the Ion Spec may not specify these conversions, I think centralizing it in the library is a great way to get as close to Ion's true specfiication.

@jobarr-amzn
Copy link
Contributor

This Hive language manual section on types says of timestamps:

Supports traditional UNIX timestamp with optional nanosecond precision.

Supported conversions:

  • Integer numeric types: Interpreted as UNIX timestamp in seconds
  • Floating point numeric types: Interpreted as UNIX timestamp in seconds with decimal precision
  • Strings: JDBC compliant java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal place precision)

Timestamps are interpreted to be timezoneless and stored as an offset from the UNIX epoch. Convenience UDFs for conversion to and from timezones are provided (to_utc_timestamp, from_utc_timestamp).
All existing datetime UDFs (month, day, year, hour, etc.) work with the TIMESTAMP data type.

Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]. If they are in another format, declare them as the appropriate type (INT, FLOAT, STRING, etc.) and use a UDF to convert them to timestamps.

Timestamps in Parquet files may be stored as int64 (as opposed to int96) by setting hive.parquet.write.int64.timestamp=true and hive.parquet.timestamp.time.unit to a default storage time unit. ("nanos", "micros", "millis"; default: "micros"). Note that because only 64 bits are stored, int64 timestamps stored as "nanos" will be stored as NULL if outside the range of 1677-09-21T00:12:43.15 and 2262-04-11T23:47:16.8.

On the table level, alternative timestamp formats can be supported by providing the format to the SerDe property "timestamp.formats" (as of release 1.2.0 with HIVE-9298). For example, yyyy-MM-dd'T'HH:mm:ss.SSS,yyyy-MM-dd'T'HH:mm:ss.

Practically it looks like the Hive implementation of JsonSerde uses Hive's TimestampParser [source] to do this.

This seems like something we could incorporate into ion-hive-serde.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants