Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safely Write IntervalMonthDayNanoArray to parquet or Throw error #6299

Conversation

aykut-bozkurt
Copy link
Contributor

@aykut-bozkurt aykut-bozkurt commented Aug 24, 2024

Supports writing IntervalMonthNanoArray to parquet only if its nanoseconds part does not exceed i32::MAX. (since interval logical type, in parquet, is represented as 4 bytes months + 4 bytes days + 4 bytes milliseconds = 12 bytes.)

When the nanoseconds part does not exceed i32::MAX, then it is safe to write it to parquet after truncating it to 4 bytes. Otherwise, we throw error as we lose precision.

Which issue does this PR close?

Closes #6298.

Rationale for this change

This unblocks the ones who needs to write arrow IntervalMonthDayNano with milliseconds precision to parquet. It currently always throws error even if it is safe to write the value to parquet.

What changes are included in this PR?

When the nanoseconds part does not exceed i32::MAX, then it is safe to write it to parquet after truncating it to 4 bytes.
When the nanoseconds part exceeds i32::MAX, then we throw error.

Are there any user-facing changes?

No.

Supports writing IntervalMonthNanoArray to parquet only if its `nanoseconds` part
does not exceed i32::MAX. (since interval logical type, in parquet, is represented as
4 bytes month + 4 bytes days + 4 bytes milliseconds = 12 bytes.)

When the `nanoseconds` part does not exceed i32::MAX, then it is safe to write it to parquet
after truncating it to 4 bytes.

This unblocks the ones who needs to write arrow IntervalMonthDayNano with milliseconds
precision to parquet. It currently always throws error even if it is safe to write
the value to parquet.
@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 24, 2024
@tustvold
Copy link
Contributor

This appears to be writing nanoseconds as milliseconds, which is incorrect?

@aykut-bozkurt
Copy link
Contributor Author

aykut-bozkurt commented Aug 24, 2024

This appears to be writing nanoseconds as milliseconds, which is incorrect?

Physically it will not lose precision since we write it only if it <= i32::MAX. But yeah, logically seems incorrect since we write a IntervalNano value to parquet's milliseconds part.

I think it could be an option to check a metadata key like adjusted_to_milliseconds or something like that. In that way, writer would know what they do.

There is currently no way to write the interval with millis to parquet via arrow. And it seems that we can get rid of the current limitation.

@tustvold
Copy link
Contributor

tustvold commented Aug 24, 2024

Right but this is not compatible with the parquet logical type definition, so is simply incorrect...

There is currently no way to write the interval with millis to parquet via arrow

The broader issue here is that parquet doesn't support nanosecond precision intervals, and we're constrained by what the format itself supports - apache/parquet-format#313

@aykut-bozkurt
Copy link
Contributor Author

Right but this is not compatible with the parquet logical type definition, so is simply incorrect...

There is currently no way to write the interval with millis to parquet via arrow

The broader issue here is that parquet doesn't support nanosecond precision intervals, and we're constrained by what the format itself supports - apache/parquet-format#313

Yes, I totally understand the point. But do you think below approach is broken or fragile in the context of arrow to parquet reader/writer?

On writing to parquet:

  • when user passes a metadata key adjusted_as_millisec =>
    • arrow to parquet writer can safely truncate 8 bytes nanoseconds part to parquet's 4 bytes milliseconds part (user is fully aware of the conversion. The user tells the value is already in milliseconds),
  • when metadata key is not passed =>
    • arrow to parquet writer fails as now.

On reading back from parquet:

  • arrow parquet reader reads milliseconds part, then it safely converts it to nanoseconds by multiplying it by 1_000_000 to get a valid IntervalNano array. (reading is not an issue)

@aykut-bozkurt aykut-bozkurt marked this pull request as draft August 25, 2024 10:35
@tustvold
Copy link
Contributor

Perhaps #1938 might work for you?

@aykut-bozkurt
Copy link
Contributor Author

Perhaps #1938 might work for you?

It looks like works, but AFAIU, it is not in wip for interval type yet, right?

@tustvold
Copy link
Contributor

Right I was suggesting it as a potential path forward - I don't think we can merge something that either writes non-spec compliant parquet, or relies on creating non-spec compliant arrow arrays

@tustvold tustvold closed this Oct 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Safely Write IntervalMonthDayNanoArray to parquet or Throw error
2 participants