GH-43594: [C++] Remove std::optional from arrow::ArrayStatistics::is_{min,max}_exact #43595

Merged
merged 1 commit into from
Aug 16, 2024

Member

@kou kou commented Aug 7, 2024

Rationale for this change

We don't need an "unknown" state. If the flags aren't set, we can treat the values as not exact.

What changes are included in this PR?

Remove std::optional from arrow::ArrayStatistics::is_{min,max}_exact.
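As a rough sketch of what this means for callers, the snippet below uses simplified stand-in structs rather than the real `arrow::ArrayStatistics`. The names `StatsBefore`, `StatsAfter`, and `Collapse` are hypothetical, `ValueType` is reduced to `int64_t` for brevity, and the `false` default is an assumption drawn from the rationale above (an unset flag is treated as not exact).

```cpp
#include <cstdint>
#include <optional>

// Simplified stand-in for the old layout: exactness has three states
// (true, false, unset).
struct StatsBefore {
  std::optional<int64_t> min = std::nullopt;
  std::optional<bool> is_min_exact = std::nullopt;
};

// Simplified stand-in for the new layout: exactness has two states.
// The default of false is an assumption based on the PR rationale.
struct StatsAfter {
  std::optional<int64_t> min = std::nullopt;
  bool is_min_exact = false;
};

// Hypothetical helper: map the old tri-state flag onto the new flag,
// treating "unset" the same as "not exact".
StatsAfter Collapse(const StatsBefore& before) {
  StatsAfter after;
  after.min = before.min;
  after.is_min_exact = before.is_min_exact.value_or(false);
  return after;
}
```

With the plain `bool`, callers can test `stats.is_min_exact` directly instead of checking `has_value()` first.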

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.


github-actions bot commented Aug 7, 2024

⚠️ GitHub issue #43594 has been automatically assigned in GitHub to PR creator.

@@ -47,14 +47,14 @@ struct ARROW_EXPORT ArrayStatistics {
   /// \brief The minimum value, may not be set
   std::optional<ValueType> min = std::nullopt;

-  /// \brief Whether the minimum value is exact or not, may not be set
-  std::optional<bool> is_min_exact = std::nullopt;
+  /// \brief Whether the minimum value is exact or not
Member

This looks good to me, but IMO is_min_exact = false might still have exact statistics; the reader just cannot guarantee that, since apache/parquet-format#216 is new in 2.10 :-(
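One way to read this point from a consumer's side, sketched with a simplified stand-in struct (not the real Arrow type; `MinStats` and `ExactMin` are hypothetical names): `false` means "not known to be exact", so code that needs the true minimum should only rely on the value when the flag is set.

```cpp
#include <cstdint>
#include <optional>

// Simplified stand-in for the statistics struct (not the real Arrow type).
struct MinStats {
  std::optional<int64_t> min = std::nullopt;
  bool is_min_exact = false;
};

// Hypothetical consumer: return the minimum only when it is present and
// flagged as exact. A false flag does not prove the value is inexact; it
// only means the reader could not guarantee exactness.
std::optional<int64_t> ExactMin(const MinStats& stats) {
  if (stats.min.has_value() && stats.is_min_exact) {
    return stats.min;
  }
  return std::nullopt;
}
```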

Member Author

@kou kou Aug 10, 2024

Are the following correct?

  1. Parquet 2.9 or earlier data don't have is_min_value_exact/is_max_value_exact
  2. Parquet 2.9 or earlier data use only exact min/max
  3. Parquet 2.10 or later data use exact min/max or non-exact min/max
  4. Parquet 2.10 or later data may use exact min/max without is_min_value_exact/is_max_value_exact

You're focusing on case 2, right? Can our Parquet reader detect the Parquet version? If so, can we always set is_min_exact/is_max_exact to true for Parquet 2.9 or earlier?

Member

Sorry for the late reply.

Parquet 2.9 or earlier data don't have is_min_value_exact/is_max_value_exact

Yes

Parquet 2.9 or earlier data use only exact min/max

I guess not. I'll send a mail to the mailing list to make sure.

Parquet 2.10 or later data use exact min/max or non-exact min/max

Yes

Parquet 2.10 or later data may use exact min/max without is_min_value_exact/is_max_value_exact

right

Member

If so, can we always set is_min_exact/is_max_exact to true for Parquet 2.9 or earlier?

Hmmm I'll try to make it clear

Member Author

Thanks!

Member

@mapleFU mapleFU Aug 16, 2024

Anyway, we can mention that "exact=false" can also mean the value is exact, lol

Or we can denote that the parquet-c++ output is exact.

Member Author

OK. Let's use exact=true for Apache Parquet C++ output.
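A minimal sketch of one possible reading of this decision. This is an assumption for illustration, not the merged implementation: `ResolveExact` is a hypothetical helper, `file_flag` stands for the optional is_min_value_exact/is_max_value_exact value from the file metadata, and `written_by_parquet_cpp` stands for whatever check identifies Apache Parquet C++ output.

```cpp
#include <optional>

// Hypothetical helper (not part of Arrow or Parquet C++).
// - If the file carries an explicit exactness flag, use it as-is.
// - If the flag is absent, report exact only for data known to come from
//   Apache Parquet C++, following the comment above.
// - Otherwise fall back to false, meaning "cannot guarantee exactness".
bool ResolveExact(std::optional<bool> file_flag, bool written_by_parquet_cpp) {
  if (file_flag.has_value()) {
    return *file_flag;
  }
  return written_by_parquet_cpp;
}
```

Under this sketch, a file without the flag from an unknown writer resolves to `false`, which matches the earlier point that `false` does not rule out an exact value.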

…s::is_{min,max}_exact

We don't need an "unknown" state. If they aren't set, we can treat them
as not exact.
@kou kou force-pushed the cpp-array-statistics-exact branch from abcfb72 to 8331bf5 Compare August 10, 2024 05:53
@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label Aug 10, 2024
@kou kou merged commit b80a51a into apache:main Aug 16, 2024
42 of 45 checks passed
@kou kou removed the awaiting changes label Aug 16, 2024
@kou kou deleted the cpp-array-statistics-exact branch August 16, 2024 05:23
@github-actions github-actions bot added the awaiting changes label Aug 16, 2024
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit b80a51a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
