Implement conversion from ColumnStatistics to NullableInterval #10510

Closed

demetribu
Contributor

Which issue does this PR close?

This change addresses part of #10456.

Rationale for this change

What changes are included in this PR?

  • Added conversion from ColumnStatistics to NullableInterval to handle different nullability scenarios.
  • Defined Interval struct to represent the range of values in a column.
  • Implemented From<ColumnStatistics> for NullableInterval, handling each Precision case and determining nullability.
  • Added Unknown variant to NullableInterval to manage cases with insufficient statistics.
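
The list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Precision`, `ColumnStatistics`, `Interval`, and `NullableInterval` are simplified stand-ins with `i64` bounds, the `exact` helper is hypothetical, and trusting only `Exact` bounds is an assumption made here for soundness.

```rust
// Simplified stand-ins for illustration only; NOT the DataFusion definitions.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

#[derive(Debug, Clone, PartialEq)]
struct ColumnStatistics {
    null_count: Precision<usize>,
    min_value: Precision<i64>,
    max_value: Precision<i64>,
}

/// Closed range [lower, upper] of values a column may hold.
#[derive(Debug, Clone, PartialEq)]
struct Interval {
    lower: i64,
    upper: i64,
}

#[derive(Debug, Clone, PartialEq)]
enum NullableInterval {
    MaybeNull { values: Interval },
    NotNull { values: Interval },
    /// Not enough statistics to say anything about values or nullness.
    Unknown,
}

/// Hypothetical helper: keeps a bound only when it is exact.
fn exact(p: &Precision<i64>) -> Option<i64> {
    match p {
        Precision::Exact(v) => Some(*v),
        Precision::Inexact(_) | Precision::Absent => None,
    }
}

impl From<ColumnStatistics> for NullableInterval {
    fn from(stats: ColumnStatistics) -> Self {
        // Without both bounds there is no range to report.
        let (Some(lower), Some(upper)) = (exact(&stats.min_value), exact(&stats.max_value))
        else {
            return NullableInterval::Unknown;
        };
        let values = Interval { lower, upper };
        // Only an exact null count of zero proves the column has no nulls.
        match stats.null_count {
            Precision::Exact(0) => NullableInterval::NotNull { values },
            _ => NullableInterval::MaybeNull { values },
        }
    }
}
```

With these stand-ins, statistics whose `min_value` and `max_value` are both `Absent` convert to `Unknown`, which is the case the review discussion below focuses on.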

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the logical-expr Logical plan and expressions label May 14, 2024
@demetribu demetribu force-pushed the convert-stats-to-interval-10456 branch from dd9f55f to 0d75e43 Compare May 14, 2024 21:22
        })
        match self.data_type() {
            Some(datatype) => Ok(Self::Null { datatype }),
            None => Err(DataFusionError::NotImplemented(
Contributor

You can use not_impl_err! macro
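
A rough sketch of what this suggestion amounts to. The real `DataFusionError` and `not_impl_err!` live in the `datafusion_common` crate; the enum and macro below are simplified stand-ins so the example compiles on its own, and the error message and the `&'static str` data type are placeholders, since the original message is truncated in the diff.

```rust
// Hypothetical stand-ins; the real error type and macro come from `datafusion_common`.
#[derive(Debug, PartialEq)]
enum DataFusionError {
    NotImplemented(String),
}

// Stand-in macro: formats a message and wraps it in `Err(NotImplemented(..))`.
macro_rules! not_impl_err {
    ($($arg:tt)*) => {
        Err(DataFusionError::NotImplemented(format!($($arg)*)))
    };
}

#[derive(Debug, PartialEq)]
enum NullableInterval {
    Null { datatype: &'static str },
}

/// Before: the error value is built by hand.
fn null_interval_verbose(
    datatype: Option<&'static str>,
) -> Result<NullableInterval, DataFusionError> {
    match datatype {
        Some(datatype) => Ok(NullableInterval::Null { datatype }),
        None => Err(DataFusionError::NotImplemented(
            "placeholder message".to_string(),
        )),
    }
}

/// After: the macro produces the same `Err(..)` with less noise.
fn null_interval_concise(
    datatype: Option<&'static str>,
) -> Result<NullableInterval, DataFusionError> {
    match datatype {
        Some(datatype) => Ok(NullableInterval::Null { datatype }),
        None => not_impl_err!("placeholder message"),
    }
}
```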

Contributor

@jayzhan211 jayzhan211 left a comment

👍

@demetribu demetribu force-pushed the convert-stats-to-interval-10456 branch from 0d75e43 to 9ff8cb4 Compare May 15, 2024 07:15
@@ -1469,6 +1472,8 @@ pub enum NullableInterval {
    MaybeNull { values: Interval },
    /// The value is definitely not null, and is within the specified range.
    NotNull { values: Interval },
    /// Added to handle cases with insufficient statistics
    Unknown,
}
Contributor

MaybeNull with an unbounded interval corresponds to this Unknown variant, doesn't it?

Contributor Author

@demetribu demetribu May 15, 2024

That's an interesting idea. Do you mean we should use ScalarValue::Null in Interval if the value is Absent?

Contributor Author

Honestly, NullableInterval::MaybeNull { Null, Null } looks strange to me.
@jayzhan211 wdyt?

Contributor

Using ScalarValue::Null to represent unbounded intervals is discouraged. Although such intervals can be constructed, performing arithmetic on them is not feasible. Could you detect the data type and instead create ScalarValue::Datatype(None), that is, a typed null such as ScalarValue::Int32(None)? This approach is consistent with how we handle unbounded cases throughout the codebase.
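
To illustrate the difference between an untyped null and a typed null as an unbounded endpoint, here is a minimal sketch. `DataType`, `ScalarValue`, and `Interval` are reduced stand-ins rather than the real arrow/DataFusion types, and `typed_null` and `unbounded` are hypothetical helper names.

```rust
// Reduced stand-ins for illustration; NOT the real arrow/DataFusion types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Float64,
}

#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    /// Untyped null: carries no data type at all.
    Null,
    /// Typed scalars: `None` means "no value", but the type is still known.
    Int32(Option<i32>),
    Float64(Option<f64>),
}

impl ScalarValue {
    /// Hypothetical helper: builds the typed null for a data type.
    fn typed_null(data_type: DataType) -> Self {
        match data_type {
            DataType::Int32 => ScalarValue::Int32(None),
            DataType::Float64 => ScalarValue::Float64(None),
        }
    }

    fn data_type(&self) -> Option<DataType> {
        match self {
            ScalarValue::Null => None,
            ScalarValue::Int32(_) => Some(DataType::Int32),
            ScalarValue::Float64(_) => Some(DataType::Float64),
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
struct Interval {
    lower: ScalarValue,
    upper: ScalarValue,
}

impl Interval {
    /// Unbounded interval whose endpoints are typed nulls.
    fn unbounded(data_type: DataType) -> Self {
        Interval {
            lower: ScalarValue::typed_null(data_type),
            upper: ScalarValue::typed_null(data_type),
        }
    }

    /// The type that arithmetic would dispatch on, if it is known.
    fn data_type(&self) -> Option<DataType> {
        self.lower.data_type()
    }
}
```

In this sketch an interval built from `ScalarValue::Null` endpoints reports no data type, which is why arithmetic on it has nothing to dispatch on, while `Interval::unbounded(DataType::Int32)` still knows it is an Int32 range.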

Contributor Author

I noticed that in some ColumnStatistics both min_value and max_value are Precision::Absent. In such cases, we can't determine the data type for the Interval.

Contributor Author

@demetribu demetribu May 15, 2024

When I tried to play with the stats in CLI mode, I attempted to get them from the ExecutionPlan. Here, there was a default absent value:

Contributor

> I attempted to get them from the ExecutionPlan. Here, there was a default absent value:

I would expect the statistics returned by execution_plan.statistics() to match the schema returned by execution_plan.schema().

Contributor Author

Will re-check tomorrow then

Contributor

> Anywhere we have ColumnStatistics we should also have the schema (and thus the DataType of all columns) -- maybe we just have to pass the schema along with the statistics?

Yes, I think that's the right approach for this issue. Can we open a ticket for this? It should be an easy fix, and ColumnStatistics isn't used in many parts of the code.
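
One way to picture the suggestion of passing the schema's data type alongside the statistics is sketched below. All types are simplified stand-ins, `to_nullable_interval` and `exact` are hypothetical names, and keeping only `Exact` bounds is an assumption of this sketch. With the data type supplied by the caller, missing bounds become a typed unbounded interval and no `Unknown` variant is needed.

```rust
// Simplified stand-ins for illustration; NOT the DataFusion definitions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int32,
    Float64,
}

#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

#[derive(Debug, Clone, PartialEq)]
struct ColumnStatistics {
    null_count: Precision<usize>,
    min_value: Precision<i64>,
    max_value: Precision<i64>,
}

/// A bound of `None` means "unbounded on that side"; the type is always known.
#[derive(Debug, Clone, PartialEq)]
struct Interval {
    data_type: DataType,
    lower: Option<i64>,
    upper: Option<i64>,
}

#[derive(Debug, Clone, PartialEq)]
enum NullableInterval {
    MaybeNull { values: Interval },
    NotNull { values: Interval },
}

/// Hypothetical helper: keeps a bound only when it is exact.
fn exact(p: &Precision<i64>) -> Option<i64> {
    match p {
        Precision::Exact(v) => Some(*v),
        Precision::Inexact(_) | Precision::Absent => None,
    }
}

/// Hypothetical conversion that takes the column's data type from the schema.
fn to_nullable_interval(stats: &ColumnStatistics, data_type: DataType) -> NullableInterval {
    let values = Interval {
        data_type,
        lower: exact(&stats.min_value),
        upper: exact(&stats.max_value),
    };
    // Only an exact null count of zero proves the column has no nulls.
    match stats.null_count {
        Precision::Exact(0) => NullableInterval::NotNull { values },
        _ => NullableInterval::MaybeNull { values },
    }
}
```

In this sketch, fully absent statistics map to `MaybeNull` with an unbounded typed interval, which matches the earlier remark that `MaybeNull` with an unbounded interval plays the role of `Unknown`.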

Contributor Author

@alamb
I have conducted some additional tests on various queries and observed the following results:

CREATE TABLE data_table (
    id INT,
    value INT
);
INSERT INTO data_table (id, value) VALUES
(1, 100),
(2, 200),
(3, 300),
(4, 400),
(5, 500),
(6, 600),
(7, 700),
(8, 800),
(9, 900),
(10, 1000);
SELECT id, value FROM data_table WHERE value > 500; 

Log:
Schema: id: Int32, value: Int32
Column 0: ColumnStatistics { null_count: Inexact(0), max_value: Exact(Int32(NULL)), min_value: Exact(Int32(NULL)), distinct_count: Absent }
Column 1: ColumnStatistics { null_count: Inexact(0), max_value: Inexact(Int32(NULL)), min_value: Inexact(Int32(501)), distinct_count: Absent }

SELECT AVG(value) AS average_value, SUM(value) AS total_value FROM data_table; 

Log:
Schema: average_value: Float64, total_value: Int64
Column 0: ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, distinct_count: Absent }
Column 1: ColumnStatistics { null_count: Absent, max_value: Absent, min_value: Absent, distinct_count: Absent }

SELECT a.id AS id_a, a.value AS value_a, b.id AS id_b, b.value AS value_b
FROM data_table a
CROSS JOIN data_table b;

Log:
Schema: id_a: Int32, value_a: Int32, id_b: Int32, value_b: Int32
Column 0: ColumnStatistics { null_count: Exact(0), max_value: Absent, min_value: Absent, distinct_count: Absent }
Column 1: ColumnStatistics { null_count: Exact(0), max_value: Absent, min_value: Absent, distinct_count: Absent }
Column 2: ColumnStatistics { null_count: Exact(0), max_value: Absent, min_value: Absent, distinct_count: Absent }
Column 3: ColumnStatistics { null_count: Exact(0), max_value: Absent, min_value: Absent, distinct_count: Absent }

Contributor

@alamb alamb left a comment

Thanks @dmitrybugakov @jayzhan211 and @berkaysynnada

@@ -1469,6 +1472,8 @@ pub enum NullableInterval {
    MaybeNull { values: Interval },
    /// The value is definitely not null, and is within the specified range.
    NotNull { values: Interval },
    /// Added to handle cases with insufficient statistics
Contributor

Suggested change
- /// Added to handle cases with insufficient statistics
+ /// No information is known about this interval's values or nullness

@@ -1469,6 +1472,8 @@ pub enum NullableInterval {
    MaybeNull { values: Interval },
    /// The value is definitely not null, and is within the specified range.
    NotNull { values: Interval },
    /// Added to handle cases with insufficient statistics
    Unknown,
}
Contributor

> I noticed that in some ColumnStatistics both min_value and max_value are Precision::Absent. In such cases, we can't determine the data type for the Interval.

Anywhere we have ColumnStatistics we should also have the schema (and thus the DataType of all columns) -- maybe we just have to pass the schema along with the statistics?

@demetribu demetribu force-pushed the convert-stats-to-interval-10456 branch from 9ff8cb4 to b745673 Compare May 15, 2024 19:10
@demetribu
Contributor Author

@alamb
Do we want to introduce a data type field in ColumnStatistics in this PR? I found that the change will affect the Proto definitions and may require other changes that are not clear to me at the moment, for example here:

column_statistics: get_col_stats_vec(null_counts, max_values, min_values),

if idx < self.file_schema.fields().len() {

Or should I close the PR and create a new issue instead?

@demetribu demetribu closed this May 17, 2024
@demetribu demetribu deleted the convert-stats-to-interval-10456 branch May 17, 2024 08:57