Implement the contained method of RowGroupPruningStatistics #8669
Conversation
Hi @alamb, could you help review this PR?

Thank you @yahoNanJing -- I have this on my review queue. @NGA-TRAN and I have been thinking about how to make `PruningPredicate` handle cases where not all columns have statistics (e.g. #7869), which may be related to your use case of large IN lists.

I believe CI will be fixed by merging up from main -- the clippy issue was fixed in #8662
Thank you @yahoNanJing -- this code is looking quite good to me. I didn't quite make it through the tests today before I ran out of time, but I will finish the review tomorrow.
```diff
 ) -> Option<BooleanArray> {
-    None
+    let min_values = self.min_values(column)?;
```
It is very unfortunate that we have to use `ScalarValue`s here when the underlying code uses `ArrayRef`s (though I realize this PR is just following the same model as the existing code).
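As an aside, a minimal sketch of bridging the two representations, using `ScalarValue::iter_to_array` from `datafusion_common` (the helper name here is hypothetical, not code from this PR):

```rust
use arrow::array::ArrayRef;
use datafusion_common::{Result, ScalarValue};

// Hypothetical helper: gather per-row-group ScalarValue statistics back into
// a single ArrayRef so that the array-based kernels could be reused.
fn scalars_to_array(values: Vec<ScalarValue>) -> Result<ArrayRef> {
    ScalarValue::iter_to_array(values)
}
```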
```rust
let has_null = c.statistics()?.null_count() > 0;
let mut known_not_present = true;
for value in values {
    // If it's null, check whether the null exists from the statistics
```
If `value` is null, I think it means that the statistics value is not known. To the best of my knowledge, NULL values on the column are never encoded in parquet statistics. Thus I think this check needs to be something like:

```rust
if value.is_null() {
    known_not_present = false;
    break;
}
```
Hi @alamb, this case is for filters like `col is null`. It's not related to the statistics; the values are from the filter literals.
Thank you for the clarification @yahoNanJing. I am still confused -- I don't think `col IS NULL` is handled by the `LiteralGuarantee` code, so I am not sure how it would result in a `value` of NULL here.

`col IN (NULL)` (as opposed to `col IS NULL`) always evaluates to NULL (can never be `true`), which perhaps we should also handle 🤔
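To illustrate the three-valued logic in question, here is a standalone sketch with NULL modeled as `None` (illustrative only, not DataFusion code):

```rust
// SQL equality under three-valued logic: a NULL on either side yields NULL,
// which is never `true`, so `col IN (NULL)` can never match a row.
fn sql_eq(a: Option<i64>, b: Option<i64>) -> Option<bool> {
    match (a, b) {
        (Some(a), Some(b)) => Some(a == b), // both sides known
        _ => None,                          // any NULL input => NULL result
    }
}

fn main() {
    assert_eq!(sql_eq(Some(1), None), None); // col = NULL  -> NULL, not false
    assert_eq!(sql_eq(None, None), None);    // NULL = NULL -> NULL as well
}
```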
I think `col in (NULL)` will not match anything, the same as `col = null`, which means `col in (a,b,c)` is the same as `col in (a,b,c, null)`. Is there any rule to remove the `null` from the IN list 🤔 @alamb
> I think `col in (NULL)` will not match anything, the same as `col = null`, which means `col in (a,b,c)` is the same as `col in (a,b,c, null)`. Is there any rule to remove the `null` from the IN list 🤔 @alamb
@Ted-Jiang -- I think we discussed this in #8688
```rust
// The filter values should be cast to the boundary's data type
if !can_cast_types(&value.data_type(), &target_data_type) {
    return None;
}
let value =
    cast_scalar_value(value, &target_data_type, &DEFAULT_CAST_OPTIONS)
        .ok()?;
```
I think you could combine these checks:
Suggested change:

```diff
-// The filter values should be cast to the boundary's data type
-if !can_cast_types(&value.data_type(), &target_data_type) {
-    return None;
-}
-let value =
-    cast_scalar_value(value, &target_data_type, &DEFAULT_CAST_OPTIONS)
-        .ok()?;
+// The filter values should be cast to the boundary's data type
+let Ok(value) = cast_scalar_value(value, &target_data_type, &DEFAULT_CAST_OPTIONS) else {
+    return None;
+};
```
Good suggestion. I will refine it.
```rust
let schema = Arc::new(Schema::new(vec![
    Field::new("c1", DataType::Int32, false),
    Field::new("c2", DataType::Boolean, false),
]));
let schema_descr = arrow_to_parquet_schema(&schema).unwrap();

// int > 1 and IsNull(bool) => c1_max > 1 and bool_null_count > 0
```
Suggested change:

```diff
-// int > 1 and IsNull(bool) => c1_max > 1 and bool_null_count > 0
+// c1 > 15 and c2 IS NULL => c1_max > 15 and bool_null_count > 0
```
First of all, thank you again @yahoNanJing -- this is really important functionality.

After reviewing the tests carefully, I think this PR needs the following before merging:

- We need to resolve the `col IS NULL` question (I may just still be confused)
- Some additional tests to avoid regressions

In terms of additional tests, I think the current tests would not fail if certain behaviors of the code were broken. Thus I suggest we add the following cases that have special handling in the code:

- A test ensuring that multiple guarantees are correctly applied. For example, a predicate like `col1 IN (10) AND col2 IN (20)` and row groups such that `col1 IN (10)` can be true but `col2 IN (20)` cannot (see the sketch after this list).
- A test with a predicate on a column that has no statistics.
- A test where the statistics return the incorrect data type (e.g. so that the cast has to be present).
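To make the first additional test concrete, here is a sketch of just the predicate shape, assuming the `datafusion_expr` builder API (`col`, `lit`, `Expr::in_list`); the row-group scaffolding from the existing tests would surround it, and the function name is hypothetical:

```rust
use datafusion_expr::{col, lit, Expr};

// Predicate for the "multiple guarantees" case: col1 IN (10) AND col2 IN (20).
// Each IN list should yield its own literal guarantee; the row groups can then
// be chosen so the first guarantee can hold while the second prunes.
fn multiple_guarantee_predicate() -> Expr {
    col("col1")
        .in_list(vec![lit(10)], false)
        .and(col("col2").in_list(vec![lit(20)], false))
}
```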
```diff
@@ -598,19 +662,39 @@ mod tests {
             ),
             vec![1]
         );
```
I don't think this new test case covers the new code in this PR (as `col IS NULL` doesn't result in a literal guarantee). Is the idea to extend the test coverage?
If I remove the check `if has_null && value.is_null() { known_not_present = false; break; }`, the unit test `row_group_pruning_predicate_eq_null_expr` would fail. Therefore I think the parameter `values` can be a set with a single element that is a null scalar value.
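For reference, such a null scalar can be constructed directly with the public `ScalarValue` API (a minimal standalone sketch, independent of this PR's code):

```rust
use datafusion_common::ScalarValue;

fn main() {
    // A null literal like the one a filter such as `c2 = NULL` produces:
    let v = ScalarValue::Boolean(None);
    assert!(v.is_null()); // the case the `value.is_null()` check guards
}
```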
I filed #8688 to track simplifying expressions that have null literals in them (e.g. `X IN (NULL)`).
```diff
@@ -632,6 +716,29 @@ mod tests {
             ),
             vec![1]
         );

// c1 < 5 and c2 IS NULL => c1_min < 5 and bool_null_count > 0
```
Suggested change:

```diff
-// c1 < 5 and c2 IS NULL => c1_min < 5 and bool_null_count > 0
+// c1 < 5 and c2 = NULL => c1_min < 5 and bool_null_count > 0
```
```rust
let schema = Arc::new(Schema::new(vec![
    Field::new("c1", DataType::Int32, false),
    Field::new("c2", DataType::Boolean, false),
]));
let schema_descr = arrow_to_parquet_schema(&schema).unwrap();

// c1 > 15 and c2 IS NULL => c1_max > 15 and bool_null_count > 0
```
Suggested change:

```diff
-// c1 > 15 and c2 IS NULL => c1_max > 15 and bool_null_count > 0
+// c1 > 15 and c2 = NULL => c1_max > 15 and bool_null_count > 0
```
```rust
let groups = gen_row_group_meta_data_for_pruning_predicate();

let metrics = parquet_file_metrics();
// bool = NULL always evaluates to NULL (and thus will not
```
I think this comment is out of date -- both row groups are actually pruned, as the vec is empty.
```diff
@@ -854,9 +956,9 @@ mod tests {
         let rgm2 = get_row_group_meta_data(
             &schema_descr,
             vec![ParquetStatistics::fixed_len_byte_array(
-                // 5.00
+                // 10.00
```
Can you explain why you changed this value in the test?

Would it be possible to change this PR so it does not modify existing tests, making it clear that the code change in this PR doesn't cause a regression in existing behavior? Maybe we can add a new test case with the different values?
Its main purpose is to prune rgm2 and keep rgm1 with the filter `c1 in (8, 300, 400)`. If it's a concern, maybe I can introduce an independent new test case for it.
```rust
        None,
        Some(&pruning_predicate),
        &metrics
    ),
    vec![1, 2]
);
// c1 in (10, 300, 400)
```
Isn't the first value `0.8`?

Suggested change:

```diff
-// c1 in (10, 300, 400)
+// c1 in (0.8, 300, 400)
```

The same comment applies to several other comments below.
My bad. It should be `// c1 in (8, 300, 400)`.
```rust
    Some(&pruning_predicate),
    &metrics
),
vec![0, 2]
```
Suggested change:

```diff
+// rgm2 (index 1) has ranges between 10 and 200. None of the
+// constants are in that range so expect this is pruned by literals
 vec![0, 2]
```
Thanks @alamb for your review and suggestions. It's my bad. Maybe multiple rules are mixed together, so that the …
Thanks @yahoNanJing -- I thought about this more. What would you think about adding the code that checks the …? Perhaps we could add the check after the call to …? That would have the benefits of: …
Marking as draft so it is clear this PR is not waiting on feedback |
The reason that the added unit test, https://github.com/apache/arrow-datafusion/blob/b37fc00d1d04114c61c9d2312cbf5044584df3d8/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L996-L1045, does not go through the …
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?

Closes #8668.

Rationale for this change

The basic idea is to check whether all of the filter values fall outside the min-max boundary of a row group. This implementation will be very useful for the case of `high_cardinality_col in (v1, v2, ...)` with the underlying parquet files sorted by `high_cardinality_col`.
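A minimal sketch of that idea, simplified to plain `i64` bounds rather than the actual `ScalarValue`-based statistics types:

```rust
// If every literal in the IN list falls outside a row group's [min, max]
// range, the row group cannot contain a matching row and can be pruned.
fn known_not_present(min: i64, max: i64, values: &[i64]) -> bool {
    values.iter().all(|&v| v < min || v > max)
}

fn main() {
    // Row group with c1 in [10, 200]: none of (8, 300, 400) can match.
    assert!(known_not_present(10, 200, &[8, 300, 400]));
    // 100 lies inside the range, so the row group must be kept.
    assert!(!known_not_present(10, 200, &[8, 100, 400]));
}
```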
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?