Add support for filter selectivity for more expression types #14237

ch-sc · 2025-01-22T15:23:20Z

Is your feature request related to a problem or challenge?

Today, statistics of filter predicates are based on interval arithmetic invoked by PhysicalExec::evaluate_bounds(). This works fine for numerical data. However, many expressions and datatypes are not supported by interval arithmetics and therefore proper selectivity prediction is not supported for such expressions.

I noticed there were lots of discussions regarding statistics in the project lately. Work by folks from Synnada and others is currently in progress. If you feel like this issue is already addressed please let me know. I'd like to offer help with open tasks then.

Describe the solution you'd like

Add support for some missing stuff in interval arithmetics, i.e., temporal data.
Add PhysicalExpr::evaluate_statistics() to calculate expression level statistic. This was already proposed by others.

My suggestion is the following signature:

fn evaluate_statistics(&self, input_statistics: &Statistics) -> Result<ExpressionStatistics>

I think this should return a new statistics struct on expression level which could look like this:

pub struct ExpressionStatistics {
    /// Number of null values
    pub null_count: Precision<usize>,
    /// number of output rows (cardinality)
    pub num_rows: Precision<ScalarValue>,
    /// total number of input rows 
    pub total_rows: Precision<ScalarValue>,
    /// Number of distinct values
    pub distinct_count: Precision<usize>,
}

With evaluate_statistics() we add support for filter expressions such as string comparisons, InList, LikeExpr, or binary operators like IS_DISTINCT_FROM, IS_NOT_DISTINCT_FROM. It may be an iterative approach where we start with a few expression types and take it from there.

Selectivity calculation is trivial: num_rows/total_rows.

We can utilise evaluate_bounds() for supported expressions. For example, from 2*A > B we get its target boundaries and calculate the selectivity as is done in analysis::calculate_selectivity().

fn calculate_selectivity(
    target_boundaries: &[ExprBoundaries],
    initial_boundaries: &[ExprBoundaries],
) -> f64 {
    // Since the intervals are assumed uniform and the values
    // are not correlated, we need to multiply the selectivities
    // of multiple columns to get the overall selectivity.
    initial_boundaries
        .iter()
        .zip(target_boundaries.iter())
        .fold(1.0, |acc, (initial, target)| {
            acc * cardinality_ratio(&initial.interval, &target.interval)
        })
}

This naive approach assumes uni-distributed data. Heuristics, like various distribution types, could be added to ExpressionStatisticsa too. For the sake of simplicity I will not address this here.

Happy to receive some feedback 🙂

Describe alternatives you've considered

No response

Additional context

Short disclaimer: I work for Coralogix like some other datafusion contributors.

cc: @thinkharderdev

The text was updated successfully, but these errors were encountered:

berkaysynnada · 2025-01-23T10:04:06Z

Hi @ch-sc. I try to address your solution suggestions:

Add support for some missing stuff in interval arithmetics, i.e., temporal data.

I highly recommend completing support for all common and applicable data types in interval arithmetic. This would resolve many optimization challenges. You, or anyone interested, can work further on adding this support, and I will certainly assist as much as I can.

Add PhysicalExpr::evaluate_statistics() to calculate expression level statistic. This was already proposed by others.

With our new tools for statistics, implementing such an API will be straightforward since we will handle all types of expression evaluations (that's the hardest part). I believe we can have this ready by next week. Of course, we welcome any improvements and additions to it.
In short, the evaluation mechanism you’re looking for is under way, and we can continue on the discussion after seeing its ready state.

ch-sc added the enhancement New feature or request label Jan 22, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for filter selectivity for more expression types #14237

Add support for filter selectivity for more expression types #14237

ch-sc commented Jan 22, 2025

berkaysynnada commented Jan 23, 2025

Add support for filter selectivity for more expression types #14237

Add support for filter selectivity for more expression types #14237

Comments

ch-sc commented Jan 22, 2025

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

berkaysynnada commented Jan 23, 2025