Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for filter selectivity for more expression types #14237

Open
ch-sc opened this issue Jan 22, 2025 · 1 comment
Open

Add support for filter selectivity for more expression types #14237

ch-sc opened this issue Jan 22, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@ch-sc
Copy link
Contributor

ch-sc commented Jan 22, 2025

Is your feature request related to a problem or challenge?

Today, statistics of filter predicates are based on interval arithmetic invoked by PhysicalExec::evaluate_bounds(). This works fine for numerical data. However, many expressions and datatypes are not supported by interval arithmetics and therefore proper selectivity prediction is not supported for such expressions.

I noticed there were lots of discussions regarding statistics in the project lately. Work by folks from Synnada and others is currently in progress. If you feel like this issue is already addressed please let me know. I'd like to offer help with open tasks then.

Describe the solution you'd like

  1. Add support for some missing stuff in interval arithmetics, i.e., temporal data.

  2. Add PhysicalExpr::evaluate_statistics() to calculate expression level statistic. This was already proposed by others.

My suggestion is the following signature:

fn evaluate_statistics(&self, input_statistics: &Statistics) -> Result<ExpressionStatistics>

I think this should return a new statistics struct on expression level which could look like this:

pub struct ExpressionStatistics {
    /// Number of null values
    pub null_count: Precision<usize>,
    /// number of output rows (cardinality)
    pub num_rows: Precision<ScalarValue>,
    /// total number of input rows 
    pub total_rows: Precision<ScalarValue>,
    /// Number of distinct values
    pub distinct_count: Precision<usize>,
}

With evaluate_statistics() we add support for filter expressions such as string comparisons, InList, LikeExpr, or binary operators like IS_DISTINCT_FROM, IS_NOT_DISTINCT_FROM. It may be an iterative approach where we start with a few expression types and take it from there.

Selectivity calculation is trivial: num_rows/total_rows.

We can utilise evaluate_bounds() for supported expressions. For example, from 2*A > B we get its target boundaries and calculate the selectivity as is done in analysis::calculate_selectivity().

fn calculate_selectivity(
    target_boundaries: &[ExprBoundaries],
    initial_boundaries: &[ExprBoundaries],
) -> f64 {
    // Since the intervals are assumed uniform and the values
    // are not correlated, we need to multiply the selectivities
    // of multiple columns to get the overall selectivity.
    initial_boundaries
        .iter()
        .zip(target_boundaries.iter())
        .fold(1.0, |acc, (initial, target)| {
            acc * cardinality_ratio(&initial.interval, &target.interval)
        })
}

This naive approach assumes uni-distributed data. Heuristics, like various distribution types, could be added to ExpressionStatisticsa too. For the sake of simplicity I will not address this here.

Happy to receive some feedback 🙂

Describe alternatives you've considered

No response

Additional context

Short disclaimer: I work for Coralogix like some other datafusion contributors.

cc: @thinkharderdev

@ch-sc ch-sc added the enhancement New feature or request label Jan 22, 2025
@berkaysynnada
Copy link
Contributor

Hi @ch-sc. I try to address your solution suggestions:

  1. Add support for some missing stuff in interval arithmetics, i.e., temporal data.

I highly recommend completing support for all common and applicable data types in interval arithmetic. This would resolve many optimization challenges. You, or anyone interested, can work further on adding this support, and I will certainly assist as much as I can.

  1. Add PhysicalExpr::evaluate_statistics() to calculate expression level statistic. This was already proposed by others.

With our new tools for statistics, implementing such an API will be straightforward since we will handle all types of expression evaluations (that's the hardest part). I believe we can have this ready by next week. Of course, we welcome any improvements and additions to it.
In short, the evaluation mechanism you’re looking for is under way, and we can continue on the discussion after seeing its ready state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants