Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support "A column is known to be entirely NULL" in PruningPredicate #9171

Closed
alamb opened this issue Feb 9, 2024 · 14 comments
Closed

Support "A column is known to be entirely NULL" in PruningPredicate #9171

alamb opened this issue Feb 9, 2024 · 14 comments
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Feb 9, 2024

Is your feature request related to a problem or challenge?

This is broken out from #7869 which is describing a slightly different problem

PruningPredicate can't be told about columns that are known to contain only NULL. It can be told which columns have no nulls (via the PruningStatistics::null_counts()).

Columns that contain only NULL occur in tables that have "schema evolution" -- for example if you have two files such as

File 1: col_a
File 2: col_a, col_b (col_b was added later)

A predicate like col_a != A AND col_b='bananas' can not be true for File 1 (as col_B is logically NULL for all rows)

This is subtly, but importantly different than the case when nothing is known about the column, which confusingly is encoded by returning NULL from PruningStatistics::min_values()

Describe the solution you'd like

  1. Add a new method PruningStatistics::row_counts() to get the total row counts in each container.
  2. Use the information from PruningStatistics::row_counts() and PruningStatistics::null_counts() to determine containers where columns are entirely NULL
  3. Rewrite the predicate, replacing references to columns known to be NULL with a NULL literal and try to simplify the expressions (e.g. a = 5 --> NULL = 5 --> NULL)

For the example in this ticket's description with predicate col_a != A AND col_b='bananas' where col_b is not known and the relevant container had 100 rows,

  1. the relevant PruningStatistics would return col_b: {null_count = 100, row_count = 100}
  2. PruningPredicate::prune would determine col_b was entirely null, and would rewrite the predicate to be col_a != A AND NULL = 'bananas'.
  3. The pruning rewrite would happen again, and this time would not try to fetch min/max statistics for col_b and thus could be proven to be not true.

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Feb 9, 2024
@alamb alamb changed the title Support Support "A column is known to be entirely NULL" in PruningPredicate Feb 9, 2024
@alamb
Copy link
Contributor Author

alamb commented Feb 9, 2024

cc @appletreeisyellow

@appletreeisyellow
Copy link
Contributor

Thank you @alamb!

One question for this line:

File 2: col_b, col_B (col_b was added later)

Do you actually mean:
File 2: col_a, col_b (col_b was added later)

@alamb
Copy link
Contributor Author

alamb commented Feb 9, 2024

Do you actually mean:

Yes, I have update the ticket. Thank you

@appletreeisyellow
Copy link
Contributor

appletreeisyellow commented Feb 9, 2024

Thanks for verifying. I plan to work on this issue

@alamb
Copy link
Contributor Author

alamb commented Feb 11, 2024

One thought I had about this feature was to consider writing predicates like

y = 10

Instead of

y_min <= 10 AND y_max >=10

To something like

CASE 
  WHEN y_null_count == y_row_count THEN false
  ELSE y_min <= 10 AND y_max >=10
END

where y_row_count is a new column based on the value of PruningStatistics::row_counts

I think this would only be valid for top level expressions in the AND clause (not any arbitrary sub expression)

@alamb
Copy link
Contributor Author

alamb commented Feb 11, 2024

It might also now be a good time to look into how we could rewrite pruning predicate to use the range analysis in https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html 🤔

That would be the more elegant (but likely much more substantial) solution in my opinion

@appletreeisyellow
Copy link
Contributor

I think this would only be valid for top level expressions in the AND clause (not any arbitrary sub expression)

Do you mean something like this?

For example, rewriting a complicated predicate like x < 5 AND x > 0

instead of

x_max < 5 AND 0 < x_min

To something like (a)

CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_max < 5 AND 0 < x_min
END

I don't think we want it to be (b)

CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_max < 5
END
AND
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE 0 < x_min
END

@appletreeisyellow
Copy link
Contributor

appletreeisyellow commented Feb 13, 2024

I think case expression can be wrapped around the OR clause as well. For example, x = 8 OR x < 0 can be rewrite into

CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE (x_min <= 8 AND 8 <= x_max) OR x_max < 0
END

If we know that a container has column x to be entirely NULL, this container can be skipped, regardless the result of OR clause is true or false.

@appletreeisyellow
Copy link
Contributor

For a predicate with more than one columns, we might want to have a WHEN clause for each column.

For example, rewriting a predicate like x < 5 AND x > 0 AND y = 10 into

CASE
  WHEN x_null_count = x_row_count THEN false
  WHEN y_null_count = y_row_count THEN false
  ELSE <rewrite for x < 5 AND x > 0 AND y = 10>
END

@appletreeisyellow
Copy link
Contributor

It might also now be a good time to look into how we could rewrite pruning predicate to use the range analysis in docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html 🤔

That would be the more elegant (but likely much more substantial) solution in my opinion

I'm interested in rewriting pruning predicate to use the range analysis! For now, I'm still writing in PhysicalExpr just to see how the code works. It would be nice to refactor into range analysis after I sort out basic logics

@appletreeisyellow
Copy link
Contributor

appletreeisyellow commented Feb 14, 2024

After discussion with @alamb, we plan to do the implementation in two phases (i.e. two PRs):

  1. Turn each sub expression into a case expression
  2. Simplify the case expression and make it easy to read

1. Turn each sub expression into a case expression

Each sub expression will be rewritten into a case expression instead of wrapping the entire expression into one case expression. Each sub expression has its own case expression will make sure the pruning predict rewrite logic is correct.

For example, x < 5 AND x > 0 OR y = 10

will be rewritten into

# x < 5
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE x_max < 5 
END
AND
#  x > 0
CASE
  WHEN x_null_count = x_row_count THEN false
  ELSE 0 < x_min
END
OR
# y = 10
CASE
  WHEN y_null_count = y_row_count THEN false
  ELSE y_min <= 10 AND 10 <= y_max
END

2. Simplify the case expression and make it easy to read

The above example is formatted in a way that's easy to read. In the actual pruning predicate string, there is no new lines and indentation. The above example looks like this in the query explain:

CASE WHEN x_null_count = x_row_count THEN false ELSE x_max < 5 END AND CASE WHEN x_null_count = x_row_count THEN false ELSE 0 < x_min END OR CASE WHEN y_null_count = y_row_count THEN false ELSE y_min <= 10 AND 10 <= y_max END

As you can see, the final pruning predict rewrite can be long and hard to read. Therefore, we need phase 2 to improve the readability.

Probably add format, like () and new lines to the expression string. I will have a better idea after phase 1 PR is done

@appletreeisyellow
Copy link
Contributor

There is a draft PR: #9223. The major feature is implemented, but there are two things I plan to do before I open it for review (see details).

I plan to continue the work on the week of March 4, 2024.

@appletreeisyellow
Copy link
Contributor

appletreeisyellow commented Mar 14, 2024

Here is an update of the plan

1. Turn each sub expression into a case expression

Implemented in #9223

2. Simplify the case expression and make it easy to read

To be implemented in a follow-up PR:

@alamb
Copy link
Contributor Author

alamb commented Mar 22, 2024

I think this feature is done now that #9223 (comment) is merged so resolving this ticket

@appletreeisyellow if you plan additional work, let's track them in follow on tickets

@alamb alamb closed this as completed Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants