Add `NULL` in list simplifications #8688

alamb · 2023-12-30T14:01:25Z

Is your feature request related to a problem or challenge?

SELECT .. WHERE NULL IN (1,2,3) and SELECT ... WHERE x in (NULL, 2, 3) are always NULL (and thus will filter out all rows). However, DataFusion will still try and evaluate a predicate:

DataFusion CLI v34.0.0
❯ create table t(x int) as values (1), (2), (3);
0 rows in set. Query took 0.003 seconds.

Note there is a FilterExec with a non trivial expression in both of the following queries:

❯ explain select x from t where x IN (null, 2, 3);
+---------------+---------------------------------------------------------------+
| plan_type     | plan                                                          |
+---------------+---------------------------------------------------------------+
| logical_plan  | Filter: t.x = Int32(NULL) OR t.x = Int32(2) OR t.x = Int32(3) |
|               |   TableScan: t projection=[x]                                 |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                   |
|               |   FilterExec: x@0 = NULL OR x@0 = 2 OR x@0 = 3                |
|               |     MemoryExec: partitions=1, partition_sizes=[1]             |
|               |                                                               |
+---------------+---------------------------------------------------------------+
2 rows in set. Query took 0.002 seconds.

❯ explain select x from t where null IN (x, 2, 3);
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: Int64(NULL) IN ([CAST(t.x AS Int64), Int64(2), Int64(3)])                                                                                                                                                                                                                                                                                                                                   |
|               |   TableScan: t projection=[x]                                                                                                                                                                                                                                                                                                                                                                       |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                                                                                         |
|               |   FilterExec: NULL IN ([CastExpr { expr: Column { name: "x", index: 0 }, cast_type: Int64, cast_options: CastOptions { safe: false, format_options: FormatOptions { safe: true, null: "", date_format: None, datetime_format: None, timestamp_format: None, timestamp_tz_format: None, time_format: None, duration_format: Pretty } } }, Literal { value: Int64(2) }, Literal { value: Int64(3) }]) |
|               |     MemoryExec: partitions=1, partition_sizes=[1]                                                                                                                                                                                                                                                                                                                                                   |
|               |                                                                                                                                                                                                                                                                                                                                                                                                     |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.004 seconds.

In both cases the predicate could have been reduced to a single NULL

Describe the solution you'd like

I would like to extend the ExprSimplifier rules to handle the case of NULL IN (...) and when the InList contains NULL

Here are some similar rules
https://github.com/apache/arrow-datafusion/blob/cc3042a6343457036770267f921bb3b6e726956c/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L474-L549

Describe alternatives you've considered

No response

Additional context

These types of expressions are sometimes generated programmatically during rewrites in IOx.

It also came up with discussions with @yahoNanJing on #8669

I think this would be a good first issue as the patterns exist already and the need is well defined.

The text was updated successfully, but these errors were encountered:

asimsedhain · 2023-12-30T21:56:20Z

I would like to take a look at this.

If my understanding is correct, the first case i.e. SELECT .. WHERE NULL IN (…) would simplify to SELECT .. WHERE False.

As for the second case, SELECT .. WHERE x IN (NULL, …) would be simplified to SELECT .. WHERE x IN (…) as it could still return rows for the other exprs in the list.

apacheGH-8688

alamb · 2023-12-31T12:06:29Z

I would like to take a look at this.

Thank you !

If my understanding is correct, the first case i.e. SELECT .. WHERE NULL IN (…) would simplify to SELECT .. WHERE False.

I think it actually would simplify to SELECT .. WHERE null (which is technically different than WHERE false but will have the same effect of returning no rows). This is explained a little more in #8689

Using datafusion-cli, you can check it:

❯ create table t (x int) as values (1), (2);
0 rows in set. Query took 0.007 seconds.

❯ select NULL IN (1, 2);
+----------------------------------------------------------------------+
| NULL IN (Map { iter: Iter([Literal(Int64(1)), Literal(Int64(2))]) }) |
+----------------------------------------------------------------------+
|                                                                      |
+----------------------------------------------------------------------+
1 row in set. Query took 0.002 seconds.

As for the second case, SELECT .. WHERE x IN (NULL, …) would be simplified to SELECT .. WHERE x IN (…) as it could still return rows for the other exprs in the list.

I actually tried this out and it seems like this is NOT a correct simplification (sorry for my confusion). So maybe we can't simplify such select lists 🤔

DataFusion CLI v34.0.0
❯ create table t (x int) as values (1), (2);
0 rows in set. Query took 0.006 seconds.

❯ select x IN (NULL, 2) from t;
+-----------------------------------------------------------------+
| t.x IN (Map { iter: Iter([Literal(NULL), Literal(Int64(2))]) }) |
+-----------------------------------------------------------------+
|                                                                 |
| true                                                            |
+-----------------------------------------------------------------+
2 rows in set. Query took 0.002 seconds.

❯ select x IN (2) from t;
+--------------------------------------------------+
| t.x IN (Map { iter: Iter([Literal(Int64(2))]) }) |
+--------------------------------------------------+
| false                                            |
| true                                             |
+--------------------------------------------------+
2 rows in set. Query took 0.001 seconds.

❯

Also, the same in postgres:

postgres=# create table t (x int);
CREATE TABLE
postgres=# insert into t values (1), (2);
INSERT 0 2

postgres=# select x in (NULL, 2) from t;
 ?column?
----------

 t
(2 rows)

postgres=# select x in (2) from t;
 ?column?
----------
 f
 t
(2 rows)

apacheGH-8688

asimsedhain · 2023-12-31T21:14:08Z

I see how they are different for the first case. Thank you for explaining.

As for the second case, I agree we might not be able to simplify the list. Any expr in (null, e1, e2...) will be equivalent to expr=null or expr=e1 or .... We cannot really convert the expresion into null without evaluating expr=e1.

Updated the PR to only address the first case.

GH-8688

Ted-Jiang · 2024-01-12T03:26:27Z

I would like to take a look at this.

Thank you !

If my understanding is correct, the first case i.e. SELECT .. WHERE NULL IN (…) would simplify to SELECT .. WHERE False.

I think it actually would simplify to SELECT .. WHERE null (which is technically different than WHERE false but will have the same effect of returning no rows). This is explained a little more in #8689

Using datafusion-cli, you can check it:

❯ create table t (x int) as values (1), (2);
0 rows in set. Query took 0.007 seconds.

❯ select NULL IN (1, 2);
+----------------------------------------------------------------------+
| NULL IN (Map { iter: Iter([Literal(Int64(1)), Literal(Int64(2))]) }) |
+----------------------------------------------------------------------+
|                                                                      |
+----------------------------------------------------------------------+
1 row in set. Query took 0.002 seconds.

As for the second case, SELECT .. WHERE x IN (NULL, …) would be simplified to SELECT .. WHERE x IN (…) as it could still return rows for the other exprs in the list.

I actually tried this out and it seems like this is NOT a correct simplification (sorry for my confusion). So maybe we can't simplify such select lists 🤔

DataFusion CLI v34.0.0
❯ create table t (x int) as values (1), (2);
0 rows in set. Query took 0.006 seconds.

❯ select x IN (NULL, 2) from t;
+-----------------------------------------------------------------+
| t.x IN (Map { iter: Iter([Literal(NULL), Literal(Int64(2))]) }) |
+-----------------------------------------------------------------+
|                                                                 |
| true                                                            |
+-----------------------------------------------------------------+
2 rows in set. Query took 0.002 seconds.

❯ select x IN (2) from t;
+--------------------------------------------------+
| t.x IN (Map { iter: Iter([Literal(Int64(2))]) }) |
+--------------------------------------------------+
| false                                            |
| true                                             |
+--------------------------------------------------+
2 rows in set. Query took 0.001 seconds.

❯

Also, the same in postgres:

postgres=# create table t (x int);
CREATE TABLE
postgres=# insert into t values (1), (2);
INSERT 0 2

postgres=# select x in (NULL, 2) from t;
 ?column?
----------

 t
(2 rows)

postgres=# select x in (2) from t;
 ?column?
----------
 f
 t
(2 rows)

Thanks for explain this! 👍

alamb added enhancement New feature or request good first issue Good for newcomers labels Dec 30, 2023

alamb changed the title ~~Add NULL in list simplifcation~~ Add NULL in list simplifications Dec 30, 2023

alamb mentioned this issue Dec 30, 2023

Implement the contained method of RowGroupPruningStatistics #8669

Closed

asimsedhain added a commit to asimsedhain/arrow-datafusion that referenced this issue Dec 31, 2023

feat: simplify null in list

7fcfb91

apacheGH-8688

asimsedhain added a commit to asimsedhain/arrow-datafusion that referenced this issue Dec 31, 2023

feat: remove nulls from in list

665ca39

apacheGH-8688

asimsedhain mentioned this issue Dec 31, 2023

Simplify NULL [NOT] IN (..) expressions #8691

Merged

asimsedhain added a commit to asimsedhain/arrow-datafusion that referenced this issue Dec 31, 2023

feat: simplify null in list

f36aa5f

apacheGH-8688

asimsedhain added a commit to asimsedhain/arrow-datafusion that referenced this issue Dec 31, 2023

feat: simplify null in list

a894f70

apacheGH-8688

alamb closed this as completed in #8691 Jan 1, 2024

alamb pushed a commit that referenced this issue Jan 1, 2024

feat: simplify null in list (#8691)

e82707e

GH-8688

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `NULL` in list simplifications #8688

Add `NULL` in list simplifications #8688

alamb commented Dec 30, 2023

asimsedhain commented Dec 30, 2023

alamb commented Dec 31, 2023 •

edited

Loading

asimsedhain commented Dec 31, 2023

Ted-Jiang commented Jan 12, 2024

Add NULL in list simplifications #8688

Add NULL in list simplifications #8688

Comments

alamb commented Dec 30, 2023

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

asimsedhain commented Dec 30, 2023

alamb commented Dec 31, 2023 • edited Loading

asimsedhain commented Dec 31, 2023

Ted-Jiang commented Jan 12, 2024

Add `NULL` in list simplifications #8688

Add `NULL` in list simplifications #8688

alamb commented Dec 31, 2023 •

edited

Loading