You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A filter like WHERE i IN (1,2,3,4,5) can be pushed down to a table from a TableProvider at the TableProvider's discretion. However, an IN subquery like WHERE i IN (SELECT i FROM ....) cannot be pushed down. There are many situtations in which the complexity of doing so could be a sublinear function of the size of the external table, compared to streaming the entire table and performing the filter within DataFusion's physical execution. For example, the table could be a hive partitioned parquet dataset where i is one of the partition columns. So, it could be advantageous to ask the TableProvider whether it would like an InSubquery with a given set of columns to be converted into an InList that could be pushed down.
Of course, this advantage could be drowned out when the number of rows that will be returned by the subquery is sufficiently large. So, the implementation would perhaps need an escape hatch if the subquery ended up returning a number of rows in excess of some threshold. I imagine the costs of both of the feature and escape hatch in DataFusion would be:
limited parallelization as the subquery would need to get run before being pushed down to the external table
opportunity cost when it turns out that the subquery exceeds the threshold and the pushdown attempt is abandoned
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
-
A filter like
WHERE i IN (1,2,3,4,5)
can be pushed down to a table from a TableProvider at the TableProvider's discretion. However, an IN subquery likeWHERE i IN (SELECT i FROM ....)
cannot be pushed down. There are many situtations in which the complexity of doing so could be a sublinear function of the size of the external table, compared to streaming the entire table and performing the filter within DataFusion's physical execution. For example, the table could be a hive partitioned parquet dataset wherei
is one of the partition columns. So, it could be advantageous to ask the TableProvider whether it would like anInSubquery
with a given set of columns to be converted into anInList
that could be pushed down.Of course, this advantage could be drowned out when the number of rows that will be returned by the subquery is sufficiently large. So, the implementation would perhaps need an escape hatch if the subquery ended up returning a number of rows in excess of some threshold. I imagine the costs of both of the feature and escape hatch in DataFusion would be:
Beta Was this translation helpful? Give feedback.
All reactions