Closing this out, since there's #7748! Thanks!
I'd like to point out a subtle yet common error that can result in unexpected Cartesian products in SQL queries when using Ibis.
Consider the following two Python functions.
def get_feat_df(df_clusters: ibis.expr.types.TableExpr, user_clusters: list, id_col: str) -> ibis.expr.types.TableExpr:
def get_feat_df2(df_clusters: ibis.expr.types.TableExpr, user_clusters: list, id_col: str) -> ibis.expr.types.TableExpr:
They differ in the order of operations. One applies the filter first and only then builds the list of case statements. The other builds the case statement list and the list needed for the filter side by side, and afterwards chains a filter call followed by the application of the case statements.
Here are the SQL statements generated by these two methods.
first:
second:
Notably, the second method generates SQL that unexpectedly contains a Cartesian product. The reason is that its case statements are built from the original table expression, not from the filtered one, so the query ends up referencing two different relations. In a chained call, the case statements would be built from the filtered expression and the problem would not occur. This behavior differs from what users may be accustomed to in PySpark.
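To make the effect concrete, here is a minimal sketch at the SQL level, using Python's built-in sqlite3 rather than Ibis. The table name `clusters`, its columns, and the sample rows are hypothetical, and the two queries are hand-written illustrations of the pattern described above, not the SQL that Ibis actually emitted. The point is only that a CASE expression reading from a second, unjoined copy of the table multiplies the row count.

```python
import sqlite3

# Hypothetical sample data: four rows, three of which survive the filter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clusters (id INTEGER, cluster TEXT)")
conn.executemany(
    "INSERT INTO clusters VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c"), (4, "a")],
)

user_clusters = ["a", "b"]
placeholders = ", ".join("?" for _ in user_clusters)

# Chained pattern: the CASE expression reads from the filtered relation t0.
chained_sql = f"""
SELECT t0.id,
       CASE WHEN t0.cluster = 'a' THEN 1 ELSE 0 END AS is_a
FROM (SELECT * FROM clusters WHERE cluster IN ({placeholders})) AS t0
"""
chained_rows = conn.execute(chained_sql, user_clusters).fetchall()

# Mixed pattern: the CASE expression reads from the original table t1,
# which enters the FROM clause as a second relation with no join condition.
mixed_sql = f"""
SELECT t0.id,
       CASE WHEN t1.cluster = 'a' THEN 1 ELSE 0 END AS is_a
FROM (SELECT * FROM clusters WHERE cluster IN ({placeholders})) AS t0,
     clusters AS t1
"""
mixed_rows = conn.execute(mixed_sql, user_clusters).fetchall()

print(len(chained_rows))  # 3 filtered rows
print(len(mixed_rows))    # 3 filtered rows x 4 original rows = 12
```

In Ibis terms, the safe habit is to keep a reference to the filtered expression and build every later column expression from that reference rather than from the original table.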