A Common Pitfall Leading to Unexpected Cartesian Products #7747

stereoF · 2023-12-14T01:34:55Z

I'd like to point out a subtle yet common error that can result in unexpected Cartesian products in SQL queries when using Ibis.

Consider the following two Python functions.

import ibis

def get_feat_df(df_clusters: ibis.expr.types.TableExpr, user_clusters: list, id_col: str) -> ibis.expr.types.TableExpr:
    # Map each cluster's selectType to the column that stores its value.
    value_col_map = {
        "datetime": "tag_value_tm",
        "number": "tag_value_num",
        "string": "tag_value"
    }

    # Rebind df_clusters to the filtered table before building any expressions.
    col_names = [user_cluster['clusterName'] for user_cluster in user_clusters]
    df_clusters = df_clusters.filter(
        (df_clusters[id_col].notnull()) & (df_clusters['cluster_name'].isin(col_names))
    )

    # The case expressions below reference the filtered table.
    case_list = []
    for user_cluster in user_clusters:
        col_name = user_cluster['clusterName']
        select_type = user_cluster['selectType']
        value_col = value_col_map[select_type]

        case_expr = ibis.case()
        case_expr = case_expr.when(df_clusters['cluster_name'] == col_name, df_clusters[value_col])
        case_list.append(case_expr.end().max().name(col_name))

    df_feat = df_clusters.group_by([id_col, '$tag_date']).aggregate(case_list)
    return df_feat

def get_feat_df2(df_clusters: ibis.expr.types.TableExpr, user_clusters: list, id_col: str) -> ibis.expr.types.TableExpr:
    # Map each cluster's selectType to the column that stores its value.
    value_col_map = {
        "datetime": "tag_value_tm",
        "number": "tag_value_num",
        "string": "tag_value"
    }

    # The case expressions below reference the original, unfiltered table.
    case_list = []
    col_names = []
    for user_cluster in user_clusters:
        col_name = user_cluster['clusterName']
        col_names.append(col_name)
        select_type = user_cluster['selectType']
        value_col = value_col_map[select_type]

        case_expr = ibis.case()
        case_expr = case_expr.when(df_clusters['cluster_name'] == col_name, df_clusters[value_col])
        case_list.append(case_expr.end().max().name(col_name))

    # The filter is applied only here, in the chained call, so case_list still points at the unfiltered table.
    df_feat = df_clusters.filter(
        (df_clusters[id_col].notnull()) & (df_clusters['cluster_name'].isin(col_names))
    ).group_by([id_col, '$tag_date']).aggregate(case_list)
    return df_feat

The key difference is the order of operations. The first function filters the table, reassigns the result to df_clusters, and only then builds the list of case expressions, so those expressions are built from the filtered table. The second function builds the case expressions and the list of names for the filter in the same loop, against the original table, and afterwards applies filter, group_by, and aggregate in a single chained call.

Here are the SQL statements generated by these two methods.

First (get_feat_df):

SELECT
  t0."#varchar_id",
  t0."$tag_date",
  MAX(
    CASE
      WHEN (
        t0.cluster_name = 'algo_play_victory'
      )
      THEN t0.tag_value
      ELSE CAST(NULL AS VARCHAR)
    END
  ) AS algo_play_victory,
  MAX(
    CASE
      WHEN (
        t0.cluster_name = 'algo_play_erase_cnt'
      )
      THEN t0.tag_value_num
      ELSE CAST(NULL AS DOUBLE)
    END
  ) AS algo_play_erase_cnt,
  MAX(
    CASE
      WHEN (
        t0.cluster_name = 'algo_play_rocket_cnt'
      )
      THEN t0.tag_value_num
      ELSE CAST(NULL AS DOUBLE)
    END
  ) AS algo_play_rocket_cnt,
  MAX(
    CASE
      WHEN (
        t0.cluster_name = 'algo_play_bomb_cnt'
      )
      THEN t0.tag_value_num
      ELSE CAST(NULL AS DOUBLE)
    END
  ) AS algo_play_bomb_cnt,
  MAX(
    CASE
      WHEN (
        t0.cluster_name = 'algo_play_ifsuperlightball'
      )
      THEN t0.tag_value
      ELSE CAST(NULL AS VARCHAR)
    END
  ) AS algo_play_ifsuperlightball,
  MAX(
    CASE
      WHEN (
        t0.cluster_name = 'algo_play_ifmystery'
      )
      THEN t0.tag_value
      ELSE CAST(NULL AS VARCHAR)
    END
  ) AS algo_play_ifmystery,
  MAX(
    CASE
      WHEN (
        t0.cluster_name = 'algo_play_m'
      )
      THEN t0.tag_value_num
      ELSE CAST(NULL AS DOUBLE)
    END
  ) AS algo_play_m
FROM ta.history_tag_4 AS t0
WHERE
  t0."$tag_date" = 20231101
  AND NOT t0."#varchar_id" IS NULL
  AND t0.cluster_name IN ('algo_play_victory', 'algo_play_erase_cnt', 'algo_play_rocket_cnt', 'algo_play_bomb_cnt', 'algo_play_ifsuperlightball', 'algo_play_ifmystery', 'algo_play_m')
GROUP BY
  1,
  2

Second (get_feat_df2):

SELECT
  t0."#varchar_id",
  t0."$tag_date",
  MAX(
    CASE
      WHEN (
        t1.cluster_name = 'algo_play_victory'
      )
      THEN t1.tag_value
      ELSE CAST(NULL AS VARCHAR)
    END
  ) AS algo_play_victory,
  MAX(
    CASE
      WHEN (
        t1.cluster_name = 'algo_play_erase_cnt'
      )
      THEN t1.tag_value_num
      ELSE CAST(NULL AS DOUBLE)
    END
  ) AS algo_play_erase_cnt,
  MAX(
    CASE
      WHEN (
        t1.cluster_name = 'algo_play_rocket_cnt'
      )
      THEN t1.tag_value_num
      ELSE CAST(NULL AS DOUBLE)
    END
  ) AS algo_play_rocket_cnt,
  MAX(
    CASE
      WHEN (
        t1.cluster_name = 'algo_play_bomb_cnt'
      )
      THEN t1.tag_value_num
      ELSE CAST(NULL AS DOUBLE)
    END
  ) AS algo_play_bomb_cnt,
  MAX(
    CASE
      WHEN (
        t1.cluster_name = 'algo_play_ifsuperlightball'
      )
      THEN t1.tag_value
      ELSE CAST(NULL AS VARCHAR)
    END
  ) AS algo_play_ifsuperlightball,
  MAX(
    CASE
      WHEN (
        t1.cluster_name = 'algo_play_ifmystery'
      )
      THEN t1.tag_value
      ELSE CAST(NULL AS VARCHAR)
    END
  ) AS algo_play_ifmystery,
  MAX(
    CASE
      WHEN (
        t1.cluster_name = 'algo_play_m'
      )
      THEN t1.tag_value_num
      ELSE CAST(NULL AS DOUBLE)
    END
  ) AS algo_play_m
FROM (
  SELECT
    t1."#long_id" AS "#long_id",
    t1."#varchar_id" AS "#varchar_id",
    t1."#double_id" AS "#double_id",
    t1.tag_value AS tag_value,
    t1.tag_value_num AS tag_value_num,
    t1.tag_value_tm AS tag_value_tm,
    t1.tag_value_bool AS tag_value_bool,
    t1.tag_value_array_varchar AS tag_value_array_varchar,
    t1.cluster_name AS cluster_name,
    t1."$tag_date" AS "$tag_date"
  FROM ta.history_tag_4 AS t1
  WHERE
    t1."$tag_date" = 20231101
    AND NOT t1."#varchar_id" IS NULL
    AND t1.cluster_name IN ('algo_play_victory', 'algo_play_erase_cnt', 'algo_play_rocket_cnt', 'algo_play_bomb_cnt', 'algo_play_ifsuperlightball', 'algo_play_ifmystery', 'algo_play_m')
) AS t0, ta.history_tag_4 AS t1
GROUP BY
  1,
  2
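The comma in the FROM clause of the second query is an implicit cross join with no join condition. As a rough illustration of what that does to the result, here is a minimal sketch using Python's built-in sqlite3 module. The table history_tag, its three columns, and the sample rows are hypothetical stand-ins, not the real ta.history_tag_4 schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, heavily simplified stand-in for the tag table.
conn.execute(
    "CREATE TABLE history_tag (id TEXT, cluster_name TEXT, tag_value_num REAL)"
)
conn.executemany(
    "INSERT INTO history_tag VALUES (?, ?, ?)",
    [("a", "score", 1.0), ("b", "score", 5.0), ("b", "other", 9.0)],
)

# Shape of the first query: a single relation, so CASE reads the filtered rows.
good = conn.execute(
    """
    SELECT t0.id,
           MAX(CASE WHEN t0.cluster_name = 'score' THEN t0.tag_value_num END) AS score
    FROM history_tag AS t0
    WHERE t0.cluster_name IN ('score')
    GROUP BY 1
    ORDER BY 1
    """
).fetchall()

# Shape of the second query: the filtered subquery t0 is cross-joined with
# the unfiltered table t1, and CASE reads from t1.
bad = conn.execute(
    """
    SELECT t0.id,
           MAX(CASE WHEN t1.cluster_name = 'score' THEN t1.tag_value_num END) AS score,
           COUNT(*) AS joined_rows
    FROM (
        SELECT * FROM history_tag WHERE cluster_name IN ('score')
    ) AS t0, history_tag AS t1
    GROUP BY 1
    ORDER BY 1
    """
).fetchall()

print(good)  # [('a', 1.0), ('b', 5.0)]
print(bad)   # [('a', 5.0, 3), ('b', 5.0, 3)]
```

In this toy data every group in the cross-joined query sees all three rows of the base table, so each id reports the table-wide maximum instead of its own value. On a real table the row count of the join grows as (filtered rows) × (all rows).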

Notably, the second function generates SQL that contains an unexpected Cartesian product: the filtered subquery t0 is joined, without any join condition, to the unfiltered base table t1. The reason is that the case expressions were built from the original table, so they still refer to it rather than to the filtered table on which aggregate is called; chaining the calls does not rebind them. This differs from what users may be accustomed to in PySpark.

cpcloud (Maintainer) · 2023-12-18T14:09:34Z

Closing this out, since there's #7748!

Thanks!
