Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support expressions in on option for complex Joins #969

Open
lostbean opened this issue Aug 25, 2024 · 3 comments
Open

Support expressions in on option for complex Joins #969

lostbean opened this issue Aug 25, 2024 · 3 comments

Comments

@lostbean
Copy link

Description:

I would like to propose a new feature in Elixir Explorer that allows the on option in the join function to support more complex expressions. Currently, Explorer provides functionality for join operations with a limited ability to specify joins using simple equality
checks (e.g., on: [{"column1", "column2"}]). To enhance the flexibility of joining tables based on complex conditions that cannot be easily expressed with only column names, it would be beneficial to extend the join on clause to accept Explorer expressions.

Example Use Case:

Consider the following SQL query, where a complex condition is used in the ON clause to perform a left join:

SELECT
  AP.table1,
  AP.column1,
  AP.table2,
  AP.column2,
  CASE WHEN AR.has_relationship THEN 1 ELSE 0 END AS 'has_match' 
FROM
  ALL_PAIRS AS AP
LEFT JOIN
  ALL_RELATIONSHIPS AS AR
ON
  ((AR.table_name = AP.table1 AND AR.column_name = AP.column1) AND 
   (AR.referenced_table_name = AP.table2 AND AR.referenced_column_name = AP.column2)) OR
  ((AR.table_name = AP.table2 AND AR.column_name = AP.column2) AND 
   (AR.referenced_table_name = AP.table1 AND AR.referenced_column_name = AP.column1))

Currently, to achieve this behavior in Explorer, one has to perform two joins, as shown below:

result =
  all_pairs
  |> DataFrame.join(all_relationships_1, how: :left,
      on: [
        {"table_name_1", "table_name"}, 
        {"column_name_1", "column_name"}, 
        {"table_name_2", "referenced_table_name"}, 
        {"column_name_2", "referenced_column_name"}
      ])
  |> DataFrame.join(all_relationships_2, how: :left,
      on: [
        {"table_name_2", "table_name"}, 
        {"column_name_2", "column_name"}, 
        {"table_name_1", "referenced_table_name"}, 
        {"column_name_1", "referenced_column_name"}
      ])
  |> DataFrame.mutate(has_relationship: coalesce(has_relationship_1, has_relationship_2))

Proposed Enhancement:

I propose that the on option be enhanced to allow expressions, making it possible to perform complex joins more succinctly. Ideally, the code would look something like this:

result =
  all_pairs
  |> DataFrame.join(all_relationships_1, how: :left,
      on: ((table_name == table_name_1 and column_name = column_name_1) and 
           (referenced_table_name = table_name_2 and referenced_column_name = column_name_2)) or
          ((table_name = table_name_2 and column_name = column_name_2) and 
           (referenced_table_name = table_name_1 and referenced_column_name = column_name_1))
)
@josevalim
Copy link
Member

The best option is most likely to massage the data frame you are joining before hand, by creating the collapsed column, and then joining in that.

@cigrainger is this something you want to support?

@cigrainger
Copy link
Member

In many cases it is best to massage the dataframe, but this is a legit usecase and there's a lot to be said for expressions in the on arg. dplyr offers join_by. Generally, their joining game is much more advanced than ours and it's something we should be considering. I'm not a huge fan of where I ended up originally in the tuple list approach.

@ceyhunkerti
Copy link
Contributor

ceyhunkerti commented Sep 25, 2024

AFAIK, this is also not supported on polars python API and back in days when we needed it we were falling back one of the two options below:

  • Use polars-sql if it's a simple query
  • Polars sql wasn't 100% when we used it so if the sql is complex then we were 👉🏼 ;
    switching to duckdb and back to polars: df() (not a big deal cause there is a zero-copy intgration between the two. )

So in this case I think we can utilize and enhance the sql method we have in Explorer.DataFrame.

  • if I am interpreting it correctly we have the below implementation which only accepts single df and single table_name to register.
  • Given that it can be something like lf_sql (sql: string, registry: [(df, table_name), ...])
    Am I correct?
pub fn lf_sql(
    lf: ExLazyFrame,
    sql_string: &str,
    table_name: &str,
) -> Result<ExLazyFrame, ExplorerError> {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants