Add some DataFrame method(s) to combine two inputs where the schema can be different #12650

Open
alamb opened this issue Sep 27, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Sep 27, 2024

Is your feature request related to a problem or challenge?

@ion-elgreco asked in Discord

Does datafusion support a more relaxed Union where the schema can be in a different order? Akin to Polars.concat

The documentation for polars.concat

DataFrame::union requires the inputs to have the same schema

Describe the solution you'd like

No response

Describe alternatives you've considered

DataFrame::concat

Add a DataFrame::concat method that works like this:

  let df1 = ...; // DataFrame with schema {a: int, b: string}
  let df2 = ...; // DataFrame with schema {b: string, a: int}
  let df3 = df1.concat(df2); // DataFrame with schema {a: int, b: string}, all rows from df1 before df2

Implementing this might be somewhat complicated (as there is no existing LogicalPlan that could do this easily)

One way to implement this could be to add a fake column (perhaps __child_number) to df1 and df2 and have the plan be:

// Pseudocode: add_column, order_by, and project stand in for the corresponding DataFrame operations
let df1 = df1.add_column("__child_number", 1); // add new __child_number column
let df2 = df2.add_column("__child_number", 2); // add new __child_number column
let df3 = df1
  .union_with_reorder_schema(df2) // see below for union with reorder
  .order_by("__child_number")
  .project(...); // remove __child_number column
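
To make the tag, union, sort, and project steps concrete, here is a minimal sketch in plain Rust using only the standard library. It is not DataFusion code: `interleaving_union` and `concat` are hypothetical names, rows are plain values, and the round-robin union is only an assumption used to mimic a union that gives no ordering guarantee.

```rust
/// Simulates a union with no ordering guarantee by taking rows
/// from the inputs round-robin (an assumption made for illustration).
fn interleaving_union<T>(inputs: Vec<Vec<T>>) -> Vec<T> {
    let mut iters: Vec<_> = inputs.into_iter().map(|v| v.into_iter()).collect();
    let mut out = Vec::new();
    loop {
        let mut progressed = false;
        for it in iters.iter_mut() {
            if let Some(row) = it.next() {
                out.push(row);
                progressed = true;
            }
        }
        if !progressed {
            return out;
        }
    }
}

/// Concatenates the inputs so that every row of input 0 precedes every
/// row of input 1, and so on, regardless of how the union orders rows.
fn concat<T>(inputs: Vec<Vec<T>>) -> Vec<T> {
    // 1. Tag each row with the index of its input (the "__child_number" column).
    let tagged: Vec<Vec<(usize, T)>> = inputs
        .into_iter()
        .enumerate()
        .map(|(child, rows)| rows.into_iter().map(move |r| (child, r)).collect())
        .collect();
    // 2. Union the tagged inputs; the row order is now arbitrary across inputs.
    let mut unioned = interleaving_union(tagged);
    // 3. Sort by the tag. `sort_by_key` is stable, so rows of the same input
    //    keep their relative order.
    unioned.sort_by_key(|(child, _)| *child);
    // 4. Project the tag away again.
    unioned.into_iter().map(|(_, row)| row).collect()
}
```

Calling `concat(vec![vec!["a1", "a2"], vec!["b1"]])` yields the rows of the first input followed by those of the second. The sketch relies on a stable sort in step 3 to keep each input's internal order; whether a query engine's sort gives the same guarantee would need to be checked separately.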

DataFrame::union_with_reorder_schema

  let df1 = ...; // DataFrame with schema {a: int, b: string}
  let df2 = ...; // DataFrame with schema {b: string, a: int}
  let df3 = df1.union_with_reorder_schema(df2); // DataFrame with schema {a: int, b: string}, rows from df1 and df2 interleaved (like union)

This could be implemented with just a Projection that reorders the input schemas, followed by the existing Union.
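
As a rough illustration of the projection step, the sketch below computes, for each column of the first input, where that column sits in the second input, and then rearranges the second input's rows before appending them. It uses only the Rust standard library. The function names are hypothetical and not part of DataFusion, and column names are assumed to be unique within each schema.

```rust
/// For each column name in `target`, finds its index in `input`.
/// Returns `None` if the schemas differ in length or a name is missing.
/// Assumes column names are unique within each schema.
fn reorder_projection(target: &[&str], input: &[&str]) -> Option<Vec<usize>> {
    if target.len() != input.len() {
        return None;
    }
    target
        .iter()
        .map(|name| input.iter().position(|c| c == name))
        .collect()
}

/// Rearranges one row according to the projection indices.
fn project_row<T: Clone>(row: &[T], projection: &[usize]) -> Vec<T> {
    projection.iter().map(|&i| row[i].clone()).collect()
}

/// Unions two row sets whose schemas hold the same names in different orders.
/// The result uses the column order of the left input.
fn union_with_reorder<T: Clone>(
    left_schema: &[&str],
    left_rows: &[Vec<T>],
    right_schema: &[&str],
    right_rows: &[Vec<T>],
) -> Option<Vec<Vec<T>>> {
    let projection = reorder_projection(left_schema, right_schema)?;
    let mut out = left_rows.to_vec();
    out.extend(right_rows.iter().map(|r| project_row(r, &projection)));
    Some(out)
}
```

With left schema `["a", "b"]` and right schema `["b", "a"]`, the computed projection is `[1, 0]`, so each right-hand row is swapped into `a, b` order before the ordinary union.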

Change semantics of DataFrame::union to do reordering

Another thing we could do is to change the semantics of Union to do the reordering, but that may have unintended consequences

I don't think there is a DataFrame-level implementation of that functionality -- though I think it would be straightforward to add (the DataFrame could add a projection to the inputs to rearrange the column order to match)

Additional context

We should double-check with our DataFrame experts like @timsaucer and @Omega359 whether this is a reasonable API

@alamb added the enhancement (New feature or request) label Sep 27, 2024
@alamb changed the title from "Add some DataFrame::union method for two inputs where the schema can be different" to "Add some DataFrame method(s) to combine two inputs where the schema can be different" Sep 27, 2024
@timsaucer
Contributor

Are you thinking that this is limited to the case of the same schema in a different order, or should it also include the case of partially overlapping schemas? I ask because our internal toolkit does the latter. It would be nice to have, but not necessary.

As far as the API is concerned, keeping it basically as is would be my preferred approach and just updating the internals to reorder if necessary.

@austin362667
Contributor

cc @doupache

@ion-elgreco

> Are you thinking that this is limited to the case of the same schema in a different order, or should it also include the case of partially overlapping schemas? I ask because our internal toolkit does the latter. It would be nice to have, but not necessary.
>
> As far as the API is concerned, keeping it basically as is would be my preferred approach and just updating the internals to reorder if necessary.

I would argue that partially overlapping schemas should be supported, albeit with a different concat mode. That way we can evolve schemas easily.

@doupache
Contributor

Both Spark and DuckDB push this even further. They have UNION BY NAME, which has a lot of use cases. I think supporting this could be a first step.

@doupache
Contributor

take

@Omega359
Contributor

I think union_by_name is the core functionality desired here. Spark's method for it is unionByName(other: Dataset[T], allowMissingColumns: Boolean). I suggest this is the functionality that should be worked on for this.
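
For reference, the behaviour under discussion can be sketched in plain Rust with only the standard library. This is an illustration, not a proposed DataFusion API: `union_by_name` and `align_rows` are hypothetical names, values are `Option<i64>` with `None` standing in for NULL, and column names are assumed to be unique. The output schema is assumed to be the left columns followed by any columns found only on the right.

```rust
use std::collections::HashMap;

/// One row of nullable integer values; `None` stands in for NULL.
type Row = Vec<Option<i64>>;

/// Rearranges `rows` (laid out as `cols`) into the `out_cols` layout,
/// filling columns that the input does not have with `None`.
fn align_rows(out_cols: &[String], cols: &[&str], rows: &[Row]) -> Vec<Row> {
    let index: HashMap<&str, usize> =
        cols.iter().enumerate().map(|(i, c)| (*c, i)).collect();
    rows.iter()
        .map(|row| {
            out_cols
                .iter()
                .map(|name| index.get(name.as_str()).and_then(|&i| row[i]))
                .collect()
        })
        .collect()
}

/// Unions two inputs by column name. When `allow_missing_columns` is false,
/// both inputs must have the same set of column names.
fn union_by_name(
    left_cols: &[&str],
    left_rows: &[Row],
    right_cols: &[&str],
    right_rows: &[Row],
    allow_missing_columns: bool,
) -> Result<(Vec<String>, Vec<Row>), String> {
    // Output schema: left columns, then columns that appear only on the right.
    let mut out_cols: Vec<String> = left_cols.iter().map(|c| c.to_string()).collect();
    for c in right_cols {
        if !left_cols.contains(c) {
            out_cols.push(c.to_string());
        }
    }
    let same_names =
        out_cols.len() == left_cols.len() && out_cols.len() == right_cols.len();
    if !allow_missing_columns && !same_names {
        return Err("inputs have different column names".to_string());
    }
    let mut rows = align_rows(&out_cols, left_cols, left_rows);
    rows.extend(align_rows(&out_cols, right_cols, right_rows));
    Ok((out_cols, rows))
}
```

With left columns `a, b` and right columns `c, a`, allowing missing columns gives the schema `a, b, c`, with `None` where an input lacks a column. Disallowing them returns an error instead.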

@doupache
Contributor

take

6 participants