Add some DataFrame method(s) to combine two inputs where the schema can be different #12650

alamb · 2024-09-27T11:36:53Z

Is your feature request related to a problem or challenge?

@ion-elgreco asked in Discord

Does datafusion support a more relaxed Union where the schema can be in a different order? Akin to Polars.concat

The documentation for polars.concat

DataFrame::union requires the inputs to have the same schema

Describe the solution you'd like

No response

Describe alternatives you've considered

`Dataframe::concat`

Add a Dataframe::concat method that works like this

  let df1 = ... ; // DataFrame with schema {a: int, b: string}
  let df2 = ...; // DataFrame with schema {b: string, a: int}
  let df3 = df1.concat(df2); // Dataframe with schema {a: int, b:string}, all rows from df1 before df2

Implementing this might be somewhat complicated (as there is no existing LogicalPlan that could do this easily)

One way to implement this could be something like add a fake column (__child_number perhaps) to df1 and df2 and have the plan be

let df1 = df1.add_column('__child_number', 1); // add new __child_number column
let df2 = df2.add_column('__child_number', 2); // add new __child_number column
df3 = df1
  .union_with_reorder(df2) // see below for union with reorder
  .order_by('__child_number')
  .project(...) // remove __child_number column

`Dataframe::union_with_reorder_schema`

  let df1 = ... ; // DataFrame with schema {a: int, b: string}
  let df2 = ...; // DataFrame with schema {b: string, a: int}
  let df3 = df1.union_with_reorder_schema(df2); // Dataframe with schema {a: int, b:string}, rows from df1 and df2 interleaved (like union)

Could implement this with just a Projection that reordered the input schemas and then used existing Union

Change semantics of `DataFrame::union` to do reordering

Another thing we could do is to change the semantics of Union to do the reordering, but that may have unintended consequences

I don't think there is a dataframe level implementation of that functionality -- though I think it would be straightforward to add (the DataFrame could add a projection to the inputs to rearrange the column order ot match)

Additional context

We should double check with our dataframe exprts like @timsaucer and @Omega359 if this is a reasonable API

The text was updated successfully, but these errors were encountered:

timsaucer · 2024-09-27T17:15:22Z

Are you thinking that this is limited to case of the same schema with different order or should it also include the case of partially overlapping schema? I ask because our internal toolkit does the latter. It would be nice to have, but not necessary.

As far as the API is concerned, keeping it basically as is would be my preferred approach and just updating the internals to reorder if necessary.

austin362667 · 2024-09-27T17:15:57Z

cc @doupache

ion-elgreco · 2024-09-27T17:17:45Z

Are you thinking that this is limited to case of the same schema with different order or should it also include the case of partially overlapping schema? I ask because our internal toolkit does the latter. It would be nice to have, but not necessary.

As far as the API is concerned, keeping it basically as is would be my preferred approach and just updating the internals to reorder if necessary.

I would argue that partially overlapping should be possible albeit with a different concat mode. This way we can schema evolve easily

doupache · 2024-09-27T17:40:32Z

Both Spark and DuckDB push this even further. They have UNION BY NAME, which has a lot of use cases. I think supporting this could be a first step.

doupache · 2024-09-27T17:40:37Z

take

Omega359 · 2024-09-27T18:05:03Z

I think union_by_name is the core functionality desired here - here is spark's method for it unionByName(other: DataSet[T], allowMissingColumns: Boolean). I suggest this is the functionality that should be worked on for this.

doupache · 2024-09-28T04:26:26Z

take

alamb added the enhancement New feature or request label Sep 27, 2024

alamb changed the title ~~Add some DataFrame::union method for two inputs where the schema can be different~~ Add some DataFrame method(s) to combine two inputs where the schema can be different Sep 27, 2024

github-actions bot assigned doupache Sep 28, 2024

Omega359 mentioned this issue Sep 29, 2024

How to merge multiple data sources and deduplicate based on certain fields? #12532

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add some DataFrame method(s) to combine two inputs where the schema can be different #12650

Add some DataFrame method(s) to combine two inputs where the schema can be different #12650

alamb commented Sep 27, 2024

timsaucer commented Sep 27, 2024

austin362667 commented Sep 27, 2024

ion-elgreco commented Sep 27, 2024

doupache commented Sep 27, 2024

doupache commented Sep 27, 2024

Omega359 commented Sep 27, 2024

doupache commented Sep 28, 2024

Add some DataFrame method(s) to combine two inputs where the schema can be different #12650

Add some DataFrame method(s) to combine two inputs where the schema can be different #12650

Comments

alamb commented Sep 27, 2024

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Dataframe::concat

Dataframe::union_with_reorder_schema

Change semantics of DataFrame::union to do reordering

Additional context

timsaucer commented Sep 27, 2024

austin362667 commented Sep 27, 2024

ion-elgreco commented Sep 27, 2024

doupache commented Sep 27, 2024

doupache commented Sep 27, 2024

Omega359 commented Sep 27, 2024

doupache commented Sep 28, 2024

`Dataframe::concat`

`Dataframe::union_with_reorder_schema`

Change semantics of `DataFrame::union` to do reordering