Support for Delim Join/Get Relations #91

Closed
wants to merge 27 commits into from

@pdet pdet commented Jul 11, 2024

This PR adds support for generating and consuming query plans that contain subquery-flattening operators (i.e., delim joins and delim gets). It also now supports a full TPC-H round trip through DuckDB.

It's important to note that this is still highly experimental and depends on

Regarding the Substrait changes, we need to add a JOIN_TYPE_MARK to the JoinRel and add two new relations: DelimGetRel and DelimJoinRel.
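For readers unfamiliar with mark joins, the following is a minimal Python sketch of the idea as it is commonly described: every left row is emitted exactly once, together with a boolean "mark" that says whether any right row matched. The function name `mark_join` and the sample data are hypothetical, NULL handling is ignored, and this is not the definition used by DuckDB or by the proposed JOIN_TYPE_MARK.

```python
def mark_join(left_rows, right_rows, predicate):
    """Pair each left row with True if any right row matches, else False."""
    return [
        (left, any(predicate(left, right) for right in right_rows))
        for left in left_rows
    ]


# Hypothetical data: which order ids appear in the shipped list?
orders = [1, 2, 3]
shipped = [2, 3, 3]
marked = mark_join(orders, shipped, lambda l, r: l == r)
print(marked)
```

Unlike an inner join, the output has the same cardinality as the left input even when a left row matches several right rows, which is what makes the mark usable as a filter or projected boolean.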

DelimGetRel mainly has to store the types of delim columns.

e.g.,

message DelimGetRel {
  RelCommon common = 1;
  repeated Type delim_types = 2;
}

DelimJoinRel, on the other hand, has essentially the same attributes as a JoinRel. The difference is that it adds duplicate_eliminated_columns and delim_flipped, a flag that records whether the delim join has been flipped to de-duplicate the RHS instead of the LHS.

e.g.,

  // The set of columns that will be duplicate eliminated from the LHS and pushed into the RHS
  repeated Expression duplicate_eliminated_columns = 7;

  // If this is a DelimJoin, whether it has been flipped to de-duplicate the RHS instead
  bool delim_flipped = 8;
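To make the role of duplicate_eliminated_columns concrete, here is a small Python sketch of the general delim-join idea: the chosen LHS columns are de-duplicated, the distinct values are pushed into the RHS so the subquery runs once per distinct value, and the result is joined back to every LHS row. The names `delim_join`, `build_rhs`, and the sample tables are assumptions for illustration only; the unflipped (LHS) case is shown, and this is not the DuckDB implementation.

```python
def delim_join(lhs_rows, delim_columns, build_rhs):
    """Sketch of a duplicate-eliminated join over dict-shaped rows."""
    def key_of(row):
        return tuple(row[c] for c in delim_columns)

    # Step 1: duplicate-eliminate the chosen LHS columns (order preserved).
    distinct_keys = list(dict.fromkeys(key_of(row) for row in lhs_rows))
    # Step 2: push the distinct keys into the RHS; it runs once per key.
    rhs_by_key = {key: build_rhs(key) for key in distinct_keys}
    # Step 3: join the RHS result back onto every LHS row.
    joined = [{**row, "subquery": rhs_by_key[key_of(row)]} for row in lhs_rows]
    return joined, distinct_keys


# Hypothetical correlated subquery: look up a per-customer total.
lhs = [
    {"id": 1, "cust": "a"},
    {"id": 2, "cust": "b"},
    {"id": 3, "cust": "a"},
]
totals = {"a": 10, "b": 20}
calls = []  # records how often the RHS is evaluated


def build_rhs(key):
    calls.append(key)
    return totals[key[0]]


joined, keys = delim_join(lhs, ["cust"], build_rhs)
print(keys, len(calls))
```

The RHS is evaluated twice here rather than three times, because customer "a" appears twice on the LHS but only once after duplicate elimination.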

Besides all that, it was also necessary to add two join types, namely JOIN_TYPE_RIGHT_SEMI and JOIN_TYPE_RIGHT_ANTI.
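As a rough illustration of what right semi and right anti joins usually mean, the Python sketch below returns rows from the right input only: a right semi join keeps right rows that have at least one match on the left, and a right anti join keeps right rows that have none. The function names and data are hypothetical, NULL semantics are ignored, and the exact behavior of the proposed enum values is defined by the proto changes, not by this sketch.

```python
def right_semi_join(left_rows, right_rows, predicate):
    """Keep each right row that matches at least one left row."""
    return [r for r in right_rows if any(predicate(l, r) for l in left_rows)]


def right_anti_join(left_rows, right_rows, predicate):
    """Keep each right row that matches no left row."""
    return [r for r in right_rows if not any(predicate(l, r) for l in left_rows)]


left = [1, 2]
right = [2, 3, 2]
same = lambda l, r: l == r

semi = right_semi_join(left, right, same)
anti = right_anti_join(left, right, same)
print(semi, anti)
```

Together the two results partition the right input, and neither ever multiplies a right row when several left rows match it.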

cc @EpsilonPrime @ianmcook


pdet commented Aug 2, 2024

@EpsilonPrime, following up on our conversations, I managed to implement a version of the duplicate-eliminated operators that works with Reference Relations.

Basically, the Duplicate eliminated Get looks like this:

message DelimGetRel {
  RelCommon common = 1;
  ReferenceRel input = 2;
  repeated Expression.FieldReference column_ids = 3;
}

The input references the subtree that the duplicate-eliminated join de-duplicates, and column_ids identifies the columns of that input that are duplicate eliminated and returned. A reference relation is also used in the duplicate-eliminated join.
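One way to picture the reference-based DelimGetRel is the Python sketch below. It assumes that a reference resolves to a shared subtree by ordinal, modeled here as a plain list index, and that column_ids are simple column positions; both are simplifications for illustration. The dataclasses and `execute_delim_get` are hypothetical stand-ins, not generated protobuf classes or DuckDB code.

```python
from dataclasses import dataclass


@dataclass
class ReferenceRel:
    subtree_ordinal: int  # assumed: index into a shared list of subtrees


@dataclass
class DelimGetRel:
    input: ReferenceRel
    column_ids: list  # assumed: positions of the columns to return


def execute_delim_get(rel, shared_subtrees):
    """Resolve the reference, project column_ids, and drop duplicates."""
    rows = shared_subtrees[rel.input.subtree_ordinal]
    projected = (tuple(row[i] for i in rel.column_ids) for row in rows)
    return list(dict.fromkeys(projected))  # order-preserving de-duplication


# Hypothetical shared subtree output: (id, customer, amount) rows.
shared_subtrees = [[(1, "a", 10), (2, "b", 20), (3, "a", 30)]]
delim_get = DelimGetRel(input=ReferenceRel(subtree_ordinal=0), column_ids=[1])
distinct_customers = execute_delim_get(delim_get, shared_subtrees)
print(distinct_customers)
```

Because both the join and the get point at the same subtree through a reference, the plan does not need to duplicate that subtree or separately store the delim column types.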

The relevant proto changes are in pdet/substrait#1

cc: @ianmcook

@pdet pdet closed this Oct 29, 2024