Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions about exchangeKind of ExchangeRel #519

Open
whutjs opened this issue Jul 13, 2023 · 5 comments
Open

Questions about exchangeKind of ExchangeRel #519

whutjs opened this issue Jul 13, 2023 · 5 comments

Comments

@whutjs
Copy link
Contributor

whutjs commented Jul 13, 2023

I am trying to add support in Isthmus for converting Apache Calcite LogicalExchange, as what I described in substrait-io/substrait-java#153. But I run into some problems about how to map the types of exchange from Calcite to Substrait.

In Apache Calcite, there are 6 types of exchange:

  1. BROADCAST_DISTRIBUTED;
  2. HASH_DISTRIBUTED;
  3. RANDOM_DISTRIBUTED;
  4. RANGE_DISTRIBUTED;
  5. ROUND_ROBIN_DISTRIBUTED;
  6. SINGLETON;

In Substrait, there are 5 types of exchange:

  1. Scatter;
  2. Single Bucket;
  3. Multi Bucket;
  4. Broadcast;
  5. Round Robin;

So my questions are:

  • What are single and multi bucket exchanges in Substrait? What are the corresponding exchange types in Calcite?
  • Is hash distributed exchange in Calcite corresponding to the scatter exchange in Substrait?

Thanks in advance!

@jacques-n
Copy link
Contributor

I believe the mappings are (substrait => calcite)

single bucket => singleton
scatter => random distributed
multi-bucket with hash function => hash distributed
multi-bucket with range function => range distributed
round robin => round robin
broadcast => broadcast

@whutjs
Copy link
Contributor Author

whutjs commented Jul 14, 2023

@jacques-n thanks for the reply! But in https://substrait.io/relations/physical_relations/#exchange-operator, it says the scatter means

Distribute data using a system defined hashing function that considers one or more fields. For the same type of fields and same ordering of values, the same partition target should be identified for different ExchangeRels

So is't the scatter corresponding to hash distributed?
As for the multi-bucket in Substrait, it says:

The records should be sent to all bucket numbers provided by the expression.

If I understand it correctly, it seems that one record will be distributed to multi-partitions if we are using muti-bucket type.

@jacques-n
Copy link
Contributor

Yeah, I shouldn't have tried to reply from memory, sorry about that!

Ill write more up tomorrow. I'm not sure there is a multibucket mapping in Calcite. Its for things like skew join where you might broadcast a set of super common high cardinality keys to some nodes and you a different distribution pattern for less common keys.

Will work on writing something more intelligent up tomorrow. Again, sorry about the misdirection!

@whutjs
Copy link
Contributor Author

whutjs commented Jul 20, 2023

@jacques-n hello, May I ask is there any progress on this matter?

@jacques-n
Copy link
Contributor

Hey, sorry about slow response.

The easy ones. Going to use calcite to Substrait since that seems slightly clearer (since things aren't 1:1)
hash => scatter
round robin => round robin
random =>~ round robin
broadcast => broadcast
range => single bucket using range function
singleton =>~ single bucket with a function that always returns one value and declared single partition output
na => multi-bucket

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants