feat: add join_multiset() #804

Merged: 3 commits, Jun 30, 2023
@zzlk (Contributor) commented Jun 30, 2023

Also removes the documentation about HalfJoinMultiset; the way to access that functionality now is to use join_multiset().

fixes #802

/// > 2 input streams of type <(K, V1)> and <(K, V2)>, 1 output stream of type <(K, (V1, V2))>
///
/// This operator is equivalent to `join` except that the LHS and RHS are collected into multisets rather than sets before joining.

@zzlk (Contributor Author) commented:

Hmm, order is preserved, I think, so "multisets" might not be the right word.

Contributor replied:

Given that there are many (multiset) join algorithms, that seems like a side-effect of the implementation that we may not want to provide as a guarantee, so I wouldn't sweat it.
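
To illustrate the point about ordering, here is a minimal sketch in plain Rust. It is not Hydroflow code, and the function names `nested_loop_join` and `hash_join` are made up for this example. Both functions keep duplicates, so both are multiset joins, yet they emit the same tuples in different orders. That is why output order is probably best treated as an implementation detail rather than a guarantee.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Multiset join by brute force: every LHS tuple is compared with every RHS tuple.
/// Output follows LHS order first, then RHS order.
fn nested_loop_join<K, V1, V2>(lhs: &[(K, V1)], rhs: &[(K, V2)]) -> Vec<(K, (V1, V2))>
where
    K: Clone + Eq,
    V1: Clone,
    V2: Clone,
{
    let mut out = Vec::new();
    for (lk, lv) in lhs {
        for (rk, rv) in rhs {
            if lk == rk {
                out.push((lk.clone(), (lv.clone(), rv.clone())));
            }
        }
    }
    out
}

/// Multiset join with a hash table built over the LHS and probed by the RHS.
/// Output follows RHS order first, then LHS insertion order within each key.
fn hash_join<K, V1, V2>(lhs: &[(K, V1)], rhs: &[(K, V2)]) -> Vec<(K, (V1, V2))>
where
    K: Clone + Eq + Hash,
    V1: Clone,
    V2: Clone,
{
    // Build phase: group LHS values by key, keeping duplicates.
    let mut table: HashMap<&K, Vec<&V1>> = HashMap::new();
    for (k, v) in lhs {
        table.entry(k).or_default().push(v);
    }

    // Probe phase: look up each RHS tuple and emit one output per stored match.
    let mut out = Vec::new();
    for (k, v2) in rhs {
        if let Some(matches) = table.get(k) {
            for v1 in matches {
                out.push((k.clone(), ((*v1).clone(), v2.clone())));
            }
        }
    }
    out
}
```

With `lhs = [("a", 1), ("b", 2), ("a", 3)]` and `rhs = [("a", "x"), ("b", "y")]`, the two functions return the same three tuples in a different sequence; sorting both outputs makes them equal.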

@zzlk requested a review from MingweiSamuel June 30, 2023 21:06

@jhellerstein (Contributor) left a comment

Minor comments, non-critical


/// For example:
/// ```hydroflow
/// lhs = source_iter([("a", 0), ("a", 0)]) -> tee();
/// rhs = source_iter([("a", 0)]) -> tee();

Contributor commented:

I'd encourage you to make the RHS distinguishable to avoid confusion in the example.
rhs = source_iter([("a", "hydro")]) -> tee();

Contributor commented:

@zzlk please fix the assert below the code as well

@MingweiSamuel (Member) left a comment

Do we also want to yank the extra functionality out of normal join? (Am I remembering correctly that that's how it worked before?)

@MingweiSamuel (Member) commented:

Or in a separate commit

@zzlk (Contributor Author) commented Jun 30, 2023

> Do we also want to yank the extra functionality out of normal join? (Am I remembering correctly that that's how it worked before?)

I think we can do it later. I removed the documentation for it, so people shouldn't use it. The equivalent functionality is now exposed via join_multiset(), and we also need it because join_multiset works just by generating join::();
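
As a rough illustration of that delegation pattern, here is a minimal sketch in plain Rust. It is an assumption-laden toy, not Hydroflow's implementation: the type argument in the comment above did not survive on this page, so `HalfMode`, `SetHalf`, `MultisetHalf`, `collect_half`, and the generic `join` below are all hypothetical names. The idea shown is only that a multiset join can be a thin wrapper that instantiates a generic join with a "keep duplicates" mode for each side.

```rust
/// Hypothetical marker trait: how one side of the join stores its input.
trait HalfMode {
    const KEEP_DUPLICATES: bool;
}

/// Hypothetical mode: duplicates on this side are dropped (set semantics).
struct SetHalf;
/// Hypothetical mode: duplicates on this side are kept (multiset semantics).
struct MultisetHalf;

impl HalfMode for SetHalf {
    const KEEP_DUPLICATES: bool = false;
}
impl HalfMode for MultisetHalf {
    const KEEP_DUPLICATES: bool = true;
}

/// Collects one side of the join according to the chosen mode.
/// The linear `contains` scan is fine for a sketch but quadratic in general.
fn collect_half<M: HalfMode, T: Clone + PartialEq>(items: &[T]) -> Vec<T> {
    let mut kept: Vec<T> = Vec::new();
    for item in items {
        if M::KEEP_DUPLICATES || !kept.contains(item) {
            kept.push(item.clone());
        }
    }
    kept
}

/// Generic join, parameterized by the storage mode of each side.
fn join<L: HalfMode, R: HalfMode, K, V1, V2>(
    lhs: &[(K, V1)],
    rhs: &[(K, V2)],
) -> Vec<(K, (V1, V2))>
where
    K: Clone + PartialEq,
    V1: Clone + PartialEq,
    V2: Clone + PartialEq,
{
    let lhs = collect_half::<L, _>(lhs);
    let rhs = collect_half::<R, _>(rhs);
    let mut out = Vec::new();
    for (lk, lv) in &lhs {
        for (rk, rv) in &rhs {
            if lk == rk {
                out.push((lk.clone(), (lv.clone(), rv.clone())));
            }
        }
    }
    out
}

/// The multiset variant only picks the type arguments and delegates.
fn join_multiset<K, V1, V2>(lhs: &[(K, V1)], rhs: &[(K, V2)]) -> Vec<(K, (V1, V2))>
where
    K: Clone + PartialEq,
    V1: Clone + PartialEq,
    V2: Clone + PartialEq,
{
    join::<MultisetHalf, MultisetHalf, _, _, _>(lhs, rhs)
}
```

Under these assumptions, joining `[("a", 0), ("a", 0)]` with `[("a", "hydro")]` yields one tuple through `join::<SetHalf, SetHalf, _, _, _>` and two through `join_multiset`.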

@zzlk merged commit 0105246 into hydro-project:main Jun 30, 2023
9 of 10 checks passed
nickjiang2378 pushed a commit to nickjiang2378/hydroflow that referenced this pull request Jan 24, 2024
* feat: add join_multiset()

also remove documentation about HalfJoinMultiset, the way to access
that now is to use join_multiset()

* address comments

* fix assert
nickjiang2378 pushed a commit to nickjiang2378/hydroflow that referenced this pull request Jan 25, 2024
Successfully merging this pull request may close these issues.

change join() back to set join and introduce join_multiset()
3 participants