Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better Spatial Join Implementation for Magellan #208

Open
harsha2010 opened this issue Mar 16, 2018 · 0 comments
Open

Better Spatial Join Implementation for Magellan #208

harsha2010 opened this issue Mar 16, 2018 · 0 comments
Milestone

Comments

@harsha2010
Copy link
Owner

The ZOrderCurve Index computed by Magellan today subdivides space into equal sized grids of a given precision.
This can easily be changed to compute curves of a precision between min, max (or other requirements like I want atmost n curves taken together to contain the geometry)
The advantage of doing this is that the index is easy to compute and fits into memory on a single node more easily.
The disadvantage is that we can no longer use the underlying join implementations in Spark at the Physical Plan layer
e.g in point within polygon computation, we'd end up with a single curve of precision say 001100 for the point but maybe a couple of curves of the form 001 and 0100 for the polygon.

A hash join would now not work. Instead we'd need to construct a trie and walk down that data structure to figure out where to send each point.
If the trie is small enough, this is easy.. but if the trie is bigger, we need the trie to be distributed and a partitioning scheme that can be efficient
We also need logic at the physical layer to determine if the trie should be distributed or localized on the driver (i.e. shuffled or streaming join)

All this makes the implementation a lot more complex so it has been lower priority.

The main reasons to implement a better join would be:

  • it is easier to generalize this to all types of Joins apart from point within polygon
  • it might be cheaper to compute the index in some cases
  • today users of Magellan need to know the exact precision to use: the above algorithm would be more forgiving, you can specify a max # of curves for example which is easier to estimate
@harsha2010 harsha2010 added this to the 1.0.6 milestone Mar 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant