Super Node Series One: Introduce proxy node type #4420
Draft
This is a first attempt at solving a famous problem in the graph database world: the super node. A brief introduction to the super node problem, and the status quo of JanusGraph, is documented at #2717. In a nutshell, JanusGraph has already partially addressed the traversal performance issue, but not the memory/storage issue.
Without native support for super nodes in JanusGraph (and many other graph databases), a lot of users end up building their own workarounds. Probably the most popular approach is to create "meta" vertices, or "proxy" vertices, and let the application layer redistribute the traversals and data, which is cumbersome and error-prone.
This PR aims to introduce proxy nodes, which work like the partition node before, except that proxy vertices are created only on-demand. There are two basic requirements:
Note that the existing (but discouraged) partition node (a.k.a. the `vertex cut` feature) addresses the 1st requirement well but performs very badly on the 2nd requirement.
It takes a few steps to fully address the 1st requirement, and this PR only addresses it partially. This PR requires users to explicitly create proxy nodes, connect them with the canonical node, and EXPLICITLY connect edges to the proxy nodes. This mostly fulfills the 2nd requirement: when there is no need, no new overhead is introduced. The drawback is that the write path is pretty cumbersome, but the good news is that the read path offers a seamless experience. Users can run normal traversal queries as if proxy nodes did not exist.
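The explicit write path and the transparent read path can be illustrated with a small in-memory sketch. It does not use the JanusGraph API and is not the implementation in this PR: the class `ProxyNodeSketch`, its `Vertex` type, and all method names are hypothetical. Only the two property names, `proxies` and `canonicalId`, are taken from the design described in this PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model; none of these types are JanusGraph classes.
final class ProxyNodeSketch {

    static final class Vertex {
        final long id;
        // On a canonical vertex: IDs of its proxy vertices (empty if it has none).
        final List<Long> proxies = new ArrayList<>();
        // On a proxy vertex: ID of the canonical vertex; null on ordinary vertices.
        Long canonicalId;
        final List<Long> out = new ArrayList<>();
        final List<Long> in = new ArrayList<>();

        Vertex(long id) {
            this.id = id;
        }
    }

    private final Map<Long, Vertex> vertices = new HashMap<>();
    private long nextId = 1;

    Vertex addVertex() {
        Vertex v = new Vertex(nextId++);
        vertices.put(v.id, v);
        return v;
    }

    // Write path, step 1: the caller explicitly creates a proxy and links it
    // to the canonical vertex in both directions.
    Vertex addProxy(Vertex canonical) {
        Vertex proxy = addVertex();
        proxy.canonicalId = canonical.id;
        canonical.proxies.add(proxy.id);
        return proxy;
    }

    // Write path, step 2: the caller explicitly chooses which vertex
    // (canonical or proxy) an edge is attached to.
    void addEdge(Vertex from, Vertex to) {
        from.out.add(to.id);
        to.in.add(from.id);
    }

    // A proxy ID resolves to its canonical ID; any other ID resolves to itself.
    private long resolve(long id) {
        Long canonical = vertices.get(id).canonicalId;
        return canonical != null ? canonical : id;
    }

    // Read path: collect neighbours of v and of all its proxies, then replace
    // any proxy found at the far end of an edge with its canonical vertex.
    List<Long> neighbors(Vertex v, boolean outgoing) {
        List<Long> sources = new ArrayList<>();
        sources.add(v.id);
        sources.addAll(v.proxies);

        List<Long> result = new ArrayList<>();
        for (long sourceId : sources) {
            Vertex source = vertices.get(sourceId);
            for (long neighborId : outgoing ? source.out : source.in) {
                result.add(resolve(neighborId));
            }
        }
        return result;
    }
}
```

In this sketch, a caller that attaches an edge to a proxy of `Va` and later asks for the outgoing neighbours of `Va` gets the target vertex back without mentioning the proxy, and asking for the incoming neighbours of that target yields `Va` rather than the proxy. The cost of the extra lookups is only paid by vertices that actually have entries in `proxies`.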
The brief design is as follows: let's say A is a super node, and A connects to a number of vertices with different labels. Let's say we have a proxy node for A, `Vpa`. We store `proxies`, an array of IDs, as a vertex property in the canonical node `Va`. Conversely, we store `canonicalId`, the ID of `Va`, as a vertex property in `Vpa`. Every time we need to traverse from `Va`, we always fetch `Vpa`, and do the traversal from there. Let's say `Va` -.-.-.-> `Vpa` -------> `Vb`, then when we traverse from `Vb`, we will find `Vpa` first, and then we retrieve `Va` because `Vpa` is just a proxy for `Va`.
TODOs in this PR:
TODOs in subsequent PRs:
C.C. @dxtr-1-0 @rngcntr who expressed interest in this project
Thank you for contributing to JanusGraph!
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
For code changes:
For documentation related changes: