-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue #454: Next in line Rollup #455
Conversation
I like the idea of making the code more efficient, but this line concerns me:
It makes it seem very use-case related, and perhaps we don't know yet the full implications of it. I would suggest adding it as configurable and disabling it by default until we are confident enough. We can do something like:
|
ps: you can set up |
I also read this sentence in the original issue description, which called my attention:
Maybe it is just an incorrect choice of words, but suspect implies that we are not sure. How much confidence do we have that this is actually solving the observed problem? |
@ Jiaweihu08's concerns are valid, but I suspect it is not a real problem. The overhead would be minimal in a low-dimensional space, while in a high-dimensional one, the problem of having huge files counterbalances any additional computation cost. In any case, let's run this compared with the main version and measure it. If it is considerably slower, we should make this faster rather than optional. We should be able to make the new algorithm as computationally intensive as the old one, with a more invasive code change, using a Trie instead of a HashMap. In any case, this should not be a configuration parameter because the indexing results are always better than the old version (in the worst-case scenario, they are equals). @fpj This is the ordered list of files with the new version: The largest file is 15GB, while in the old version, the biggest file was 220G, the second 106G, etc. (in this dataset, some files are bigger, because some rows are larger ~256MB)
When I have time, I'll run a more detailed analysis. You can also use the script posted in #454 to compare the two versions. You should look for the number of cubes in a file and the actual file sizes (which are not reported in the IndexMetrics right now). |
Concern number 1:The priority queue is ordered by depth while For the algorithm to work correctly, we must ensure the cubes are checked in the same order, i.e., that of the z-order. Given a set of cubes, say If the first four Upon closer inspection, we see that a cube's group is not removed from Concern number 2:When indexing a large number of columns, the number of sibling cubes to consider can be high. |
Thanks @Jiaweihu08 ! I don't understand the comment: On concert number 2, yes, on a high dimension that could be an issue, but we should measure it. As I commented, it can be easily fixed by using a Trie Map, instead of a hash Map, or maybe simply iterating on the queue. I'll try to fix this without too many changes. |
As the code is written, |
CubeIds can be sorted; just create the Priority Queue according to CubeIds, in reverse order. |
Replaced by #514 |
Description
As described in issue #454, I'm having some issues indexing high-dimensional spaces as the rollup puts all data in the father, which might result in very large files. This PR solves this problem by rolling up between siblings before rolling up to the father. This results in better file size, but as pointed out by @Jiaweihu08, it could make the algorithm more computation-intensive.
Type of change
This PR changes the Rollup algorithm, trying to pack siblings' cubes first and then packing them with the father.
Checklist: