Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XorBinaryFuse8 failed to construct on some data, at around 11511 entries #32

Closed
konghuarukhr opened this issue Dec 19, 2021 · 11 comments
Closed

Comments

@konghuarukhr
Copy link

I find it is not fixed, this is another construction failure:

    public static void main(String[] args) {
        int n = 11511;
        long[] keys = new long[n];
        for (int i = 0; i < n; i++) {
            keys[i] = i;
        }
        XorBinaryFuse8.construct(keys);
    }

will report:

...

11511 keys; arrayLength 14336 reverseOrderPos 6371
Exception in thread "main" java.lang.IllegalStateException
	at org.fastfilter.xor.XorBinaryFuse8.addAll(XorBinaryFuse8.java:219)

Originally posted by @konghuarukhr in #31 (comment)

@thomasmueller
Copy link
Contributor

This very much looks like FastFilter/xorfilter#23

@thomasmueller thomasmueller changed the title XorBinaryFuse8 failed to construct on some data, still happen XorBinaryFuse8 failed to construct on some data, at around 11511 entries Dec 20, 2021
@thomasmueller
Copy link
Contributor

I also found other cases where construction requires more retries than I would expect... One way to resolve the issue is to change the constants in calculateSegmentLength and calculateSizeFactor; I'm trying that now.

@thomasmueller
Copy link
Contributor

I found constants that are more reliable...

--- a/fastfilter/src/main/java/org/fastfilter/xor/XorBinaryFuse8.java
+++ b/fastfilter/src/main/java/org/fastfilter/xor/XorBinaryFuse8.java
@@ -42,7 +42,8 @@ public class XorBinaryFuse8 implements Filter {
     static int calculateSegmentLength(int arity, int size) {
         int segmentLength;
         if (arity == 3) {
-            segmentLength = 1 << (int) Math.floor(Math.log(size) / Math.log(3.33) + 2.25);
+             segmentLength = 1 << (int) Math.floor(Math.log(size) / Math.log(3.33) + 2.11);
@@ -55,7 +56,8 @@ public class XorBinaryFuse8 implements Filter {
     static double calculateSizeFactor(int arity, int size) {
         double sizeFactor;
         if (arity == 3) {
-            sizeFactor = Math.max(1.125, 0.875 + 0.25 * Math.log(1000000) / Math.log(size));
+             sizeFactor = Math.max(1.125, 0.88 + 0.25 * Math.log(1000000) / Math.log(size));

Test case:

    @Test
    public void smallSizes() {
        for (int n = 1; n < 20000; n++) {
            long[] keys = new long[n];
            for (int i = 0; i < n; i++) {
                keys[i] = i;
            }
            XorBinaryFuse8 f = XorBinaryFuse8.construct(keys);
            if (n % 100 == 0) {
                System.out.println("n=" + n + " " + f.toString());
            }            
        }
        for (int n = 20000, x = 0; n < 1000000; n = (int) ((n * 1.0005) + 10), x++) {
            long[] keys = new long[n];
            for (int i = 0; i < n; i++) {
                keys[i] = i;
            }
            XorBinaryFuse8 f = XorBinaryFuse8.construct(keys);
            if (x % 100 == 0) {
                System.out.println("n=" + n + " " + f.toString());
            }
        }
    }

We would then have a maximum of around 30 re-tries at size 2100 .. 2116 (segment length), but no more than that. This is roughly in the middle of this segment length; we have 9 segments.

@lemire I wonder what is your opinion here...

@lemire
Copy link
Member

lemire commented Dec 20, 2021

@thomasmueller I tested both the C version as well as the Go version with n = 11511 and it succeeds but there are many more than 30 re-tries. I am not sure it is a source of concern.

We could update our constants.

@thomasmueller
Copy link
Contributor

Ah, I think the Java version is worse because it uses a different "reduce" function (which uses only 32 bits, instead of 64 bits like in the C / C++ version). I will see if this can be fixed. And then I will to reduce the number of retries.

@thomasmueller
Copy link
Contributor

One of the challenges is that Java still doesn't support an "unsigned 64-bit multiply high" -- see https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8188044 -- so we can't have exactly the same code in Java and C / C++, if we want to keep it at the current (high) performance.

However, I think I found constants that work well. Now I need to write a proper test case and clean this up.

@lemire
Copy link
Member

lemire commented Dec 21, 2021

@thomasmueller The link you offer suggest that this was fixed in Java 18.

@thomasmueller
Copy link
Contributor

@lemire you are right it is fixed in Java 18 (I initially didn't understand this part sorry), but I think we should support older versions of Java as well.

I made a mistake and used a different hash function to search better constants... so I need to test again, and it will take a bit longer.

@thomasmueller
Copy link
Contributor

OK, I believe it is now fixed. The main problem was actually this:

    // use a new random numbers
    seed++;

That was in no way correct... I replaced it with seed = Hash.randomSeed(); now. It was specially a problem if you use incremental keys, like you did in your test!

But I also changed the segment sizes a bit, because I found that construction requires less retries. Now segments tend to be a bit smaller, specially for small sets. And now a IllegalArgumentException is thrown if construction fails.

@thomasmueller
Copy link
Contributor

@konghuarukhr Let me know if this resolves the problem!

@konghuarukhr
Copy link
Author

I don't find any bad case now. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants