Bloom filters untested for large (500M) inputs #22

lemire · 2020-01-18T16:09:38Z

It is likely that the Java Bloom filter implementation suffers from the same issue as the C++ implementation for large inputs FastFilter/fastfilter_cpp#8

thomasmueller · 2020-01-19T20:01:57Z

I found fpp gets bad for 100 million keys and larger.

As far as I see, the solution for the C++ version was to use 128 bit multiplication, which isn't available in Java (at least, not fast). I found the following works for Java:

public void add(long key) {
    long hash = Hash.hash64(key, seed);
    long a = (hash >>> 32) | (hash << 32);
    long b = hash;
    for (int i = 0; i < k; i++) {
        data[Hash.reduce((int) (a >>> 32), arraySize)] |= 1L << a;
        a += b;
    }
}

public boolean mayContain(long key) {
    long hash = Hash.hash64(key, seed);
    long a = (hash >>> 32) | (hash << 32);
    long b = hash;
    for (int i = 0; i < k; i++) {
        if ((data[Hash.reduce((int) (a >>> 32), arraySize)] & 1L << a) == 0) {
            return false;
        }
        a += b;
    }
    return true;
}

Is there anything wrong with the above? Before the change:

size 2000000
BLOOM fpp: 0.008355 size: 2000000 bits/key: 10.0 add ns/key: 60.0 lookup 0% ns/key: 15.0 lookup 100% ns/key: 14.0
size 20000000
BLOOM fpp: 0.00825125 size: 20000000 bits/key: 10.0 add ns/key: 150.0 lookup 0% ns/key: 33.0 lookup 100% ns/key: 58.0
size 200000000
BLOOM fpp: 0.008869465 size: 200000000 bits/key: 10.0 add ns/key: 202.0 lookup 0% ns/key: 45.0 lookup 100% ns/key: 84.0

After the change:
size 2000000
BLOOM fpp: 0.0082 size: 2000000 bits/key: 10.0 add ns/key: 45.0 lookup 0% ns/key: 16.0 lookup 100% ns/key: 14.0
size 20000000
BLOOM fpp: 0.00816025 size: 20000000 bits/key: 10.0 add ns/key: 142.0 lookup 0% ns/key: 33.0 lookup 100% ns/key: 61.0
size 200000000
BLOOM fpp: 0.008209775 size: 200000000 bits/key: 10.0 add ns/key: 177.0 lookup 0% ns/key: 46.0 lookup 100% ns/key: 95.0

Notice fpp for 200'000'000 is much better (0.008209775 instead of 0.008869465).

Speed it not affected it seems.

I can't test with >= 300 million keys.

lemire · 2020-01-19T20:03:37Z

It looks good to me.

thomasmueller · 2020-01-19T20:15:36Z

So the change is committed, but there is no test case yet. It's a bit hard to write one, as running it would require a lot of memory, and it would be slow.

It should be possible to write a test case with much less memory (by only remembering data of a subset of the array; a small window, and forget the rest). The test wouldn't use the real implementation. I'm not sure on how such a test could be made fast.

thomasmueller added a commit that referenced this issue Jan 19, 2020

Bloom filters untested for large (500M) inputs #22

71781a0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bloom filters untested for large (500M) inputs #22

Bloom filters untested for large (500M) inputs #22

lemire commented Jan 18, 2020 •

edited

Loading

thomasmueller commented Jan 19, 2020

lemire commented Jan 19, 2020

thomasmueller commented Jan 19, 2020

Bloom filters untested for large (500M) inputs #22

Bloom filters untested for large (500M) inputs #22

Comments

lemire commented Jan 18, 2020 • edited Loading

thomasmueller commented Jan 19, 2020

lemire commented Jan 19, 2020

thomasmueller commented Jan 19, 2020

lemire commented Jan 18, 2020 •

edited

Loading