bloomfilter bulk query optimisation with memory prefetching #325
Conversation
Force-pushed from 8fdaaca to af1f146.
Ok, I think this is properly ready now. Summary results:

So while this reduces the CPU cost quite a bit, the I/O of course still dominates (but use of CPU and I/O resources can overlap, so it's still useful).
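For readers unfamiliar with the technique in the PR title: memory prefetching means issuing cache-line prefetch hints for the filter bits of the *next* batch of probes while still hashing the current one, so memory latency overlaps with CPU work. A minimal sketch of the primop plumbing (`prefetchByteArray0#` is a real GHC primop; the wrapper and how the PR interleaves it with probing are assumptions, not this PR's exact code):

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
import Data.Primitive.ByteArray (ByteArray (ByteArray))
import GHC.Exts (Int (I#), prefetchByteArray0#)
import GHC.IO (IO (IO))

-- Issue a read prefetch for the cache line containing the given byte
-- offset of the bloom filter's bit array. A bulk query can fire these
-- for a whole batch of upcoming probes before reading any of them.
prefetchByteArray :: ByteArray -> Int -> IO ()
prefetchByteArray (ByteArray ba#) (I# i#) =
    IO $ \s -> case prefetchByteArray0# ba# i# s of
                 s' -> (# s', () #)
```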
Force-pushed from 7626b6c to 7e0debd.
Feel free to disregard, but IMO the changes related to import grouping can be omitted. The alphabetical ordering that stylish-haskell applies already groups the imported modules by the hierarchy 😛
pure $ (kixs, V.map ioopPageSpan ioops)

model = bimap VU.fromList V.fromList $
          prepLookupsModel (fmap (\x -> (snd3 x, thrd3 x)) runs) lookupss

pure ( map (\(RunIxKeyIx r k) -> (r,k)) (VP.toList kixs)
Maybe a function `toTuple` on `RunIxKeyIx` would be nice.
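Something like the following, perhaps (a sketch mirroring the pattern match in the diff above; `RunIx` and `KeyIx` are assumed to be the index types used there):

```haskell
-- Hypothetical helper, so call sites can write
--   map toTuple (VP.toList kixs)
-- instead of spelling out the lambda each time.
toTuple :: RunIxKeyIx -> (RunIx, KeyIx)
toTuple (RunIxKeyIx r k) = (r, k)
```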
The new optimised bulk query will get its own module too. This will make it easier to compare. It will also keep the generated code smaller, which helps when looking at generated code during optimisation. Also, we will have to keep both versions for better GHC version compatibility.
Force-pushed from 7e0debd to bf62815.
Force-pushed from 9e8b530 to 28537fc.
@jorisdral I've addressed most of the review comments (thanks again, there was a lot here!). I've kept most of the review changes as separate commits (mostly marked FIXUP) to make it clearer what changed. I'll fixup/squash them into their preceding commits before merging.
LGTM
default: True
manual: True
So the flag is always turned on? Doesn't seem right
Yes, but the way it's used is that we only use the faster impl for ghc >= 9.4.
We could just make it unconditional (except for ghc version), but then that's harder to benchmark. But perhaps it was only useful during development and we can drop the flag now. Or we can drop it later.
Oh, no sorry, I misunderstood why `manual` is set to `True`. I thought one would not be able to set the flag to `False` because both `default` and `manual` are set to `True`.
This is the semantics, right?

- By default, it is set to `True`
- If one passes `+bloom-query-fast`, it is set to `True` (no-op)
- If one passes `-bloom-query-fast`, it is set to `False`
Correct.
So it's on by default (for ghc >= 9.4) but you can turn it off manually.
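A sketch of how such a flag typically gates the implementation choice (the module names and the `BLOOM_QUERY_FAST` macro here are assumptions for illustration, not necessarily what this PR uses; the cabal flag would add the macro via `cpp-options`):

```haskell
{-# LANGUAGE CPP #-}
-- Hypothetical dispatch module: pick the optimised bulk query only when
-- the cabal flag defines the macro AND the compiler is ghc >= 9.4;
-- otherwise fall back to the portable implementation.
module BloomQueryDispatch (module Impl) where

#if defined(BLOOM_QUERY_FAST) && __GLASGOW_HASKELL__ >= 904
import BloomFilterQuery2 as Impl
#else
import BloomFilterQuery1 as Impl
#endif
```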
Force-pushed from 95f852c to 1c8643f.
And extend it to benchmark the existing bulk query code. So instead of doing lots of single queries, we generate the queries in batches (currently 256) and run individual backends with each batch. This is the appropriate form for comparing the bulk queries against each other or non-bulk versions.
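For illustration, the batching might look something like this (a sketch; `batchesOf` and `runBulkQuery` are hypothetical names, and 256 matches the batch size mentioned above):

```haskell
-- Split the benchmark's key stream into fixed-size batches, so each
-- backend under comparison is invoked once per batch of keys rather
-- than once per key.
batchesOf :: Int -> [k] -> [[k]]
batchesOf n = go
  where
    go [] = []
    go ks = case splitAt n ks of
              (batch, rest) -> batch : go rest

-- usage sketch: mapM_ (runBulkQuery backend) (batchesOf 256 keys)
```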
This avoids an unnecessary conditional branch (for dividing by zero). Also change BV.unsafeIndex to use Int indexes rather than Word64. It's guaranteed to fit into an Int, and this moves the conversions to more appropriate places, where we can explain why it must fit.
Using setByteArray at type Word8 allows an optimised impl using memset rather than using a loop at type Word64.
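Roughly the following, assuming a primitive `MutableByteArray` underneath (a sketch, not the PR's exact code):

```haskell
import Control.Monad.ST (ST)
import Data.Primitive.ByteArray (MutableByteArray, setByteArray)
import Data.Word (Word8)

-- Filling element-wise at type Word8 lets the primitive library lower
-- the fill to a memset-style operation, instead of compiling to a
-- manual write loop over Word64 elements.
clearFilterBits :: MutableByteArray s -> Int -> ST s ()
clearFilterBits mba lenBytes = setByteArray mba 0 lenBytes (0 :: Word8)
```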
For the bloom filter macrobenchmark (on my laptop with 6 MB L3 cache) this is about 2x faster at 25M entries, and about 3x faster at 100M entries.
And split out a utility for testing class laws in the context of Tasty.
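Such a utility might have a shape like this (`testGroup` and `testProperty` are the standard Tasty combinators; the wrapper name is hypothetical):

```haskell
import Test.Tasty (TestTree, testGroup)
import Test.Tasty.QuickCheck (Property, testProperty)

-- Group a set of named law properties (e.g. associativity, identity)
-- for one class instance into a single Tasty test tree.
testClassLaws :: String -> [(String, Property)] -> TestTree
testClassLaws name laws =
    testGroup (name ++ " laws")
      [ testProperty lawName prop | (lawName, prop) <- laws ]
```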
Change the old one to use the same type as the new one: Data.Vector.Primitive.Vector RunIxKeyIx rather than Data.Vector.Unboxed.Vector (RunIx, KeyIx). The optimised one uses the primitive vector because it's more compact and only needs one array pointer to access it (rather than a pair of arrays for the unboxed vector of pairs). Also, use the old or new impl in the Lookup implementation, depending on the cabal flag bloom-query-fast (default True).
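One way such a packed type can work (a sketch under the assumption that run and key indices each fit in 16 bits; the actual representation in the PR may differ):

```haskell
{-# LANGUAGE DerivingStrategies #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Primitive.Types (Prim)
import Data.Word (Word32)

-- Packing both indices into one Word32 gives a Prim instance for free,
-- so Data.Vector.Primitive stores the pairs in a single flat array:
-- one pointer to chase per access, unlike the two parallel arrays of an
-- unboxed vector of pairs.
newtype RunIxKeyIx = MkRunIxKeyIx Word32
  deriving newtype (Eq, Ord, Prim)

mkRunIxKeyIx :: Int -> Int -> RunIxKeyIx
mkRunIxKeyIx r k =
    MkRunIxKeyIx (fromIntegral r `shiftL` 16 .|. (fromIntegral k .&. 0xffff))

unRunIxKeyIx :: RunIxKeyIx -> (Int, Int)
unRunIxKeyIx (MkRunIxKeyIx w) =
    (fromIntegral (w `shiftR` 16), fromIntegral (w .&. 0xffff))
```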
And eliminate the old bloomQueries that took the initial estimate of the result size as an extra argument. Now both implementations follow the same simple strategy: use double the number of keys as the initial estimate, and then double the output array size each time it overflows.
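The doubling step, in isolation, might look like this (a sketch using Data.Primitive; the element type and names are placeholders):

```haskell
import Control.Monad.ST (ST)
import Data.Primitive.PrimArray
  (MutablePrimArray, getSizeofMutablePrimArray, resizeMutablePrimArray)

-- Grow the output array geometrically: before writing to index ix,
-- make sure it fits, doubling the capacity on overflow. Starting from
-- 2 * number-of-keys, most runs never need to resize at all.
ensureCapacity :: MutablePrimArray s Int -> Int
               -> ST s (MutablePrimArray s Int)
ensureCapacity arr ix = do
    cap <- getSizeofMutablePrimArray arr
    if ix < cap
      then pure arr
      else resizeMutablePrimArray arr (2 * cap)
```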
And true/false negatives and also FPRs. Example output:

bloomQueries (bulk): +++ OK, passed 100 tests.

FPR (100 in total):
24% 40%
19% 30%
18% 20%
12% 100%
11% 10%
10% 50%
 6% 0%

distribution of true/false positives/negatives (73279 in total):
52.962% true negatives
36.723% true positives
10.315% false positives

Also make sure we get coverage of at least 5% of high FPR to ensure coverage of array resizing in Bloom{1,2}.bloomQueries.
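With QuickCheck, that kind of distribution reporting and coverage requirement can be expressed with `tabulate` and `cover` (real combinators; the wrapper below is a hypothetical shape, not the test's actual code):

```haskell
import Test.QuickCheck (Property, cover, property, tabulate)

-- Label each test case with its observed FPR bucket (producing tables
-- like the one above) and require at least 5% of cases to have a high
-- FPR, so the output-array resizing path is reliably exercised.
labelFPR :: Double -> Bool -> Property
labelFPR fpr ok =
    cover 5 (fpr >= 0.9) "high FPR" $
    tabulate "FPR" [show (round (fpr * 100) :: Int) ++ "%"] $
    property ok
```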
For extra sanity checking. Elsewhere we rely on the invariant holding.
Force-pushed from 1c8643f to 21c3b41.
I think this is all ok now. Fixes squashed.