Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bloomfilter bulk query optimisation with memory prefetching #325

Merged
merged 11 commits into from
Oct 3, 2024

Conversation

dcoutts
Copy link
Collaborator

@dcoutts dcoutts commented Aug 6, 2024

For the bloom filter macrobenchmark (on my laptop with 6Mb L3 cache) this is about 2x faster at 25M entries, and about 3x faster at 100M entries.

cabal.project Outdated Show resolved Hide resolved
@dcoutts dcoutts force-pushed the dcoutts/bloomfilter-optimisation-2 branch 6 times, most recently from 8fdaaca to af1f146 Compare August 7, 2024 17:05
@dcoutts
Copy link
Collaborator Author

dcoutts commented Aug 7, 2024

Ok, I think this is properly ready now.

Summary results

  • benchPrepLookups goes from 6.07 to 2.73, so 2.2x better
  • benchLookupsIO goes from 18.42 to 16.28, so 1.13x better

So while this reduces the CPU cost quite a bit, the I/O of course still dominates (but use of CPU and I/O resources can overlap so it's still useful).

@dcoutts dcoutts force-pushed the dcoutts/bloomfilter-optimisation-2 branch from 7626b6c to 7e0debd Compare August 7, 2024 21:59
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to disregard, but IMO the changes related to import grouping can be omitted. The alphabetical ordering that stylish-haskell applies already groups the imported modules by the hierarchy 😛

test/Test/Database/LSMTree/Internal/BloomFilter.hs Outdated Show resolved Hide resolved
bench/macro/lsm-tree-bench-bloomfilter.hs Outdated Show resolved Hide resolved
bench/macro/lsm-tree-bench-bloomfilter.hs Outdated Show resolved Hide resolved
bench/macro/lsm-tree-bench-bloomfilter.hs Outdated Show resolved Hide resolved
src/Database/LSMTree/Internal/BloomFilterQuery2.hs Outdated Show resolved Hide resolved
src/Database/LSMTree/Internal/BloomFilterQuery2.hs Outdated Show resolved Hide resolved
test/Test/Database/LSMTree/Internal/Lookup.hs Show resolved Hide resolved
pure $ (kixs, V.map ioopPageSpan ioops)
model = bimap VU.fromList V.fromList $
prepLookupsModel (fmap (\x -> (snd3 x, thrd3 x)) runs) lookupss
pure ( map (\(RunIxKeyIx r k) -> (r,k)) (VP.toList kixs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a function toTuple on RunIxKeyIx would be nice

The new optimised bulk query will gets its own module too. This will
make it easier to compare. It will also keep the generated code smaller
which helps when looking at generated code during optimisation.

Also, we will have to keep both versions for better GHC version
compatibility.
@dcoutts dcoutts force-pushed the dcoutts/bloomfilter-optimisation-2 branch from 7e0debd to bf62815 Compare October 2, 2024 17:18
@dcoutts dcoutts force-pushed the dcoutts/bloomfilter-optimisation-2 branch 2 times, most recently from 9e8b530 to 28537fc Compare October 2, 2024 17:37
@dcoutts dcoutts requested a review from jorisdral October 2, 2024 17:38
@dcoutts
Copy link
Collaborator Author

dcoutts commented Oct 2, 2024

@jorisdral I've addressed most of the review comments (thanks again, there was a lot here!).

I've kept most of the review changes as separate commits (mostly marked FIXUP) to make it clearer what changed. I'll fixup/squash them into their preceding commits before merging.

Copy link
Collaborator

@jorisdral jorisdral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +105 to +106
default: True
manual: True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the flag is always turned on? Doesn't seem right

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the way it's used is that we only use the faster impl for ghc >= 9.4.

We could just make it unconditional (except for ghc version), but then that's harder to benchmark. But perhaps it was only useful during development and we can drop the flag now. Or we can drop it later.

Copy link
Collaborator

@jorisdral jorisdral Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, no sorry, I misunderstood why manual is set to True. I thought one would not be able to set the flag to False because both default and manual are set to True.

This is the semantics, right?

  • By default, it is set to True
  • If one passes +bloom-query-fast, it is set to True (no-op)
  • If one passes -bloom-query-fast, it is set to False

Copy link
Collaborator Author

@dcoutts dcoutts Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

So it's on by default (for ghc >= 9.4) but you can turn it off manually.

@dcoutts dcoutts force-pushed the dcoutts/bloomfilter-optimisation-2 branch 2 times, most recently from 95f852c to 1c8643f Compare October 3, 2024 09:12
And extend it to benchmark the existing bulk query code.

So instead of doing lots of single queries, we generate the queries in
batches (currently 256) and run individual backends with each batch.

This is the appropriate form for comparing the bulk queries against each
other or non-bulk versions.
This avoids an unecessary conditional branch (for dividing by zero).

Also change BV.unsafeIndex to use Int indexes rather than Word64. It's
guaranteed to fit into an Int and this involves conversions in more
appropriate places, where we can explain why it must fit.
Using setByteArray at type Word8 allows an optimised impl using memset
rather than using a loop at type Word64.
For the bloom filter macrobenchmark (on my laptop with 6Mb L3 cache)
this is about 2x faster at 25M entries, and about 3x faster at 100M
entries.
And split out a utility for testing class laws in the context of Tasty.
Change the old one to use the same one as the new one
Data.Vector.Primitive.Vector RunIxKeyIx
rather than
Data.Vector.Unboxed.Vector (RunIx, KeyIx)

The optimised one uses the primitive vector because it's more compact
and only needs one array pointer to access it (rather than a pair of
arrays for the unboxed vector of pairs).

Also, use the old or new impl in the Lookup implementation, depending on
the cabal flag bloom-query-fast (default True).
and eliminate the old bloomQueries that took as an extra argument the
initial estimate of the result size.

Now both implementations follow the simple strategy of using double the
number of keys as the initial estimate, and then doubling the output
array size each time it overflows.
And true/false negatives and also FPRs.

Example output:
    bloomQueries (bulk):
      +++ OK, passed 100 tests.

      FPR (100 in total):
      24% 40%
      19% 30%
      18% 20%
      12% 100%
      11% 10%
      10% 50%
       6% 0%

      distribution of true/false positives/negatives (73279 in total):
      52.962% true negatives
      36.723% true positives
      10.315% false positives

Also make sure we get coverage of at least 5% of high FPR to ensure
coverage of array resizing in Bloom{1,2}.bloomQueries
For extra sanity checking. Elsewhere we rely on the invariant holding.
@dcoutts dcoutts force-pushed the dcoutts/bloomfilter-optimisation-2 branch from 1c8643f to 21c3b41 Compare October 3, 2024 09:45
@dcoutts
Copy link
Collaborator Author

dcoutts commented Oct 3, 2024

I think this is all ok now. Fixes squashed.

@dcoutts dcoutts enabled auto-merge October 3, 2024 09:47
@dcoutts dcoutts added this pull request to the merge queue Oct 3, 2024
Merged via the queue into main with commit 23fc34f Oct 3, 2024
24 checks passed
@dcoutts dcoutts deleted the dcoutts/bloomfilter-optimisation-2 branch October 3, 2024 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants