bloomfilter bulk query optimisation with memory prefetching #325
Conversation
Force-pushed from 8fdaaca to af1f146.
Ok, I think this is properly ready now. Summary results:

So while this reduces the CPU cost quite a bit, the I/O of course still dominates (but use of CPU and I/O resources can overlap, so it's still useful).
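For readers unfamiliar with the technique in the PR title: memory prefetching means issuing cache-line prefetch hints for the filter bits of the *next* batch of probes while still hashing the current one, so memory latency overlaps with CPU work. A minimal sketch of the primop plumbing (`prefetchByteArray0#` is a real GHC primop; the wrapper and how the PR interleaves it with probing are assumptions, not this PR's exact code):

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
import Data.Primitive.ByteArray (ByteArray (ByteArray))
import GHC.Exts (Int (I#), prefetchByteArray0#)
import GHC.IO (IO (IO))

-- Issue a read prefetch for the cache line containing the given byte
-- offset of the bloom filter's bit array. A bulk query can fire these
-- for a whole batch of upcoming probes before reading any of them.
prefetchByteArray :: ByteArray -> Int -> IO ()
prefetchByteArray (ByteArray ba#) (I# i#) =
    IO $ \s -> case prefetchByteArray0# ba# i# s of
                 s' -> (# s', () #)
```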
Force-pushed from 7626b6c to 7e0debd.
Feel free to disregard, but IMO the changes related to import grouping can be omitted. The alphabetical ordering that stylish-haskell applies already groups the imported modules by the hierarchy 😛
pure $ (kixs, V.map ioopPageSpan ioops)

model = bimap VU.fromList V.fromList $
          prepLookupsModel (fmap (\x -> (snd3 x, thrd3 x)) runs) lookupss

pure ( map (\(RunIxKeyIx r k) -> (r,k)) (VP.toList kixs)
Maybe a function `toTuple` on `RunIxKeyIx` would be nice.
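Something like the following, perhaps (a sketch mirroring the pattern match in the diff above; `RunIx` and `KeyIx` are assumed to be the index types used there):

```haskell
-- Hypothetical helper, so call sites can write
--   map toTuple (VP.toList kixs)
-- instead of spelling out the lambda each time.
toTuple :: RunIxKeyIx -> (RunIx, KeyIx)
toTuple (RunIxKeyIx r k) = (r, k)
```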
The new optimised bulk query will get its own module too. This will make it easier to compare. It will also keep the generated code smaller, which helps when looking at generated code during optimisation. Also, we will have to keep both versions for better GHC version compatibility.
Force-pushed from 7e0debd to bf62815.
Force-pushed from 9e8b530 to 28537fc.
@jorisdral I've addressed most of the review comments (thanks again, there was a lot here!). I've kept most of the review changes as separate commits (mostly marked FIXUP) to make it clearer what changed. I'll fixup/squash them into their preceding commits before merging.
LGTM
default: True
manual: True
So the flag is always turned on? Doesn't seem right
Yes, but the way it's used is that we only use the faster impl for ghc >= 9.4.
We could just make it unconditional (except for ghc version), but then that's harder to benchmark. But perhaps it was only useful during development and we can drop the flag now. Or we can drop it later.
Oh, no sorry, I misunderstood why `manual` is set to `True`. I thought one would not be able to set the flag to `False` because both `default` and `manual` are set to `True`.
This is the semantics, right?

- By default, it is set to `True`
- If one passes `+bloom-query-fast`, it is set to `True` (no-op)
- If one passes `-bloom-query-fast`, it is set to `False`
Correct.
So it's on by default (for ghc >= 9.4) but you can turn it off manually.
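A sketch of how such a flag typically gates the implementation choice (the module names and the `BLOOM_QUERY_FAST` macro here are assumptions for illustration, not necessarily what this PR uses; the cabal flag would add the macro via `cpp-options`):

```haskell
{-# LANGUAGE CPP #-}
-- Hypothetical dispatch module: pick the optimised bulk query only when
-- the cabal flag defines the macro AND the compiler is ghc >= 9.4;
-- otherwise fall back to the portable implementation.
module BloomQueryDispatch (module Impl) where

#if defined(BLOOM_QUERY_FAST) && __GLASGOW_HASKELL__ >= 904
import BloomFilterQuery2 as Impl
#else
import BloomFilterQuery1 as Impl
#endif
```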
Force-pushed from 95f852c to 1c8643f.
And extend it to benchmark the existing bulk query code. So instead of doing lots of single queries, we generate the queries in batches (currently 256) and run individual backends with each batch. This is the appropriate form for comparing the bulk queries against each other or non-bulk versions.
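For illustration, the batching might look something like this (a sketch; `batchesOf` and `runBulkQuery` are hypothetical names, and 256 matches the batch size mentioned above):

```haskell
-- Split the benchmark's key stream into fixed-size batches, so each
-- backend under comparison is invoked once per batch of keys rather
-- than once per key.
batchesOf :: Int -> [k] -> [[k]]
batchesOf n = go
  where
    go [] = []
    go ks = case splitAt n ks of
              (batch, rest) -> batch : go rest

-- usage sketch: mapM_ (runBulkQuery backend) (batchesOf 256 keys)
```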
This avoids an unnecessary conditional branch (for dividing by zero). Also change BV.unsafeIndex to use Int indexes rather than Word64. It's guaranteed to fit into an Int, and this moves the conversions to more appropriate places, where we can explain why it must fit.
Using setByteArray at type Word8 allows an optimised impl using memset rather than using a loop at type Word64.
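Roughly the following, assuming a primitive `MutableByteArray` underneath (a sketch, not the PR's exact code):

```haskell
import Control.Monad.ST (ST)
import Data.Primitive.ByteArray (MutableByteArray, setByteArray)
import Data.Word (Word8)

-- Filling element-wise at type Word8 lets the primitive library lower
-- the fill to a memset-style operation, instead of compiling to a
-- manual write loop over Word64 elements.
clearFilterBits :: MutableByteArray s -> Int -> ST s ()
clearFilterBits mba lenBytes = setByteArray mba 0 lenBytes (0 :: Word8)
```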
For the bloom filter macrobenchmark (on my laptop with 6 MB L3 cache) this is about 2x faster at 25M entries, and about 3x faster at 100M entries.
And split out a utility for testing class laws in the context of Tasty.
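Such a utility might have a shape like this (`testGroup` and `testProperty` are the standard Tasty combinators; the wrapper name is hypothetical):

```haskell
import Test.Tasty (TestTree, testGroup)
import Test.Tasty.QuickCheck (Property, testProperty)

-- Group a set of named law properties (e.g. associativity, identity)
-- for one class instance into a single Tasty test tree.
testClassLaws :: String -> [(String, Property)] -> TestTree
testClassLaws name laws =
    testGroup (name ++ " laws")
      [ testProperty lawName prop | (lawName, prop) <- laws ]
```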
Change the old one to use the same type as the new one: Data.Vector.Primitive.Vector RunIxKeyIx rather than Data.Vector.Unboxed.Vector (RunIx, KeyIx). The optimised one uses the primitive vector because it's more compact and only needs one array pointer to access it (rather than a pair of arrays for the unboxed vector of pairs). Also, use the old or new impl in the Lookup implementation, depending on the cabal flag bloom-query-fast (default True).
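One way such a packed type can work (a sketch under the assumption that run and key indices each fit in 16 bits; the actual representation in the PR may differ):

```haskell
{-# LANGUAGE DerivingStrategies #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Primitive.Types (Prim)
import Data.Word (Word32)

-- Packing both indices into one Word32 gives a Prim instance for free,
-- so Data.Vector.Primitive stores the pairs in a single flat array:
-- one pointer to chase per access, unlike the two parallel arrays of an
-- unboxed vector of pairs.
newtype RunIxKeyIx = MkRunIxKeyIx Word32
  deriving newtype (Eq, Ord, Prim)

mkRunIxKeyIx :: Int -> Int -> RunIxKeyIx
mkRunIxKeyIx r k =
    MkRunIxKeyIx (fromIntegral r `shiftL` 16 .|. (fromIntegral k .&. 0xffff))

unRunIxKeyIx :: RunIxKeyIx -> (Int, Int)
unRunIxKeyIx (MkRunIxKeyIx w) =
    (fromIntegral (w `shiftR` 16), fromIntegral (w .&. 0xffff))
```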
And eliminate the old bloomQueries that took the initial estimate of the result size as an extra argument. Now both implementations follow the same simple strategy: use double the number of keys as the initial estimate, and then double the output array size each time it overflows.
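The doubling step, in isolation, might look like this (a sketch using Data.Primitive; the element type and names are placeholders):

```haskell
import Control.Monad.ST (ST)
import Data.Primitive.PrimArray
  (MutablePrimArray, getSizeofMutablePrimArray, resizeMutablePrimArray)

-- Grow the output array geometrically: before writing to index ix,
-- make sure it fits, doubling the capacity on overflow. Starting from
-- 2 * number-of-keys, most runs never need to resize at all.
ensureCapacity :: MutablePrimArray s Int -> Int
               -> ST s (MutablePrimArray s Int)
ensureCapacity arr ix = do
    cap <- getSizeofMutablePrimArray arr
    if ix < cap
      then pure arr
      else resizeMutablePrimArray arr (2 * cap)
```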
And true/false negatives and also FPRs. Example output:

bloomQueries (bulk): +++ OK, passed 100 tests.

FPR (100 in total):
24% 40%
19% 30%
18% 20%
12% 100%
11% 10%
10% 50%
 6% 0%

distribution of true/false positives/negatives (73279 in total):
52.962% true negatives
36.723% true positives
10.315% false positives

Also make sure we get coverage of at least 5% of high FPR to ensure coverage of array resizing in Bloom{1,2}.bloomQueries.
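With QuickCheck, that kind of distribution reporting and coverage requirement can be expressed with `tabulate` and `cover` (real combinators; the wrapper below is a hypothetical shape, not the test's actual code):

```haskell
import Test.QuickCheck (Property, cover, property, tabulate)

-- Label each test case with its observed FPR bucket (producing tables
-- like the one above) and require at least 5% of cases to have a high
-- FPR, so the output-array resizing path is reliably exercised.
labelFPR :: Double -> Bool -> Property
labelFPR fpr ok =
    cover 5 (fpr >= 0.9) "high FPR" $
    tabulate "FPR" [show (round (fpr * 100) :: Int) ++ "%"] $
    property ok
```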
For extra sanity checking. Elsewhere we rely on the invariant holding.
Force-pushed from 1c8643f to 21c3b41.
I think this is all ok now. Fixes squashed.