Potential race condition #33
This has led me to wonder about the choice of a Bloom filter for this purpose. Usually the overhead is dozens of bytes per item, and when operating on a per-language basis that is good enough even for web-scale datasets (~1B items). Things are certainly different at the paragraph level, but it seems to me that the current code treats individual sentences (text.split('\n')) as paragraphs. In that case there are better approaches, such as suffix arrays, or text normalization to increase the chance of collision and therefore removal.
The race condition is acknowledged, and I deemed it a worthwhile trade-off for performance. At typical sizes for the filter (1TB the last time I ran it), the chance of two insertions colliding in a way that matters is astronomically small. And even if a collision does happen, the results are not catastrophic: it might make the wrong decision about a particular n-gram. Not a big deal. It would be a big deal if this happened systematically depending on the content, but it does not. The chance of collisions due to a straightforward hash collision, or due to the fuzzy nature of Bloom filtering itself, is much higher. It bothers me that this means the Bloom filter is not deterministic, but that ship already sailed when we decided to run multiple threads at the same time. With the Bloom filter approach, the order of documents always matters, and with threading, the order is not deterministic. This is going to have a much bigger effect than the occasional race condition.
You're welcome to try other approaches, but consider some numbers before you do: if we're running this on all 2T tokens, we can use a 2TB Bloom filter to achieve a false positive rate of about 2%. Close enough. I'd prefer a 4TB filter and get 0.1%, but those machines are hard to get on AWS (and in practice we don't run it over all of Dolma anyway). However, if you do this with an exact membership filter and you use only 64-bit hashes, you'd need 16TB just for the hashes alone.
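For reference, here is a quick back-of-the-envelope check of those numbers using the standard Bloom filter approximation p ≈ (1 − e^(−kn/m))^k; the ~2T item count and the choice of an optimal k are assumptions for illustration, not dolma's actual configuration:

```rust
// Rough false-positive estimate for a Bloom filter of m bits holding n items,
// assuming the optimal number of hash functions k ≈ (m/n) * ln 2.
// The item count and filter sizes below are illustrative assumptions.
fn false_positive_rate(m_bits: f64, n_items: f64) -> f64 {
    let k = (m_bits / n_items * std::f64::consts::LN_2).round().max(1.0);
    (1.0 - (-k * n_items / m_bits).exp()).powf(k)
}

fn main() {
    let n = 2.0e12; // ~2T inserted n-grams (assumed)
    for tb in [2.0_f64, 4.0] {
        let m_bits = tb * 8.0e12; // TB -> bits
        println!("{tb} TB filter: ~{:.3}%", 100.0 * false_positive_rate(m_bits, n));
    }
    // An exact filter storing 64-bit hashes needs 2e12 * 8 bytes = 16 TB for the hashes alone.
}
```

This lands in the same ballpark as the figures above (around 2% for a 2TB filter, and well under 0.1% for 4TB with an optimal k); the exact rate depends on the number of hash functions the filter actually uses.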
Yes, I think your answer is what I was looking for, and I generally agree. In this context, that could be addressed by documenting the trade-off somewhere in the code or documentation. The exact membership filters I'm familiar with operate not at the n-gram or even paragraph level, but rather at the document level; what I had in mind was to use an exact hash in a per-document, per-language context.
The main reason I opened this issue in the first place was to confirm my suspicion that there was a race, given how difficult it is to reason about concurrency; I wasn't sure it actually existed. From Discord communications with @soldni and others, it seems it was identified but also considered acceptable.
I have been looking into https://github.com/allenai/dolma/blob/main/src/bloom_filter.rs, specifically at how it is thread-safe.
Here, while access to each individual AtomicU32 value within the array is atomic, the method checks a condition over multiple AtomicU32 values without holding a lock on the Vec. As such, multiple threads can read and write individual AtomicU32 values independently of each other. If one or more of the values change due to another thread's write during the execution of contains_hashes, it can lead to an incorrect result.
Consider the scenario where thread B is inserting an item while thread A is checking the same item with contains_hashes. Because the check is not atomic across the entire vector, thread A can observe some of the item's bits already set and others not yet set, and so return false even though, by the time it returns, the insert has completed and a serial execution would have returned true.
The problem is compounded when more threads are concurrently modifying entries that other threads have already checked, and it is exacerbated when the data contains a high proportion of true positives (duplicates). This is particularly concerning because Bloom filters are not supposed to produce false negatives, and the issue gets worse as the thread count grows or as the underlying data becomes more duplicate-heavy.
In summary, the code exhibits a race condition in which the result of the contains_hashes method can be affected by concurrent updates to the bits vector. The individual accesses to AtomicU32 values are atomic, but the method's logic requires a consistent view across multiple AtomicU32 values, and that consistency is not guaranteed.
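To make the interleaving concrete, here is a minimal sketch of the pattern being described; this is not the actual dolma code, and the struct, fields, and indexing scheme are simplified assumptions:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Simplified stand-in: each hash selects one bit in a shared Vec<AtomicU32>.
struct TinyBloom {
    bits: Vec<AtomicU32>,
}

impl TinyBloom {
    fn insert_hashes(&self, hashes: &[u64]) {
        let nbits = self.bits.len() * 32;
        for &h in hashes {
            let idx = (h as usize) % nbits;
            // Each fetch_or is atomic on its own word...
            self.bits[idx / 32].fetch_or(1 << (idx % 32), Ordering::Relaxed);
        }
    }

    fn contains_hashes(&self, hashes: &[u64]) -> bool {
        let nbits = self.bits.len() * 32;
        hashes.iter().all(|&h| {
            let idx = (h as usize) % nbits;
            // ...but this loop reads the words one at a time, so a concurrent
            // insert_hashes for the same item can be observed half-applied:
            // some of its bits set, others not yet, producing a false negative
            // relative to a serial execution.
            self.bits[idx / 32].load(Ordering::Relaxed) & (1 << (idx % 32)) != 0
        })
    }
}
```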
To correct this, a higher-level synchronization mechanism (e.g., a read-write lock over the whole Vec) might be required to ensure that the entire check in contains_hashes is atomic with respect to updates to the bits vector.
This is the approach taken by another Rust crate, ofilter.
Relevant source
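As a rough sketch of that lock-based alternative (a plain bit vector behind an RwLock is assumed here; this is not ofilter's actual API or locking granularity):

```rust
use std::sync::RwLock;

// Lock-based variant: the whole bit vector is guarded, so a membership check
// observes either all or none of a concurrent insert's bits.
struct LockedBloom {
    bits: RwLock<Vec<u32>>,
}

impl LockedBloom {
    fn insert_hashes(&self, hashes: &[u64]) {
        let mut bits = self.bits.write().unwrap();
        let nbits = bits.len() * 32;
        for &h in hashes {
            let idx = (h as usize) % nbits;
            bits[idx / 32] |= 1 << (idx % 32);
        }
    }

    fn contains_hashes(&self, hashes: &[u64]) -> bool {
        let bits = self.bits.read().unwrap();
        let nbits = bits.len() * 32;
        hashes.iter().all(|&h| {
            let idx = (h as usize) % nbits;
            bits[idx / 32] & (1 << (idx % 32)) != 0
        })
    }
}
```

The trade-off is that every insert now serializes on the write lock, which is exactly the performance cost the current lock-free design avoids.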
This is the approach taken by Google's Guava library (Java).
Their solution is neither complete nor comprehensive; it makes a number of trade-offs in the data structures, the types, and the methods operating on them, and relies on tests to validate that those trade-offs do not introduce too much error.
(I do not like this approach, but apparently it is good enough for them.)
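For comparison, here is a rough Rust sketch of that lock-free, tolerate-the-race style (not a translation of Guava's code, nor of dolma's): insert_hashes reports whether it flipped any bit, folding the "have we seen this before?" decision into the write itself instead of a separate check-then-insert:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

struct LockFreeBloom {
    bits: Vec<AtomicU32>,
}

impl LockFreeBloom {
    /// Returns true if the item was (probably) new, i.e. at least one of its
    /// bits was not set before this call.
    fn insert_hashes(&self, hashes: &[u64]) -> bool {
        let nbits = self.bits.len() * 32;
        let mut newly_set = false;
        for &h in hashes {
            let idx = (h as usize) % nbits;
            let mask = 1u32 << (idx % 32);
            // fetch_or returns the previous word, so per word we know
            // atomically whether this call flipped the bit.
            let prev = self.bits[idx / 32].fetch_or(mask, Ordering::Relaxed);
            newly_set |= prev & mask == 0;
        }
        newly_set
    }
}
```

Two threads inserting the same item concurrently can still both come away with newly_set == true when their writes interleave across different words, so an occasional duplicate slips through; that residual error is the trade-off being accepted as "good enough".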