-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: provide a more precise detection of empty state in MPSCIntrusiveQueue #337
Conversation
f1fea73
to
af2b03b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. You overcomplicated the description a little bit 🙂
From what I understood from diving in myself, if producer A gets suspended while in the cut, then producer B can finish Push()ing and even establish a valid happens-before relationship with consumer C via another variable, but the queue will still report as empty.
base/mpsc_intrusive_queue.h
Outdated
// Until (*) completes, the chain is cut at `prev` and Pop can not reach the item | ||
// and its subsequent items. | ||
MPSC_intrusive_store_next(prev, item); // release (*) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did you change it to release?
reordering is impossible because of the data dependency -> prev
is used as input to store_next
base/mpsc_intrusive_queue.h
Outdated
// Empty state. | ||
// Caveat: if Push() called on an empty queue but has not crossed the blocking point yet, | ||
// we may reach this condition because head_ is a stub and stub.next is nullptr. | ||
// Unfortunately it may lead to a bizarre scenario where the arbitrary number of | ||
// subsequent pushes will fully complete, but the queue will still be observerd | ||
// as empty by the consumer because the chain will be cut by the Push that is stuck updating | ||
// the stub. | ||
// | ||
// More comments: if we had a single Push that is not completed yet, then returning | ||
// an empty state is fine. The problem arises when we have multiple pushes in parallel | ||
// and the first one has not completed yet, but others did. | ||
// Moreover, one of the pushes could even be linearized with this PopWeak call | ||
// using locks/barriers, and still would not be be observed. | ||
// | ||
// To disambiguite, we load the tail_ and check if it is the same as the head. | ||
// MO is not important because for operations that are externally linearized, | ||
// the order is already guaranteed. | ||
// To sum up: | ||
// 1. if tail is not head, it is quaranteed that the queue is not empty. | ||
// 2. If Push(Item) has finished, and was linearized with PopWeak somehow, | ||
// is guaranteed that we will observe the updated tail_. | ||
// 3. Otherwise, it's just an optimization that may or may not return the updated tail. | ||
// Remember that standard guarantees that stores will eventually be visible with any MO, | ||
// therefore we may see here stale (head), or the updated tail even if Push has finished. | ||
T* tail = tail_.load(std::memory_order_relaxed); | ||
return {nullptr, tail == head}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the long comment, but anyone reading this will become totally confused, including me about the memory order. We do the same operation just a few lines below with a one-line comment "middle of a push". I'd emphasize it's important for us to not be eventually consistent, but immediately disambiguate empty/in-progress state (because we can have finished pushes later in the chain) without re-defining all the terms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the memory order is not the issue here. The issue is that we want to avoid these situations for 100%, it's not possible with this API unless we linearize calls using some external conditions.
In other words, even with a single Push that is completed, it's not guaranteed that PopWeak returns non-empty status, because
store/load are not strongly consistent withing themselves. I wanted to empasize this. Please suggest the better comment :)
…Queue The scenario is around remote fiber activation. And it goes like this: 1. A fiber is blocked on EventCount.wait_until (i.e. timed wait) 2. The fiber times outs but in parallel another thread sends it a notification via the remote queue. 3. EventCount::wait_until tries to pull itself from the remote queue, but does not find itself there, even though the notification was pushed STRICTLY BEFORE `active->PullMyselfFromRemoteReadyQueue()` call. We know it because the notify call is under the EventCount spinlock, and PullMyselfFromRemoteReadyQueue is done after the spinlock was grabbed and released. The scenario happens when, the remote queue is empty but then something else tries to notify another, unrelated fiber in the thread, but does not finish crossing the blocking point (*). Then our fiber had notification being pushed into the queue. Then our fiber is waked by the timeout and it tries to pull itself from the queue just to learn the queue is empty, which contradicts the fact we observed BEFORE relationship between notify and PullMyselfFromRemoteReadyQueue call due to the spinlock in EventCount. This fix recognizes the stuck in the middle push by loading the tail and comparing it with head. Signed-off-by: Roman Gershman <[email protected]>
The scenario is around remote fiber activation. And it goes like this:
active->PullMyselfFromRemoteReadyQueue()
call. We know it because the notify call is under the EventCount spinlock, and PullMyselfFromRemoteReadyQueue is done after the spinlock was grabbed and released.The scenario happens when, the remote queue is empty but then something else tries to notify another, unrelated fiber in the thread, but does not finish crossing the blocking point (*). Then our fiber had notification being pushed into the queue. Then our fiber is waked by the timeout and it tries to pull itself from the queue just to learn the queue is empty, which contradicts the fact we observed BEFORE relationship between notify and PullMyselfFromRemoteReadyQueue call due to the spinlock in EventCount.
This fix recognizes the stuck in the middle push by loading the tail and comparing it with head.