Enable ordered responses for ADS delta watches #752
This is only an initial review of the PR. I need to spend more time on the impact of this new shared channel.
My initial analysis is that there is a potential deadlock, because a watch can have queued a response before being cancelled. In some edge cases, when the client has sent many requests in quick succession, I believe this channel could end up overflowing the muxed one and then itself. The previous model guarantees that a full muxed channel does not block other processes (as each watch dequeues within its own goroutine), but this is no longer the case here.
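To make the overflow concern concrete: a buffered Go channel blocks the sender once its buffer is full, and if the only goroutine that drains the channel is the one trying to send, that send never completes. The sketch below is a minimal illustration, not go-control-plane code; `trySend` is a hypothetical helper that uses a `select` with a `default` case to report whether a send would have blocked instead of actually blocking.

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether it succeeded.
// A false result means a plain `ch <- v` would block at this point.
func trySend(ch chan<- string, v string) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	// Hypothetical shared channel sized for two resource types.
	shared := make(chan string, 2)

	fmt.Println(trySend(shared, "cds response")) // true
	fmt.Println(trySend(shared, "lds response")) // true
	// The buffer is now full. If this goroutine is also the only reader,
	// a blocking send here would deadlock.
	fmt.Println(trySend(shared, "another cds response")) // false
}
```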
pkg/server/delta/v3/server.go
```go
watch.responses = make(chan cache.DeltaResponse, 1)
if ordered {
	// Use the shared channel to keep the order of responses.
	watch.UseSharedResponseChan(watches.deltaMuxedResponses)
```
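With this change an ordered watch writes into a channel shared by the whole ADS stream instead of its own one-slot channel. The minimal sketch below shows why that preserves order, using hypothetical `watch` and `response` types and a hypothetical `useShared` method rather than the real go-control-plane ones: a Go channel is FIFO, so responses from different watches are read back in exactly the order they were enqueued, whereas with one channel per watch the relative order depends on which goroutine forwards first.

```go
package main

import "fmt"

// response is a hypothetical stand-in for cache.DeltaResponse.
type response struct{ typeURL, version string }

// watch is a hypothetical stand-in for a delta watch.
type watch struct {
	responses chan response
}

// newWatch gives the watch its own one-slot channel (unordered mode).
func newWatch() *watch {
	return &watch{responses: make(chan response, 1)}
}

// useShared redirects the watch to a stream-wide channel (ordered mode).
func (w *watch) useShared(ch chan response) {
	w.responses = ch
}

func main() {
	shared := make(chan response, 4)

	clusters, listeners := newWatch(), newWatch()
	clusters.useShared(shared)
	listeners.useShared(shared)

	// The application pushes clusters before listeners.
	clusters.responses <- response{"cluster", "v1"}
	listeners.responses <- response{"listener", "v1"}

	// A single reader drains the shared channel in enqueue order.
	for i := 0; i < 2; i++ {
		r := <-shared
		fmt.Println(r.typeURL, r.version)
	}
}
```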
Overall I like the state of this PR, but I think there is an issue with how the channel is buffered.
This channel can potentially receive multiple responses for each type (e.g. if a cache update fires while a new request is being processed, and that request tries to enqueue from the same goroutine). As the channel is only buffered for the number of types, this will deadlock.
The issue is that, in theory, there could be more than 2 responses per type, as the request and response channels are in a `select` (and are therefore not ordered). In that case, even a buffer of twice the number of types would not guarantee the absence of a deadlock.
I think this can be solved if we guarantee that enqueued responses are processed before new requests, but I'm not sure that would be simple.
Another possibility is, as for SotW, to purge the response queue before processing a request. In that case I believe a 2x buffer would be enough to guarantee there is no deadlock. I'm honestly not a huge fan, as I think it makes the state machine more complex, but it might be okay.
@alecholmez do you think this would cover this issue?
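The two options above both come down to handling queued responses before the next request. Go's `select` chooses pseudo-randomly among ready cases, so a single `select` over requests and responses gives no such priority. A common workaround, sketched below with a hypothetical `serve` function and plain strings instead of the real request and response types, is a non-blocking drain of the response channel before the blocking `select`. In this single-goroutine model each request enqueues one response, and that response is flushed before the next request is read, so a one-slot buffer never fills.

```go
package main

import "fmt"

// serve processes requests in a single goroutine, always flushing queued
// responses before reading the next request. handle is a hypothetical
// callback that enqueues a response for the request it receives.
func serve(requests <-chan string, responses chan string, handle func(string, chan<- string)) []string {
	var log []string
	for {
		// Priority step: a non-blocking receive drains any pending response.
		select {
		case r := <-responses:
			log = append(log, "send "+r)
			continue
		default:
		}

		// Nothing queued: block until a request (or a response) arrives.
		select {
		case r := <-responses:
			log = append(log, "send "+r)
		case req, ok := <-requests:
			if !ok {
				return log
			}
			log = append(log, "recv "+req)
			handle(req, responses)
		}
	}
}

func main() {
	requests := make(chan string, 2)
	requests <- "cds"
	requests <- "lds"
	close(requests)

	// One slot is enough here because each response is flushed before
	// the next request is handled.
	responses := make(chan string, 1)
	enqueue := func(req string, out chan<- string) { out <- req + "-response" }

	for _, line := range serve(requests, responses, enqueue) {
		fmt.Println(line)
	}
}
```

Running `main` prints the requests and responses strictly interleaved (`recv cds`, `send cds-response`, `recv lds`, `send lds-response`), because the drain step empties the response channel before the blocking `select` is reached.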
@valerian-roche If we handle the responses and requests in two different goroutines, will that solve the deadlock problem?
@valerian-roche I tried to follow your suggestion to guarantee that responses are processed before new requests. Could you please take another look?
Some tests failed because of this change. If this approach is acceptable, I'll go ahead and fix them.
Signed-off-by: huabing zhao
LGTM, thanks a lot for your patience on this!
Were you able to test this in a running environment? I might be able to test it on some of our test systems next week if you cannot easily run it.
Can you expand the description to briefly explain the implementation and the reasoning behind it, in case people need to look at it later on?
@valerian-roche Thanks. I can test it, since I have already encountered this issue in my local development environment, but it may take a few days: I'll be attending KubeCon CN next week and I'm preparing my talk. I've updated the description with the reasoning and the implementation details.
@valerian-roche Tested in my local development environment.
Thanks @zhaohuabing for fixing this. Can we get this merged? Envoy Gateway (based on delta xDS) needs this.
* Bring in go-control-plane fixes

This import brings in envoyproxy/go-control-plane#752, which should ensure that resources pushed via delta ADS follow a specific order so the traffic chain doesn't transiently break.

Signed-off-by: Arko Dasgupta

* rerun go generate

Signed-off-by: Arko Dasgupta
Fix #705
Why do we need this?
go-control-plane doesn't guarantee the order of responses for delta ADS. If there are two requests, the response to the second request may be sent before the response to the first one. This causes problems for us, since our application generates xDS resources for clusters and listeners in sequence. Sometimes the listener resources are sent to Envoy before the cluster resources. As a result, Envoy complains that it can't find the clusters referenced in the listener configuration and fails to apply the configuration.
How does this PR fix this issue?
This PR follows an approach similar to #544 to enable ordered responses for ADS delta watches.
- Use a shared `deltaMuxedResponses` channel to guarantee that the order of responses isn't changed.
- The `deltaMuxedResponses` channel is created with a buffer size of 2x the number of types.
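The sizing rule in the last item can be sketched as follows, assuming a hypothetical `newMuxedResponses` helper, plain strings for responses, and four standard Envoy v3 type URLs chosen for illustration. Following the review discussion, with pending responses drained before each new request, at most two responses per type (for example, one triggered by a cache update and one by an incoming request) are assumed to be queued at once, so a buffer of `2 * len(typeURLs)` lets all of them be enqueued without blocking.

```go
package main

import "fmt"

// newMuxedResponses sizes the stream-wide channel at two slots per type.
func newMuxedResponses(typeURLs []string) chan string {
	return make(chan string, 2*len(typeURLs))
}

func main() {
	typeURLs := []string{
		"type.googleapis.com/envoy.config.cluster.v3.Cluster",
		"type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
		"type.googleapis.com/envoy.config.listener.v3.Listener",
		"type.googleapis.com/envoy.config.route.v3.RouteConfiguration",
	}
	muxed := newMuxedResponses(typeURLs)

	// Worst case assumed by the sizing: two pending responses per type.
	for _, t := range typeURLs {
		muxed <- t + " (cache update)"
		muxed <- t + " (request)"
	}
	fmt.Println(len(muxed), cap(muxed)) // 8 8
}
```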