Add test for elasticsearch re-connection after network error & allow graceful shutdown #40794
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
LGTM, but I have a question. I'll approve once it's answered
Do we have proof that this actually solves these two bugs?
I see the original test proving we can recover from a network error, but nothing proving that the Beat shuts down quickly if one of its outgoing network requests is long-running.
I have not had success reproducing those bugs manually. I tried with an older version of Filebeat that should still have the issue. However, I didn't go through the whole Windows service manager; I was only trying to get Filebeat stuck during its shutdown process. I can try once more to reproduce them and craft a test to ensure the bugs are fixed.
@marc-gr could you please review this PR?
I also removed the backport labels for 8.15 and 8.16. This PR is rather large and the FF has passed. If needed, we can backport it to an 8.16 patch release after more extensive testing.
This does fix #40928 as it was written, but it doesn't fix all of the related issues. The scope of #40928 is only the regression test for the 8.15.1 bug where we would not reconnect after an error. You do have a test for that in this PR, so you can close it. The bugs we can't close are #40518 and #38666, which have no test proving they were fixed. I think the cause of these is that no context was propagated from the SIGINT handler to the ES output to allow interrupting a long-running connection. I think you wired this up, but we haven't actually confirmed that it fixed the problem, so we can't close them.
…graceful shutdown (#40794) This commit reworks the `eslegclient.Connection` to accept a context in its `Connect` method. This allows the caller to cancel any in-flight requests made by the connection by cancelling the context. The libbeat `outputs.Connectable` interface (used by `outputs.NetworkClient`) had to be updated to accept the context, which required refactoring most of the outputs to also accept a context on connect. The worker from libbeat/publisher/pipeline/client_worker.go now uses a context for its cancellation instead of a channel; this context is also used when creating a connection to Elasticsearch. An integration test is added to ensure the ES output can always recover from network errors. (cherry picked from commit 4dfef8b)
…graceful shutdown (#40794) (#41454) (cherry picked from commit 4dfef8b) Co-authored-by: Tiago Queiroz
Proposed commit message
This commit reworks the `eslegclient.Connection` to accept a context in its `Connect` method. This allows the caller to cancel any in-flight requests made by the connection by cancelling the context.

The libbeat `outputs.Connectable` interface (used by `outputs.NetworkClient`) had to be updated to accept the context, which required refactoring most of the outputs to also accept a context on connect.

The worker from libbeat/publisher/pipeline/client_worker.go now uses a context for its cancellation instead of a channel; this context is also used when creating a connection to Elasticsearch.

An integration test is added to ensure the ES output can always recover from network errors.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding changes to the default configuration files
`CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
Disruptive User Impact
It's a bug fix; there is no disruptive user impact.
How to test this PR locally
Related issues