Add test for elasticsearch re-connection after network error & allow graceful shutdown #40794

Merged
belimawr merged 26 commits into elastic:main from fix-es-connection-issue on Oct 25, 2024

Conversation

belimawr
Contributor

@belimawr belimawr commented Sep 12, 2024

Proposed commit message

This commit reworks the eslegclient.Connection to accept a context in its Connect method. This allows the caller to cancel any in-flight requests made by the connection by cancelling the context.

The libbeat outputs.Connectable interface (used by outputs.NetworkClient) had to be updated to accept the context, which required refactoring in most of the outputs to also accept a context on connect.

The worker from libbeat/publisher/pipeline/client_worker.go now uses a context for its cancellation instead of a channel. This context is also used when creating a connection to Elasticsearch.

An integration test is added to ensure the ES output can always recover from network errors.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

It's a bug fix; there is no disruptive user impact.

## Author's Checklist

How to test this PR locally

  1. Build Filebeat
  2. Get it sending data to ES
  3. Disconnect from the network, stop ES, do anything that will prevent Filebeat from reaching ES
  4. Wait for network error logs
  5. Re-start ES/reconnect to the network
  6. Filebeat should recover and start sending data again.

Related issues

## Use cases
## Screenshots
## Logs

@belimawr belimawr added the skip-ci Skip the build in the CI but linting label Sep 12, 2024
@belimawr belimawr self-assigned this Sep 12, 2024
@belimawr belimawr requested review from a team as code owners September 12, 2024 17:07
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Sep 12, 2024
Contributor

mergify bot commented Sep 12, 2024

This pull request now has conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-es-connection-issue upstream/fix-es-connection-issue
git merge upstream/main
git push upstream fix-es-connection-issue

Contributor

mergify bot commented Sep 12, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch, where \d is the digit

Contributor

mergify bot commented Sep 12, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Sep 12, 2024
@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Sep 12, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 12, 2024
@belimawr belimawr added needs_team Indicates that the issue/PR needs a Team:* label and removed skip-ci Skip the build in the CI but linting labels Sep 12, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 12, 2024
@belimawr belimawr added the backport-8.15 Automated backport to the 8.15 branch with mergify label Sep 12, 2024
Member

@AndersonQ AndersonQ left a comment

LGTM, but I have a question. I'll approve once it's answered

libbeat/tests/integration/elasticsearch_test.go (outdated, resolved)
@cmacknz
Member

cmacknz commented Oct 23, 2024

Do we have proof that this actually solves these two bugs?

I see the original test proving we can recover from a network error, but nothing proving that the Beat shuts down quickly if one of its outgoing network requests is long running.

@belimawr
Contributor Author

belimawr commented Oct 23, 2024

> Do we have proof that this actually solves these two bugs?
>
> I see the original test to prove we can recover from a network error, but not anything proving that the Beat shut down quickly if one of its outgoing network requests is long running.

I have not been able to reproduce those bugs manually. I tried an older version of Filebeat that should still have the issue, but I didn't go through the whole Windows service manager; I was only trying to get Filebeat stuck during its shutdown process.

I can try it once more to reproduce and craft a test to ensure the bugs are fixed.

@pierrehilbert
Collaborator

@marc-gr could you please review this PR?

@belimawr
Contributor Author

I've unlinked this PR from #40928. While the refactoring here should fix the issues mentioned in #40928, we still need tests to confirm it. This PR is already rather large, so let's merge it as is and write the tests for #40928 in a separate PR.

@belimawr belimawr removed backport-8.15 Automated backport to the 8.15 branch with mergify backport-8.16 Automated backport with mergify labels Oct 24, 2024
@belimawr
Contributor Author

I also removed the backports to 8.15 and 8.16. This PR is rather large and the FF has passed. If needed, we can backport it to an 8.16 patch release after more extensive testing.

@cmacknz
Member

cmacknz commented Oct 24, 2024

This does fix #40928 as it was written, but it doesn't fix all of the related issues.

The scope of #40928 is only the regression test for the 8.15.1 bug where we would not reconnect after an error. You do have a test for that in this PR, so you can close it.

The bugs we can’t close are #40518 and #38666, which have no test proving they were fixed. I think the cause of these is that no context was propagated from the SIGINT handler to the ES output to allow interrupting a long-running connection. I think you wired this up, but we haven’t actually confirmed that it fixed the problem, so we can’t close them.

@belimawr belimawr merged commit 4dfef8b into elastic:main Oct 25, 2024
180 of 183 checks passed
@belimawr belimawr deleted the fix-es-connection-issue branch October 25, 2024 14:50
mergify bot pushed a commit that referenced this pull request Oct 25, 2024
…graceful shutdown (#40794)

This commit reworks the `eslegclient.Connection` to accept a context in its `Connect` method. This allows the caller to cancel any in-flight requests made by the connection by cancelling the context.

The libbeat `outputs.Connectable` interface (used by `outputs.NetworkClient`) had to be updated to accept the context, which required refactoring in most of the outputs to also accept a context on connect.

The worker from libbeat/publisher/pipeline/client_worker.go now uses a context for its cancellation instead of a channel. This context is also used when creating a connection to Elasticsearch.

An integration test is added to ensure the ES output can always recover from network errors.

(cherry picked from commit 4dfef8b)
pierrehilbert pushed a commit that referenced this pull request Oct 25, 2024
…graceful shutdown (#40794) (#41454)

(cherry picked from commit 4dfef8b)

Co-authored-by: Tiago Queiroz <[email protected]>
Labels
  • backport-8.x: Automated backport to the 8.x branch with mergify
  • Team:Elastic-Agent-Data-Plane: Label for the Agent Data Plane team
  • Team:obs-ds-hosted-services: Label for the Observability Hosted Services team
  • Team:Security-Linux Platform: Linux Platform Team in Security Solution
  • Team:Security-Windows Platform: Windows Platform Team in Security Solution
Development

Successfully merging this pull request may close these issues.

Regression test for recovery after Elasticsearch output connection failure
10 participants