Number of open connections resulting in network errors #384

Open
rob-ellison-jet opened this issue Dec 2, 2024 · 10 comments

@rob-ellison-jet

We've experienced issues with Snowbridge: errors due to connection reset by peer, and also 404 responses on the http target. The problem is caused by the number of open connections on the pods.

Source = kinesis
Target = http

The http target in our case is Google Tag Manager Server-side.

Generate a constant load of approximately 4K records per minute onto the kinesis stream. After a few hours you should start to get either connection reset by peer from AWS or 404s from GTM.

Check the number of open connections on the box.

$ netstat -atn | grep TIME_WAIT | wc -l
45269

We're currently using version 2.3.0. The main culprit appears to be the deferred resp.Body.Close() call in the http target's Write function (shown in the patch below).

Applying the following patch drastically reduces the number of open connections:

diff --git a/pkg/target/http.go b/pkg/target/http.go
index 30c293e..6c3cfca 100644
--- a/pkg/target/http.go
+++ b/pkg/target/http.go
@@ -11,6 +11,7 @@ import (
 	"bytes"
 	"encoding/json"
 	"fmt"
+	"io"
 	"net/http"
 	"net/url"
 	"time"
@@ -214,7 +215,8 @@ func (ht *HTTPTarget) Write(messages []*models.Message) (*models.TargetWriteResu
 			failed = append(failed, msg)
 			continue
 		}
-		defer resp.Body.Close()
+		io.Copy(io.Discard, resp.Body)
+		resp.Body.Close()
 		if resp.StatusCode == http.StatusOK {
 			sent = append(sent, msg)
 			if msg.AckFunc != nil { // Ack successful messages
With the patch applied, the same check on the box shows far fewer open connections:

/ $ netstat -ant | grep TIME_WAIT | grep 172.20 | wc -l
3799
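
For context on why the patch helps: Go's net/http transport can only return a connection to its keep-alive pool once the response body has been read to EOF and closed, and a deferred Close inside the batch loop keeps every body open until Write returns. Below is a minimal standalone sketch of the drain-and-close pattern the patch uses; the names and URL are illustrative and not from the Snowbridge codebase.

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// drainAndClose reads the remainder of the response body and closes it right
// away. Draining to EOF lets the transport reuse the underlying TCP connection
// for keep-alive instead of discarding it (which is what leaves sockets piling
// up in TIME_WAIT), and closing immediately, rather than via defer inside a
// loop, avoids holding every body open until the surrounding function returns.
func drainAndClose(body io.ReadCloser) {
	io.Copy(io.Discard, body)
	body.Close()
}

// send posts a single payload and reports only the status code; the response
// body is discarded via drainAndClose so the connection can be reused.
func send(client *http.Client, url string, payload []byte) (int, error) {
	resp, err := client.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		return 0, err
	}
	drainAndClose(resp.Body)
	return resp.StatusCode, nil
}

func main() {
	client := &http.Client{}
	for i := 0; i < 3; i++ {
		status, err := send(client, "https://example.com/collect", []byte(`{"n":1}`))
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		fmt.Println("status:", status)
	}
}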

In staging we have also tested the latest version, 3.0.0, and experienced similar issues there.

colmsnowplow (Collaborator) commented Dec 2, 2024

Many thanks for investigating and reporting @rob-ellison-jet.

I’m happy to get your fix into the next release - if you’d like to be credited, could you make a PR based on develop and sign the CLA (if you haven’t signed before, a bot will comment on your PR)?

Alternatively if you’d prefer we just take it from here, let me know - I can make a quick PR myself.

We have a bit of ongoing testing for the other features in 3.1.0 before we make a full release, but I’ll gladly push an rc image to unblock you in the meantime.

Thanks again for reporting!

@rob-ellison-jet (Author)

Hi @colmsnowplow,

I've created the PR here:
#385

I tried to add some tests too, but after trying a few things I couldn't get them working.

colmsnowplow (Collaborator) commented Dec 3, 2024

Super, thanks so much rob!

No hassle on tests - I'll take a look and will give it at least a manual test before I push a release candidate. Hoping to fit that in today if all goes well, bear with me.

colmsnowplow (Collaborator) commented Dec 3, 2024

@rob-ellison-jet I have now deployed the fix to the 3.1.0-rc2 image, available on Docker Hub.

Note that I have only manually tested it locally so far - I won’t get to deploying and testing under load today, but I’ll let you know once I’ve got that done. Shouldn't be more than a couple of days at worst.

rob-ellison-jet (Author) commented Dec 4, 2024

Thanks @colmsnowplow. I haven't tested the change against 3.0.0 under load yet - I have only run a load test against the patched 2.3.0.

Currently we can't use 3.1.0-rc2 in production. I could run a load test on it in staging though. I was planning to deploy the patched 2.3.0 for use until we have the license sorted. Are there any considerations for the build environment that we need to take into account? Can you think of anything we might miss?

colmsnowplow (Collaborator) commented Dec 4, 2024

> Currently we can't use 3.1.0-rc2 in production.

Ah, sorry I had totally missed the context of the licence! Apologies, I follow you now.

> Are there any considerations for the build environment that we need to take into account? Can you think of anything we might miss?

I recently tried to build and deploy a test asset from local, but ran into issues because my machine is an M1 Mac, and I struggled to build the linux-compatible image locally. This is not an insurmountable problem - there’s definitely a way to solve it - but I don’t know how hard that is, because I changed tack and just deployed from CI instead.

So I’m limited in how helpful I can be in guiding you in taking that approach. But I can suggest two options to make life easier:

  1. Try to run it on GH actions yourself

One approach you could take is to populate the dockerhub secrets in your forked repo and try deploying that way.

You’d probably need to comment out the integration test that uses an ngrok token (http target TLS test IIRC) - and probably also the snyk steps that require a token.

IMO if it’s workable, this is the best option from your perspective for now, because it gives you the ability to easily tweak stuff and deploy further patches should you need to do so as you iterate/test.

To trigger the deploy with the GH action file in the repo, you’d just need to make sure VERSION and the version variable in cmd/constants.go match, and then tag with that same version (a rough sketch of those steps follows after this list).

  2. Failing that, I’ll deploy an asset to unblock you

The above might be painful/difficult to make work - I’m not sure. If you run into issues, make a PR and I’ll be happy to bring it into a non-prod branch and deploy an image to our dockerhub for you.

The idea here would be that it’s just to facilitate unblocking you - because it’s a legacy version and we have our hands fairly full at the moment, I don’t think we’d be able to put time into testing that asset or working on progressing towards a release. (editing to clarify: we'll still progress the v3 changes to release, and test those, regardless. What we'd struggle to fit in is a 2.3.0-compatible legacy release)

If I understand things correctly, I think you might be ok with this as a temporary measure, since you’ve done a fair bit of testing already yourself.

If neither of these work for you let me know.
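
For what it's worth, here is a rough sketch of the tag-and-deploy steps for option 1, assuming the fork's GH Actions release workflow is triggered by a version tag; the version string below is purely illustrative.

# Keep the VERSION file and the version constant in cmd/constants.go in sync,
# then push a matching tag so the GH Actions deploy workflow fires.
echo "2.3.1-patch1" > VERSION
# ...edit cmd/constants.go so its version constant also reads "2.3.1-patch1"...
git add VERSION cmd/constants.go
git commit -m "Bump version for patched build"
git tag 2.3.1-patch1
git push origin 2.3.1-patch1   # pushing the tag triggers the deploy workflow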

> I could run a load test on it in staging though.

No need - we need to (and I want to!) load test it regardless - I was just being explicit about what had been tested. I didn’t realise you wouldn’t be able to push v3 to prod.

rob-ellison-jet (Author) commented Dec 4, 2024

If you could go with option 2, that would be the most risk-free option. I've had some difficulties trying to get option 1 working fully with all the e2e and integration tests.

This is the branch I have with the changes:
https://github.com/rob-ellison-jet/snowbridge/tree/Improvements-to-Connection-Pooling-2.3.0

If you could create a branch, I could open a PR against it.

rob-ellison-jet (Author) commented Dec 5, 2024

Hi @colmsnowplow, just wanted to check you saw this. I've already spent quite some time trying the first approach, and it is proving quite difficult. I totally get that creating a new release for an old version involves a lot of work.

Ignore me - I just spotted the image on Dockerhub!

@rob-ellison-jet (Author)

Hi @colmsnowplow, I tested the variant of the change from the PR that wraps the handling in a function and keeps a deferred Close on the body. This didn't work as well as the original solution suggested in the issue description.

Really sorry to ask again, but would you be able to upload an image from my latest branch change (I won't ask again)?
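
For anyone following along, the deferred variant being compared here presumably looked roughly like the sketch below (a guess at the shape of the change, not the exact PR code); the difference from the immediate drain-and-close in the original patch is that Close is deferred inside a per-message helper.

package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// sendOne wraps the per-message request in its own function so that the
// deferred Close fires when this function returns, rather than only once the
// whole batch has been written. Whether the tested variant also drained the
// body to EOF before closing isn't clear from the thread; without a drain the
// transport may not be able to reuse the connection.
func sendOne(client *http.Client, url string, payload []byte) (int, error) {
	resp, err := client.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	status, err := sendOne(&http.Client{}, "https://example.com/collect", []byte(`{"n":1}`))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("status:", status)
}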

@colmsnowplow (Collaborator)

Hey @rob-ellison-jet, sorry for the radio silence!

There have been parallel conversations among the business-side people, and they've let me know that there's now a plan to get you on the path to 3.1.x, which hopefully should be the simpler solution. Of course I'm happy to support further where needed - let's co-ordinate through that channel.

Leaving the issue open until we get the fix into the full 3.1.0 release.
