
Implemented reconnection logic in queue-worker #52

Conversation


@bartsmykla bartsmykla commented Jan 15, 2019

It's the second part of #49, which should resolve openfaas/faas#1031 and #33

Signed-off-by: Bart Smykla [email protected]

Description

Implemented reconnection logic in queue-worker

Motivation and Context

  • I have raised an issue to propose this change (required)

How Has This Been Tested?

I have tested it by deploying it to my local cluster with and without
a persistence store in NATS (MySQL). I tried to simulate
disconnections from the NATS server by killing its containers
and then trying to invoke functions asynchronously.

Logs after I scaled the NATS server to zero, waited a few seconds, and scaled it back up again:

func_queue-worker.1.ve68xbsffrfq@linuxkit-025000000001    | Disconnected from nats://nats:4222
func_queue-worker.1.ve68xbsffrfq@linuxkit-025000000001    | Connecting to: nats://nats:4222
func_queue-worker.1.ve68xbsffrfq@linuxkit-025000000001    | Reconnection (1/5) to nats://nats:4222 failed
func_queue-worker.1.ve68xbsffrfq@linuxkit-025000000001    | Waiting 2s before next try
func_queue-worker.1.ve68xbsffrfq@linuxkit-025000000001    | Connecting to: nats://nats:4222
func_queue-worker.1.ve68xbsffrfq@linuxkit-025000000001    | Reconnection (2/5) to nats://nats:4222 failed
func_queue-worker.1.ve68xbsffrfq@linuxkit-025000000001    | Waiting 4s before next try
func_queue-worker.1.ve68xbsffrfq@linuxkit-025000000001    | Connecting to: nats://nats:4222
func_queue-worker.1.ve68xbsffrfq@linuxkit-025000000001    | Subscribing to: faas-request at nats://nats:4222
func_queue-worker.1.ve68xbsffrfq@linuxkit-025000000001    | Wait for  5m0s
func_queue-worker.1.ve68xbsffrfq@linuxkit-025000000001    | Listening on [faas-request], clientID=[faas-worker-453efcc20e04], qgroup=[faas] durable=[]
func_queue-worker.1.ve68xbsffrfq@linuxkit-025000000001    | Reconnection (3/5) to nats://nats:4222 succeeded

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I've read the CONTRIBUTION guide
  • I have signed-off my commits with git commit -s
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@bartsmykla
Contributor Author

@kozlovic I would be very pleased if you could look at this one too, as you did yesterday with #49

@kozlovic kozlovic left a comment

This time, you don't have concurrent access between connect/reconnect and a publish; however, you do possibly unsubscribe and close the connection from the signal handler.
So I would replace the individual connect and subscribe (and connectAndSubscribe) with something like setup() or init(), and have that hold the lock for the whole time.
NatsQueue.unsubscribe() and .close() would need to acquire the lock before attempting to do anything.
Finally, this time I would recommend making the sleep in reconnect() cancellable, so as not to block Ctrl+C.

@bartsmykla
Contributor Author

Thanks @kozlovic. I'll work on that tomorrow!

@bartsmykla bartsmykla force-pushed the feature/queue-worker-reconnection branch 2 times, most recently from 6190d33 to 5a1855f Compare January 16, 2019 14:23
@bartsmykla
Contributor Author

bartsmykla commented Jan 16, 2019

@kozlovic I think I have addressed everything from your comment. Would you be so kind as to review the PR once again? :-) Thanks for your time!

@kozlovic kozlovic left a comment

I think the for loop in reconnect could be simplified to remove the label and the continue. My preference is not to use context; that is just what it is, a preference. I think the context usage is correct, but again, it is not something I usually use, so I'm not too familiar with it.

@bartsmykla bartsmykla force-pushed the feature/queue-worker-reconnection branch from 5a1855f to 0c37ef2 Compare January 16, 2019 17:24
@kozlovic kozlovic left a comment

LGTM!

@alexellis
Member

I guess Reconnection should be Reconnecting.. ?

@@ -58,6 +58,26 @@ func (ReadConfig) Read() QueueWorkerConfig {
}
}

if value, exists := os.LookupEnv("faas_max_reconnect"); exists {
Member

Not sure we should have the faas prefix here.

Contributor Author

So, do you want me to remove it or not?

}
}
if err := natsQueue.closeConnection(); err != nil {
log.Panicf("Cannot close connection to %s because of an error: %v\n", natsQueue.natsURL, err)
Member

Should this be a panic or just logged?

Contributor Author

A panic should be fine: it will still unwind the stack, and it happens after we have received the signal to close everything
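
The claim is easy to verify: a panic runs deferred functions while the stack unwinds, so deferred cleanup (and any recover higher up) still executes. A small illustration of that behaviour (not the PR's code; the URL is taken from the log output earlier):

```go
package main

import (
	"fmt"
	"log"
)

// closeEverything simulates the shutdown path above: log.Panicf
// logs and then panics, but deferred functions still run while
// the stack unwinds.
func closeEverything() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered after cleanup:", r)
		}
	}()
	defer fmt.Println("deferred cleanup ran") // runs despite the panic
	log.Panicf("Cannot close connection to %s", "nats://nats:4222")
}

func main() {
	closeEverything()
	fmt.Println("stack unwound; process still alive")
}
```

Deferred functions run in LIFO order, so the cleanup print fires first, then the recover stops the panic from propagating.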

val, err := strconv.Atoi(value)

if err != nil {
log.Println("converting faas_max_reconnect to int error:", err)
Member

So what is the default otherwise? 0?

Contributor Author

Yes

reconnectDelayVal, durationErr := time.ParseDuration(value)

if durationErr != nil {
log.Println("parse env var: faas_reconnect_delay as time.Duration error:", durationErr)
Member

What's the default?

Contributor Author

0

main.go Outdated
@@ -49,17 +50,134 @@ func makeClient() http.Client {
return proxyClient
}

type NatsQueue struct {
Member

Can we put it in its own file to help with the diff and maintenance?

Contributor Author

With methods, or just plain struct?

connMutex: &sync.RWMutex{},
maxReconnect: config.MaxReconnect,
reconnectDelay: config.ReconnectDelay,
quitCh: make(chan struct{}),
Member

Whilst making changes for the other comments, can struct{} be exchanged for bool or int?

Contributor Author

I strongly disagree with that one. Using chan struct{} for signalling channels is the common pattern when you are not interested in the value being sent, because struct{} does not allocate any memory.
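
A quick illustration of the pattern being defended: chan struct{} carries no payload, and closing it broadcasts the signal to every receiver at once (the worker count here is illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	quitCh := make(chan struct{}) // struct{} values occupy zero bytes

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			<-quitCh // unblocks as soon as quitCh is closed
			fmt.Printf("worker %d shutting down\n", id)
		}(i)
	}

	close(quitCh) // one close wakes all three workers
	wg.Wait()
}
```

A bool or int channel would work too, but it implies a value matters, and a plain send only reaches one receiver, whereas close reaches them all.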

reconnectDelay time.Duration
conn stan.Conn
connMutex *sync.RWMutex
quitCh chan struct{}
Member

If quitCh is for a graceful shutdown and we're editing the file how about shutdownCh?

Member

Do we even need a suffix of ch in that instance?

Contributor Author

@bartsmykla bartsmykla Jan 29, 2019

With the ch suffix it's easier to understand when looking at the code, at least for me, and I have seen that pattern a lot in Go code. If you want, I can remove it, but I'll stand behind the idea of keeping the suffix.

@bartsmykla bartsmykla force-pushed the feature/queue-worker-reconnection branch from 0c37ef2 to 08424a2 Compare January 29, 2019 16:15
How it was tested:
I have tested it by deploying it to my local cluster with and without
a persistence store in NATS (MySQL), and I tried to simulate
disconnections from the NATS server by killing its containers
and trying to invoke functions asynchronously

Signed-off-by: Bart Smykla <[email protected]>
@bartsmykla bartsmykla force-pushed the feature/queue-worker-reconnection branch from 08424a2 to 1f14fad Compare January 29, 2019 16:19
@alexellis
Member

Hi @bartsmykla thanks for editing the commit. Do you know why the build failed?

@alexellis
Member

# github.com/openfaas/nats-queue-worker
./main.go:312:5: undefined: "github.com/openfaas/nats-queue-worker/nats".Init
FAIL	github.com/openfaas/nats-queue-worker [build failed]
=== RUN   Test_GetClientID_ContainsHostname
--- PASS: Test_GetClientID_ContainsHostname (0.00s)
=== RUN   TestCreategetClientID
--- PASS: TestCreategetClientID (0.00s)
=== RUN   TestCreategetClientIDWhenHostHasUnsupportedCharacters
--- PASS: TestCreategetClientIDWhenHostHasUnsupportedCharacters (0.00s)
PASS
ok  	github.com/openfaas/nats-queue-worker/handler	0.004s
=== RUN   TestGetClientID
--- PASS: TestGetClientID (0.00s)
=== RUN   TestGetClientIDWhenHostHasUnsupportedCharacters
--- PASS: TestGetClientIDWhenHostHasUnsupportedCharacters (0.00s)
PASS
ok  	github.com/openfaas/nats-queue-worker/nats	0.002s
The command '/bin/sh -c go test -v ./...' returned a non-zero code: 2
The command "./build.sh" exited with 2.

Please see above.

@alexellis
Member

I'm not sure why this line is in the PR, but it doesn't appear to work:

go nats.Init("http://" + config.NatsAddress + ":8222")

@bartsmykla
Contributor Author

@alexellis It shouldn't be there. While rebasing, I committed by mistake a line from nats-streaming Prometheus exporter testing. :-/

@alexellis
Member

OK. I'll remove it and fix another bug I found in the code via #54

@bartsmykla bartsmykla deleted the feature/queue-worker-reconnection branch February 13, 2019 09:18
Successfully merging this pull request may close these issues.

Investigate: NATS Streaming crash/lock-up