
Added reconnection logic when NATS is disconnected #49

Merged: 1 commit merged into openfaas:master on Jan 29, 2019

Conversation

@bartsmykla (Contributor)

It should resolve issue openfaas/faas#1031, and it also resolves issues from #33.

Signed-off-by: Bart Smykla [email protected]

Description

Motivation and Context

  • I have raised an issue to propose this change (required)

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I've read the CONTRIBUTION guide
  • I have signed-off my commits with git commit -s
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@kozlovic left a comment

Yes, this is what I had in mind for the mutex. I would still return the result from nc.Publish() directly. Also, you did not address the logging when exhausting the number of reconnect attempts. Granted, if the attempts are statically defined and you do log reconnect attempts, the absence of a success log implies the failure. But if you end up making the attempt count user-settable, then I would recommend logging the total failure.
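A minimal sketch of the pattern described above: a mutex-guarded connection, Publish's result returned directly, and an explicit log when the attempts run out. The field and method names (ncMutex, maxReconnect, delayBetweenReconnect, connect) are illustrative assumptions, not the merged code, and the import path assumes the current stan.go client (the project used go-nats-streaming at the time):

```go
package handler

import (
	"log"
	"sync"
	"time"

	stan "github.com/nats-io/stan.go"
)

// NatsQueue sketches the shape of the publisher; fields are assumptions.
type NatsQueue struct {
	nc                    stan.Conn
	ncMutex               sync.RWMutex
	maxReconnect          int
	delayBetweenReconnect time.Duration
	connect               func() (stan.Conn, error)
}

// Queue reads the current connection under the mutex and returns the
// result of Publish directly, as suggested above.
func (q *NatsQueue) Queue(topic string, body []byte) error {
	q.ncMutex.RLock()
	nc := q.nc
	q.ncMutex.RUnlock()
	return nc.Publish(topic, body)
}

// reconnect logs each attempt, swaps in the new connection under the
// mutex on success, and logs the total failure once attempts run out.
func (q *NatsQueue) reconnect() {
	for i := 0; i < q.maxReconnect; i++ {
		log.Printf("Reconnect attempt %d/%d", i+1, q.maxReconnect)
		if nc, err := q.connect(); err == nil {
			q.ncMutex.Lock()
			q.nc = nc
			q.ncMutex.Unlock()
			return
		}
		time.Sleep(q.delayBetweenReconnect)
	}
	log.Printf("Failed to reconnect after %d attempts", q.maxReconnect)
}
```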

@bartsmykla force-pushed the bugfix/nats-reconnection branch 2 times, most recently from 8ce810d to 8ea739e on January 14, 2019 at 16:45
@bartsmykla (Contributor, Author)

@kozlovic I think all of your comments are addressed now. Can you check again, please?

@bartsmykla (Contributor, Author)

@alexellis if we want to have the two values (maxReconnect and delayBetweenReconnect) as part of NatsConfig, we'll need to change it in the gateway. I have created a new function with defaults, NewDefaultNatsConfig.
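A hedged sketch of what such a config could look like; the field names and the concrete default values here are placeholders for illustration, not the merged code:

```go
package handler

import "time"

// NatsConfig sketches the two user-facing values discussed above.
type NatsConfig struct {
	MaxReconnect          int
	DelayBetweenReconnect time.Duration
}

// NewDefaultNatsConfig returns defaults; the numbers are assumptions.
func NewDefaultNatsConfig() NatsConfig {
	return NatsConfig{
		MaxReconnect:          120,
		DelayBetweenReconnect: 2 * time.Second,
	}
}
```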

@kozlovic

@bartsmykla I can comment only on the usage of the NATS Streaming API and that looks good to me!

@alexellis (Member) left a comment

Looking very positive so far. Would this work also need to be done in main.go for the subscriber (nats-queue-worker itself)?

@alexellis (Member)

Hi Bart, good work on this 👍

Ivan, thanks so much for your help here.

Do you have a Docker image pushed that I can use to test the change locally?

Alex

@alexellis (Member)

Quick question: what output do you get in the logs when you scale the gateway from 1 to 0 replicas? Do you see it try to reconnect?

Alex

@bartsmykla (Contributor, Author)

@alexellis the image is: bartsmykla/gateway:0.0.15

@bartsmykla (Contributor, Author)

@alexellis the change is for the gateway, so when it's scaled down to 0, there are no logs related to reconnection

@alexellis (Member)

> the change is for the gateway, so when it's scaled down to 0, there are no logs related to reconnection

Yes, that's why I asked you to scale the gateway down and watch the logs. Could you please try that and let us know if any logs about reconnecting are generated? This question relates to Ivan's comment on shutdown.

@bartsmykla (Contributor, Author) commented Jan 15, 2019

They are not (I checked that before writing my comment). There are no logs afterwards; the container is just killed and that's it.

How was it tested:
I deployed this version of nats-queue-worker and then simulated NATS disconnection.

Signed-off-by: Bart Smykla <[email protected]>
@alexellis alexellis merged commit c41f9b1 into openfaas:master Jan 29, 2019
@bartsmykla bartsmykla deleted the bugfix/nats-reconnection branch January 29, 2019 14:44
@alexellis (Member)

Thanks for your work on this, Bart, and @kozlovic for lending your expertise. This is now merged for the producer (gateway) and will be released.

@alexellis (Member) commented Jan 29, 2019

I've tested out the changes with a new gateway and it seems like the reconnect doesn't work with the in-memory store at all. Is this to be expected?

Also, the outbound buffer seemed to fill up even with a tiny request while the NATS connection was severed?

func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | NatsQueue - submitting request: figlet.
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | 2019/01/29 15:22:17 Forwarded [POST] to /async-function/figlet - [202] - 0.004220 seconds
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | 2019/01/29 15:22:17 Forwarded [POST] to /system/async-report - [202] - 0.000080 seconds
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | 2019/01/29 15:22:17 Forwarded [POST] to /system/async-report - [202] - 0.000178 seconds
^Z
[2]+  Stopped                 docker service logs func_gateway -f
alexellis-a02:faas alexellis$ bg
[2]+ docker service logs func_gateway -f &
alexellis-a02:faas alexellis$ 
alexellis-a02:faas alexellis$ 
alexellis-a02:faas alexellis$ 
alexellis-a02:faas alexellis$ 
alexellis-a02:faas alexellis$ docker service scale func_nats=0
func_nats scaled to 0
overall progress: 0 out of 0 tasks 
verify: Service converged 
alexellis-a02:faas alexellis$ curl http://127.0.0.1:8080/async-function/figlet -d alex
nats: outbound buffer limit exceededfunc_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | 2019/01/29 15:22:54 Forwarded [POST] to /async-function/figlet - [500] - 0.000362 seconds
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | NatsQueue - submitting request: figlet.
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | nats: outbound buffer limit exceeded
alexellis-a02:faas alexellis$ curl http://127.0.0.1:8080/async-function/figlet -d alex
nats: outbound buffer limit exceededfunc_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | NatsQueue - submitting request: figlet.
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | nats: outbound buffer limit exceeded
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | 2019/01/29 15:22:55 Forwarded [POST] to /async-function/figlet - [500] - 0.000229 seconds
alexellis-a02:faas alexellis$ curl http://127.0.0.1:8080/async-function/figlet -d alex
nats: outbound buffer limit exceededfunc_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | 2019/01/29 15:22:55 Forwarded [POST] to /async-function/figlet - [500] - 0.000147 seconds
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | NatsQueue - submitting request: figlet.
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | nats: outbound buffer limit exceeded
alexellis-a02:faas alexellis$ curl http://127.0.0.1:8080/async-function/figlet -d alex
nats: outbound buffer limit exceededfunc_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | NatsQueue - submitting request: figlet.
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | nats: outbound buffer limit exceeded
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | 2019/01/29 15:22:56 Forwarded [POST] to /async-function/figlet - [500] - 0.000340 seconds
alexellis-a02:faas alexellis$ 
alexellis-a02:faas alexellis$ 
alexellis-a02:faas alexellis$ 
alexellis-a02:faas alexellis$ docker service logs func_gateway -f
alexellis-a02:faas alexellis$ docker service scale func_nats=1
func_nats scaled to 1
overall progress: 1 out of 1 tasks 
1/1: running   [==================================================>] 
verify: Service converged 
alexellis-a02:faas alexellis$ curl http://127.0.0.1:8080/async-function/figlet -d alex
stan: connection closedfunc_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | 2019/01/29 15:23:15 Forwarded [POST] to /async-function/figlet - [500] - 0.000109 seconds
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | NatsQueue - submitting request: figlet.
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | stan: connection closed
alexellis-a02:faas alexellis$ curl http://127.0.0.1:8080/async-function/figlet -d alex
stan: connection closedfunc_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | NatsQueue - submitting request: figlet.
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | stan: connection closed
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | 2019/01/29 15:23:16 Forwarded [POST] to /async-function/figlet - [500] - 0.000456 seconds
alexellis-a02:faas alexellis$ curl http://127.0.0.1:8080/async-function/figlet -d alex
stan: connection closedfunc_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | NatsQueue - submitting request: figlet.
alexellis-a02:faas alexellis$ func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | stan: connection closed
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | 2019/01/29 15:23:20 Forwarded [POST] to /async-function/figlet - [500] - 0.000239 seconds

The result after scaling NATS Streaming to 0 and back to 1 is:

func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | NatsQueue - submitting request
func_gateway.1.ftr96nc16pwc@linuxkit-025000000001    | stan: connection closed

@kozlovic

If there is a request (calling Queue()) while the server is down and the connection is marked as closed, it is expected that the Publish() call would fail. How the returned error makes your software behave, that I don't know. It is also possible that even if the reconnect occurs, Queue() still has a reference to the previous (now closed) connection and goes into a Publish() that would also result in an error.
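A sketch of the stale-reference pitfall described above, assuming a connection captured once at construction; the type name is illustrative:

```go
package handler

import stan "github.com/nats-io/stan.go"

type staleQueue struct {
	nc stan.Conn // captured once at construction, never refreshed
}

func (q *staleQueue) Queue(topic string, body []byte) error {
	// After an outage this is still the old, closed connection, so Publish
	// returns "stan: connection closed" even if a fresh connection exists
	// elsewhere in the program.
	return q.nc.Publish(topic, body)
}
```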

@bartsmykla (Contributor, Author)

I'll test it tomorrow, but before the NATS server reconnects, stan: connection closed is expected

@kozlovic

Actually, you should get an error like nats: outbound buffer limit exceeded, and this is because we set the reconnect buffer to -1 so as not to buffer data while disconnected. And since by default the NATS connection created by STAN is set to reconnect forever, if you get connection closed, it could mean that your program closes the connection (maybe in response to a publish error?), which would prevent it from "reconnecting", since the ConnectionLostHandler is, as documented, not invoked on an explicit connection close. So make sure that you are not explicitly closing the connection before the server has been restarted.
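A hedged sketch of the setup described above, assuming the current nats.go/stan.go clients: the NATS layer buffers nothing while disconnected (so Publish fails fast) and reconnects forever, and the ConnectionLostHandler fires only on an unexpected loss, never after an explicit Close(). The cluster ID, client ID, and URL are placeholders:

```go
package main

import (
	"log"

	nats "github.com/nats-io/nats.go"
	stan "github.com/nats-io/stan.go"
)

func connect() (stan.Conn, error) {
	nc, err := nats.Connect("nats://nats:4222",
		nats.ReconnectBufSize(-1), // fail Publish while disconnected instead of buffering
		nats.MaxReconnects(-1),    // keep reconnecting at the NATS layer
	)
	if err != nil {
		return nil, err
	}

	return stan.Connect("faas-cluster", "faas-publisher",
		stan.NatsConn(nc),
		stan.SetConnectionLostHandler(func(_ stan.Conn, reason error) {
			// Not invoked if the program calls Close() itself.
			log.Printf("NATS Streaming connection lost: %v", reason)
		}),
	)
}
```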
