Graceful shutdown #71

atitan · 2019-08-22T02:32:27Z

Hello,

We made the transition from puma to falcon recently, and it works really well.

We're seeing some random Interrupt exception in the log, found out to be caused by auto scaling's scale in, which send signal to containers to stop them.

I'd like to know if falcon was capable of handling graceful shutdown like puma do?

ioquatix · 2019-08-22T03:32:18Z

Not yet but it’s on track for the 1.0 release. If you have some time to help out, maybe you can explain your use case in a bit more detail.

atitan · 2019-08-22T04:34:39Z

We're using AWS EC2+ECS for production.
Every EC2 instance has 4 cores, where ECS places one task(containers) to each of them.
In the container, Falcon is forked to 5 processes to keep maximum usage of CPUs.

For deployment, we use full rolling update, which starts new container on every instances while old one still running.
ELB+ECS will eventually drain old containers and switch to new ones, so the signal exception will not cause error here.

When it comes to instance auto scaling, it operates directly on instances because it is a feature in EC2.
At the time instance getting terminated, it may have requests still being processed.

Hope the described scenario would help.

ioquatix · 2019-08-22T06:57:58Z

Thanks, that is helpful.

Are you using falcon serve and sigint?

ioquatix · 2019-08-22T06:59:10Z

Also it would be great to add a list of websites using falcon, or have used falcon. If you want to make a PR, including a link to your website, we can add it to the README.md.

ioquatix · 2019-08-22T07:31:58Z

I've got a cold so I may be unattentive on this issue for the next week, but hopefully I can pick it up soon. I've just spent some time playing around with async-io to flesh out what is required support graceful shutdowns in general, then that can be used by falcon.

atitan · 2019-08-22T12:56:44Z

Take care.

We use dumb-init as container entrypoint to rewrite and forward SIGTERM with SIGINT.
It's because Puma once had issue handling SIGTERM, which makes GKE unhappy about container status.
Using SIGINT makes Puma to return 0 on exit.
The outcome seemed great so we adopted this practice on every project we had since then.

veesahni · 2021-11-01T02:16:39Z

I have similar need for a rolling restart. Slightly different than what @atitan described, so I'd like to describe the use case:

We have nginx proxying to multiple ruby servers
We do a rolling restart of the ruby servers
falcon would be managed by systemd
systemd would be configured to send a signal defined by falcon (SIGQUIT, SIGTERM, etc) to the controller process to trigger a graceful shutdown
Upon receiving the signal, it is expected that falcon immediately stop listening for and accepting new connections.. and thus nginx will take the node out of rotation automatically
Upon receiving the signal, it is also expected that any idle child processes processes shut themselves down
Once all idle processes are gone, it's expected that the main controller will shut down
If the main controller is still up after 90s (this is systemd default), systemd will send a SIGKILL to forcefully bring it down
Once the falcon is shutdown (either via graceful shutdown or via SIGKILL), systemd will start the process

In practice, we setup systemd with a KillSignal parameter configured for a graceful shutdown.
Thus systemctl stop falcon would gracefully shutdown..
.. and systemctl restart falcon would be nothing special - just a stop followed by a start.

More details about systemd's standard restart procedure is described in man systemd.kill or see here, online

yard · 2024-03-14T17:10:42Z

Hey guys! I have played a bit with graceful shutdown for falcon host trying to make it work locally but unfortunately didn't manage to make it work reliably. Will dump my approach and findings here, maybe someone (well, @ioquatix most likely :) will be able to suggest how to proceed.

Essentially, I decided to approach the graceful shutdown in the dumbest way possible: upon receiving a signal (I was handling INT, but it doesn't really matter and can be made configurable anyway) I invoke a (newly added) drain! method on the SharedEndpoint instance. This method, in its turn, is expected to stop accepting new connections immediately, wait for currently running requests to complete and shut down any active connections hanging around (there could be some even after completing all the request due to Keep-Alive).

I managed to get SharedEndpoint to stop accepting new connections by running calls to accept method in a loop with a small timeout until a flag is set (which gets enabled in drain! method). Once we complete handling the request(s), the connection(s) get closed, the loop exits, the application gets killed and everything is nice.

What I didn't manage to handle properly is exactly the Keep-Alive case. Whenever a connection is not closed by the client, it just stays in #each loop waiting for new requests and does not die – and since I was thinking about adding another signalling like drain! to connection object inside Async::HTTP::Server accept loop, but figured it might be better to ask first.

To sum up, I think the main challenge with my implementation is about implementing a proper signalling mechanism that can stop all these nested loops properly – almost like we use Thread.current.raise(Interrupt) to stop whatever code is running on INT signal normally (but we actually need something more fine-grained, since we don't want to stop the client logic). Adding drain! method throughout the whole stack seems weird and I believe there might be a better pattern?

ioquatix · 2024-03-14T19:08:03Z

I agree with improving this. The changes will need to take place in falcon and async-container. Using signals for everything might be tricky - there aren't really enough, so it might be the case we need to use some kind of RPC for detailed rolling restarts + status information.

I believe for the current implementation, sending SIGINT should start the shutdown process, followed by SIGTERM (or SIGKILL) to stop everything.

yard · 2024-03-15T17:53:23Z

Well, as far as graceful shutdown are considered – I think signals should be enough. All in all, the child processes/threads (spawned by a Container instance) should gracefully teardown, and the parent process should exit once these are done; doing that in a reaction to a signal seems alright.

The area I am a bit stuck at is terminating the connections gracefully. A SharedEndpoint instance created in the parent process gets passed down to child threads, which run an accept loop on top of it, where evert accepted connection is wrapped into a socket, where a stream is created off a socket (if I am correct with the hierarchy). This makes it a tad hard to handle properly as I don't see a nice (and preferably universal) way to signal every stream to tear down – is there one though? Or we do have to push some sort of a drain! chain of calls down to the very last level of nesting?

ioquatix · 2024-03-24T09:45:57Z

I will revisit this problem this week. I understand the use case.

ioquatix · 2024-03-25T06:51:30Z

Okay, here is where I'm landing on this:

Async::IO::SharedEndpoint (and IO::Endpoint::BoundEndpoint) both support #close which terminates the accept loop. See socketry/async-io#80 and socketry/io-endpoint#6 for details (and tests showing how it works).

Async::Container::Controller will introduce a graceful parameter which controls the shutdown sequence:

false: instant SIGTERM.
true: 1 second timeout.
Integer/Float: the timeout.

If a timeout is set, the existing logic handles it as so:

Send child processes SIGINT and wait for the given timeout.
Send child processes SIGTERM.

We do not attempt to send multiple SIGINT signals. That means, with graceful = 5 for example, the child process will receive SIGINT followed by SIGTERM 5 seconds later if it has not exited.

Falcon::Service::Server will intercept SIGINT and call @bound_endpoint.close. New connections won't be accepted. After the timeout, the server will receive SIGTERM which will immediately and forcefully terminate the event loop.

One part that's missing from this is broadcasting SIGINT further, i.e. if you have persistent connections like WebSockets, you might like to begin the shutdown sequence. It's not realistic for every WebSocket connection to intercept SIGINT... In other words, not accepting new connections is one piece. However, for persistent connections, what we want to do is also stop serving new requests, and for streaming responses, we want to start the shutdown process. I need to think a bit more about this signalling mechanism should work.

ioquatix · 2024-03-25T07:47:47Z

I think we will need to mark the request/response life cycle in order to defer the shutdown sequence, e.g. a bit like Thread#handle_interrupt.

In other words:

while request = read_request
  Async::Task.defer_stop do
    write_response(generate_response)
  end
  # If a shutdown caused this task to stop, we will stop here. If we took took long, SIGTERM will kill the above block.
end

I'll need to think about the design carefully, but this would make sense for most code which has to handle graceful shutdown.

ioquatix · 2024-03-26T02:24:12Z

socketry/async-container#31 adds tests and support for graceful stop. It was already implemented, but not exposed to user control.

zarqman · 2024-03-28T16:25:22Z

@ioquatix, I added graceful stop to our async-http powered app here and thought I'd share a couple notes on the current design in case it helps.

To handle keep-alive and websockets, we added a new field, Async::Task#shutdown_data. It's value is an indicator as to how to handle the shutdown for that connection / task.

HTTP1 connections set it to :h1_connection right before waiting for a new request line, and clear it once the HTTP1 request has been received and is actively processing. When present, it indicates the HTTP1 underlying connection can be terminated freely, i.e. when waiting for a new request during a keep-alive connection.

HTTP2 connections set it to [:h2_connection, connection] inside our HttpService < Async::HTTP::Server class. Since the HTTP2 protocol has its own graceful close facility, this is set at the very beginning of the connection / request cycle.

Finally, for websockets, it's set to :ws_stream on both the up and down tasks. (We use dedicated tasks to handle each side of the established WS connection.)

When a graceful stop is triggered, our HttpService < Async::HTTP::Server class traverses its entire task list, checking for tasks that are running? and have shutdown_data. If present, it calls subtask.stop for both HTTP1 and WS, and for HTTP2, calls conn.send_goaway(0) unless conn.closed?. Tasks that don't have shutdown_data are ignored. (WS may eventually get a more proper shutdown than simply stopping the tasks and closing the connections, but that's sufficient for our present needs.)

Additionally, on HTTP1, at the end of the response generation, if our HttpService < Async::HTTP::Server class has been marked stopping (aka in graceful shutdown), connection.persistent = false is forced right before returning, which catches the in-flight requests that the task list traversal above would have skipped.

I won't speak to the trigger mechanism as our app here is threads-only and doesn't use async-container, so it's probably mostly not applicable. Same as your recent merges, it closes the inbound shared endpoints first, then triggers the graceful stop on each thread.

There might be a better place to add shutdown_data than on Async::Task (and a better name), but that seemed handy enough for our purposes. If placing it on Async::Task is warranted, perhaps making it into a callback would keep it from being http-specific.

Worth noting that we also rely on socket timeouts everywhere and never have sockets without them. I don't recall whether we're relying on this separately from the above anywhere. FWIW, our graceful shutdown timeout is longer than our socket timeouts.

Anyway, hope that provides food for thought. Happy to answer questions or share code snippets if such would be helpful.

ioquatix · 2024-03-29T04:29:53Z

Thanks @zarqman that's extremely helpful to understand your use case.

It will take some iterations to get the exact behaviour correct, but I do like the simplicity of the defer_stop model. We probably just need to ensure that it's scoped around the right parts of the code, typically the response part.

I suppose any kind of graceful shutdown is best effort, but I'd love if our best graceful shutdown was extremely good and left clients in a good state. As you said, this also includes things like WebSockets.

For WebSockets, it's a little different, and I imagine we want something like this:

begin
  websocket = receive_inbound_websocket
  while message = queue.next_message
    websocket.send(message)
  end
ensure
  websocket.close # When SIGINT -> Async::Stop occurs, we will trigger this.
  # We should ensure that `websocket.close` correctly shuts down the connection.
end

My current thinking is that the above logic is sufficient for correctly shutting down a WebSocket. The ensure block will execute in response to the Async::Stop, but internally we might still block during the ensure block while sending the close frame.

yard · 2024-04-03T14:38:30Z

My 5c: like defer_stop approach, but a small clarification would be helpful. This would only prevent #stop from functioning on the top-level task (=effectively, in response to a shutdown sequence), right? The user's code can still rely on #stop and it's not a good idea to "mask" all the invocations as otherwise processing requests during graceful shutdown might break.

Another bit to watch out for is keep-alive connections. IIRC closing the bound endpoint won't do anything to those, so we still need to stop listening for new http requests on connections which are currently active (maybe close them as soon as the response to the last one is written or smth).

ioquatix · 2024-04-04T06:54:10Z

My 5c: like defer_stop approach, but a small clarification would be helpful. This would only prevent #stop from functioning on the top-level task (=effectively, in response to a shutdown sequence), right? The user's code can still rely on #stop and it's not a good idea to "mask" all the invocations as otherwise processing requests during graceful shutdown might break.

More generally, there is nothing stopping user code doing this:

while true
  begin
    sleep
  rescue => Async::Stop
    # ignore
  end
end

There is simply not guarantee that it's possible to prevent stopping. However, Ruby 3.3 introduced Fiber#kill which is much harder to ignore. In the future, that may be used in Async::Node#terminate.

Therefore, I don't think it really matters - bad code is bad code - good code is good code, and stopping will always be tricky. Additional features to mask user-level stop or some other feature won't solve the problem, but will make things more complicated, which I'd like to avoid.

Another bit to watch out for is keep-alive connections. IIRC closing the bound endpoint won't do anything to those, so we still need to stop listening for new http requests on connections which are currently active (maybe close them as soon as the response to the last one is written or smth).

When we issue stop, it will cause the accept loop to shut down. So no new connections would be shut down. In addition, the HTTP/1 connection request/response loop will terminate after the response is finished sending. HTTP/2 is a little more tricky, but conceptually similar. In fact, in most cases, you don't need to worry about closing the bound endpoint, as you actually want the new process to start accepting the inbound connections, and thus don't want to close the bound endpoint (but of course the accept loop in the old processes is terminated).

yard · 2024-04-04T18:32:29Z

More generally, there is nothing stopping user code doing this:

You are right, but my point is that the user code should not be stopped during graceful termination. Sure thing, when the timeout expires and the code still runs, we are happy to kill everything, but while there is time left, the user's code should "just work" – including awaiting the responses from APIs, database and whatnot. Otherwise, it's not exactly a graceful termination :)

In addition, the HTTP/1 connection request/response loop will terminate after the response is finished sending.

What would happen if the request is still being read during the graceful termination initiation? Will it be read and full and then executed or the read will bail out with an error?

ioquatix · 2024-04-05T07:48:07Z

You are right, but my point is that the user code should not be stopped during graceful termination. Sure thing, when the timeout expires and the code still runs, we are happy to kill everything, but while there is time left, the user's code should "just work" – including awaiting the responses from APIs, database and whatnot. Otherwise, it's not exactly a graceful termination :)

Yes, agreed.

What would happen if the request is still being read during the graceful termination initiation? Will it be read and full and then executed or the read will bail out with an error?

This is much more tricky.

I suppose for HTTP/1, we need to respond at least once with connection: close, so that the client does not try to send any more requests, or in other words, the client is informed clearly that the connection is shutting down.

For HTTP/2, there might be a similar mechanic we can use.

zarqman · 2024-04-05T16:33:22Z

In addition, the HTTP/1 connection request/response loop will terminate after the response is finished sending.

This needs to happen, but the defer_stop model cannot do this alone. The HTTP1 server will need a stopping = true flag that will cause it to not loop in Async::HTTP::Protocol::HTTP1::Server#each after the end of the present request. We can also use this flag at the end of the request to override persistent? to false and send connection: close.

It may be possible to reuse/override @persistent = false (instead of a new stopping flag) on each open connection. There is a slight risk of a race here though in the early stages of the request (after read_request_line and before @persistent=... finishes in read_request.)

What would happen if the request is still being read during the graceful termination initiation? Will it be read and full and then executed or the read will bail out with an error?

With a flag, as long as the request has started (the first line has been read), it's no problem to let it finish.

I suppose for HTTP/1, we need to respond at least once with connection: close, so that the client does not try to send any more requests, or in other words, the client is informed clearly that the connection is shutting down.

If there's an active request, we should definitely respond with connection: close. However, if we're already waiting for a 2nd+ (ie: @count >= 1) request (specifically, waiting in h1's read_request_line), we should just close the connection immediately. But, if it's the first request (@count == 0), which just snuck in before the graceful stop, we should still process it.

In my experience, h1.1 clients are generally aware that the 2nd+ request might fail and often have quick retry mechanisms. For the first request, that's not a given though (doubly so for h1.0 clients).

HTTP/2 is a little more tricky, but conceptually similar.

Shouldn't we trigger h2's own graceful stop mechanism (send_goaway(0)) for each existing connection too? Otherwise existing h2 connections will be allowed to continue initiating new streams (aka keep-alive) and thus graceful shutdown will either never complete or eventually have to be forcefully shutdown.

This is particularly important when async-http or falcon is fronted by an LB that multiplexes inbound requests into a smaller number of outbound h2 requests to the backend and reuses those h2 connections extensively. Actually, same is true for h1 LBs that also perform connection reuse.

ioquatix mentioned this issue Aug 22, 2019

Echo server with graceful shutdown. socketry/async-io#31

Closed

atitan mentioned this issue Aug 23, 2019

Add Websites using Falcon section to README #72

Merged

This was referenced Feb 16, 2021

falcon support sinatra/sinatra#1639

Closed

Adding falcon integration sinatra/sinatra#1677

Closed

ioquatix self-assigned this Nov 2, 2021

trevorturk mentioned this issue Feb 21, 2023

Gracefully restart running application #188

Open

ioquatix mentioned this issue Mar 25, 2024

Graceful shutdown socketry/async-container#31

Merged

3 tasks

ioquatix mentioned this issue Mar 27, 2024

Defer stop socketry/async-http#153

Merged

3 tasks

zarqman mentioned this issue Apr 5, 2024

Can no longer force stop an HTTP server task socketry/async-http#154

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Graceful shutdown #71

Graceful shutdown #71

atitan commented Aug 22, 2019

ioquatix commented Aug 22, 2019

atitan commented Aug 22, 2019

ioquatix commented Aug 22, 2019

ioquatix commented Aug 22, 2019

ioquatix commented Aug 22, 2019

atitan commented Aug 22, 2019

veesahni commented Nov 1, 2021 •

edited

Loading

yard commented Mar 14, 2024

ioquatix commented Mar 14, 2024 •

edited

Loading

yard commented Mar 15, 2024

ioquatix commented Mar 24, 2024

ioquatix commented Mar 25, 2024

ioquatix commented Mar 25, 2024 •

edited

Loading

ioquatix commented Mar 26, 2024

zarqman commented Mar 28, 2024

ioquatix commented Mar 29, 2024 •

edited

Loading

yard commented Apr 3, 2024

ioquatix commented Apr 4, 2024

yard commented Apr 4, 2024 •

edited

Loading

ioquatix commented Apr 5, 2024

zarqman commented Apr 5, 2024

Graceful shutdown #71

Graceful shutdown #71

Comments

atitan commented Aug 22, 2019

ioquatix commented Aug 22, 2019

atitan commented Aug 22, 2019

ioquatix commented Aug 22, 2019

ioquatix commented Aug 22, 2019

ioquatix commented Aug 22, 2019

atitan commented Aug 22, 2019

veesahni commented Nov 1, 2021 • edited Loading

yard commented Mar 14, 2024

ioquatix commented Mar 14, 2024 • edited Loading

yard commented Mar 15, 2024

ioquatix commented Mar 24, 2024

ioquatix commented Mar 25, 2024

ioquatix commented Mar 25, 2024 • edited Loading

ioquatix commented Mar 26, 2024

zarqman commented Mar 28, 2024

ioquatix commented Mar 29, 2024 • edited Loading

yard commented Apr 3, 2024

ioquatix commented Apr 4, 2024

yard commented Apr 4, 2024 • edited Loading

ioquatix commented Apr 5, 2024

zarqman commented Apr 5, 2024

veesahni commented Nov 1, 2021 •

edited

Loading

ioquatix commented Mar 14, 2024 •

edited

Loading

ioquatix commented Mar 25, 2024 •

edited

Loading

ioquatix commented Mar 29, 2024 •

edited

Loading

yard commented Apr 4, 2024 •

edited

Loading