st2actionrunner graceful shutdown #86

Open
guzzijones opened this issue Aug 24, 2021 · 6 comments
Comments


guzzijones commented Aug 24, 2021

This ticket will hold research into graceful shutdown of st2actionrunner. This is in anticipation of adding a way, through the OS or otherwise, to scale st2actionrunners up and down based on some metric.

My initial research led me to this section of code where the st2actionrunner takes ownership of a scheduled action:
st2actionrunner takes ownership

The st2actionrunner abandon code is here:
st2actionrunner abandon code

The teardown for the parent process is here:
st2actionrunner teardown

We are probably going to create a custom heartbeat script that monitors the number of st2actionrunner processes on a VM and tells the autoscaler to wait until the work is done:

import boto3

# Auto Scaling client; region and credentials come from the environment/instance role.
client = boto3.client('autoscaling')

# Tell the Auto Scaling group's lifecycle hook to keep waiting.
response = client.record_lifecycle_action_heartbeat(
    LifecycleHookName='string',
    AutoScalingGroupName='string',
    LifecycleActionToken='string',
    InstanceId='string'
)
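
A rough sketch of what that supplemental script could look like: count the local st2actionrunner processes and keep the lifecycle hook alive until they are gone, then let the instance terminate. The hook and group names are placeholders, and counting processes with pgrep is just one way to detect remaining work.

import subprocess
import time

import boto3

ASG_NAME = 'st2-actionrunner-asg'   # hypothetical Auto Scaling group name
HOOK_NAME = 'st2-drain-hook'        # hypothetical lifecycle hook name


def running_actionrunners():
    # Count st2actionrunner processes still alive on this VM.
    out = subprocess.run(['pgrep', '-c', '-f', 'st2actionrunner'],
                         capture_output=True, text=True)
    return int(out.stdout.strip() or 0)


def wait_for_drain(client, instance_id, token):
    # Keep the lifecycle hook alive while runners are still working.
    while running_actionrunners() > 0:
        client.record_lifecycle_action_heartbeat(
            LifecycleHookName=HOOK_NAME,
            AutoScalingGroupName=ASG_NAME,
            LifecycleActionToken=token,
            InstanceId=instance_id,
        )
        time.sleep(30)
    # All runners have finished; allow the Auto Scaling group to terminate the VM.
    client.complete_lifecycle_action(
        LifecycleHookName=HOOK_NAME,
        AutoScalingGroupName=ASG_NAME,
        LifecycleActionToken=token,
        InstanceId=instance_id,
        LifecycleActionResult='CONTINUE',
    )
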
guzzijones (Author) commented

Another possibility is for the autoscaler system to query whether the st2actionrunner being shut down has taken ownership of any jobs, and if so, wait until it no longer has ownership.
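
A minimal sketch of that check against the st2 API, assuming an API key and using running executions as a proxy for "a runner still owns work" (the endpoint URL and key below are placeholders):

import time

import requests

ST2_API = 'https://st2.example.com/api/v1'   # hypothetical st2 API base URL
API_KEY = 'XXXX'                             # st2 API key used for the query


def running_executions():
    # List executions currently in the 'running' state.
    resp = requests.get(
        f'{ST2_API}/executions',
        params={'status': 'running'},
        headers={'St2-Api-Key': API_KEY},
    )
    resp.raise_for_status()
    return resp.json()


# Wait until nothing is running before letting the autoscaler proceed.
# Note: this check is cluster-wide; narrowing it to the specific runner being
# shut down would need extra bookkeeping about which runner owns each execution.
while running_executions():
    time.sleep(30)
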


nzlosh commented Aug 24, 2021

What is an autoscaler in this context?

guzzijones (Author) commented

An AWS dynamic autoscaling policy.


arm4b commented Aug 24, 2021

Do we need some kind of way to mark the specific st2actionrunner as "unschedulable"?
Otherwise, in a heavily used, dynamic st2 environment it'll pick up the next task from the queue as soon as the previous one finishes.

Talking about the mechanisms:
Maybe sending a SIGTERM (or another signal) to the st2actionrunner process so it stops picking up new jobs and finishes the current one?
Or do we need something more advanced, like a new API endpoint to drain the st2actionrunner?
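
For reference, the drain-on-SIGTERM pattern being discussed looks roughly like this in generic form (an illustration of the idea only, not st2actionrunner's actual handler or queue API):

import signal

draining = False


def handle_sigterm(signum, frame):
    # Stop accepting new work; the job currently in flight is allowed to finish.
    global draining
    draining = True


signal.signal(signal.SIGTERM, handle_sigterm)


def worker_loop(queue):
    while not draining:
        job = queue.get_next_job()   # hypothetical queue API
        if job is not None:
            job.run()
    # The loop exits after the current job completes, so the process
    # can shut down cleanly without abandoning work.
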

guzzijones (Author) commented

It looks like a SIGTERM is all that is needed. The st2actionrunner will then put the message back on the queue for rescheduling and exit. The only problem is that AWS Dynamic Scaling will immediately kill the VM unless you use boto3's record_lifecycle_action_heartbeat to tell AWS to wait while the process is still shutting down. I see this as a supplemental Python script specific to AWS autoscaling; I don't even think it should be part of the core st2 codebase, imo.
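
The shutdown side of such a script could be as small as the following sketch (send SIGTERM to the local runners, then let a heartbeat loop like the one above keep AWS waiting; the use of pkill and the process name pattern are assumptions):

import subprocess

# Ask every st2actionrunner on this VM to drain: on SIGTERM each runner should
# hand its in-flight work back for rescheduling (or finish it) and then exit.
subprocess.run(['pkill', '-TERM', '-f', 'st2actionrunner'], check=False)

# From here, a lifecycle-hook heartbeat loop (as sketched earlier) keeps AWS
# from terminating the instance until the last runner process has exited.
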


arm4b commented Aug 25, 2021

Yeah, right.
A higher-level orchestrator should give st2actionrunner some time (like terminationGracePeriodSeconds) to finish its work after sending the signal (see the pod spec sketch below).

In the context of K8s, when the pod is terminated it goes through the following lifecycle:

  • The Pod is set to the “Terminating” state and removed from the endpoints list of all Services.
  • A SIGTERM signal is sent to the main process in each container, and a “grace period” countdown starts.
  • Upon receiving the SIGTERM, each container should start a graceful shutdown of the running application and exit.
  • The graceful shutdown period is configurable (up to really long periods) to let the process (in our case st2actionrunner) finish its work.
  • If a container doesn’t terminate within the grace period, a SIGKILL signal is sent and the container is forcibly terminated.
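
In pod spec terms that grace period is the terminationGracePeriodSeconds field; a minimal sketch, with the name, image, and 1-hour period chosen purely for illustration:

apiVersion: v1
kind: Pod
metadata:
  name: st2actionrunner              # hypothetical pod name
spec:
  # Give st2actionrunner up to 1 hour after SIGTERM to finish in-flight work
  # before Kubernetes sends SIGKILL (the default grace period is 30 seconds).
  terminationGracePeriodSeconds: 3600
  containers:
    - name: st2actionrunner
      image: stackstorm/st2actionrunner   # illustrative image reference
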

