Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nomad Agent Restart Processes #677

Open
mpass99 opened this issue Sep 4, 2024 · 4 comments
Open

Nomad Agent Restart Processes #677

mpass99 opened this issue Sep 4, 2024 · 4 comments
Labels
bug Something isn't working pending

Comments

@mpass99
Copy link
Contributor

mpass99 commented Sep 4, 2024

Related to #612.

When restarting a Nomad Agents, we find many systemd warnings.

  • nomad.service: Unit process 3620 (nomad) remains running after unit stopped.
  • nomad.service: Found left-over process 3620 (nomad) in control group while starting unit. Ignoring.
  • nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
  1. Create minimal reproduction steps
  2. Investigate Nomads Graceful Shutdown behavior and if the systemd restart waits for it
  3. Create an upstream issue if appropriate
@mpass99 mpass99 added the bug Something isn't working label Sep 4, 2024
@mpass99
Copy link
Contributor Author

mpass99 commented Sep 9, 2024

Minimal Reproduction

  1. Setup Ubuntu Instance
  2. Install Docker https://docs.docker.com/engine/install/ubuntu/
  3. Install Nomad https://developer.hashicorp.com/nomad/install
  4. Start Nomad systemctl start nomad.service
  5. Start some Job
  6. Restart Nomad systemctl restart nomad.service

(No) Issue

The above-mentioned warnings appear. However, Nomad accepts them as discussed in hashicorp/nomad#17134 to allow in-place upgrades.

Nomads Graceful Shutdown behavior

Systemd and the Nomad Process does not wait for the graceful shutdown deadline. Instead, Nomad restarts directly. As described in the linked Issue, the process of handling the workload remains running. When starting again, Nomad takes again control of the workload and stops it when the deadline is reached.

Nomad Logs

Starting Restart Mon Sep 9 13:59:50 UTC 2024

2024-09-09T13:59:50.853969+00:00 nomad-agent-terraform-3 systemd[1]: Stopping nomad.service - Nomad...
2024-09-09T13:59:50.854424+00:00 nomad-agent-terraform-3 nomad[2008]: ==> Caught signal: interrupt
2024-09-09T13:59:50.854528+00:00 nomad-agent-terraform-3 nomad[2008]: ==> Gracefully shutting down agent...
2024-09-09T13:59:50.863290+00:00 nomad-agent-terraform-3 nomad[2008]:     2024-09-09T13:59:50.863Z [INFO]  client.drain: monitoring self-drain
2024-09-09T13:59:55.854929+00:00 nomad-agent-terraform-3 nomad[2008]:     2024-09-09T13:59:55.854Z [INFO]  agent: requesting shutdown
2024-09-09T13:59:55.855082+00:00 nomad-agent-terraform-3 nomad[2008]:     2024-09-09T13:59:55.854Z [INFO]  client: shutting down
2024-09-09T13:59:55.857090+00:00 nomad-agent-terraform-3 nomad[2008]:     2024-09-09T13:59:55.856Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=device
2024-09-09T13:59:55.857148+00:00 nomad-agent-terraform-3 nomad[2008]:     2024-09-09T13:59:55.856Z [INFO]  client.plugin: plugin manager finished: plugin-type=device
2024-09-09T13:59:55.857211+00:00 nomad-agent-terraform-3 nomad[2008]:     2024-09-09T13:59:55.856Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=driver
2024-09-09T13:59:55.864934+00:00 nomad-agent-terraform-3 nomad[2008]:     2024-09-09T13:59:55.864Z [INFO]  client.plugin: plugin manager finished: plugin-type=driver
2024-09-09T13:59:55.864986+00:00 nomad-agent-terraform-3 nomad[2008]:     2024-09-09T13:59:55.864Z [INFO]  client.plugin: shutting down plugin manager: plugin-type=csi
2024-09-09T13:59:55.865024+00:00 nomad-agent-terraform-3 nomad[2008]:     2024-09-09T13:59:55.864Z [INFO]  client.plugin: plugin manager finished: plugin-type=csi
2024-09-09T13:59:55.876838+00:00 nomad-agent-terraform-3 nomad[2008]:     2024-09-09T13:59:55.876Z [INFO]  agent: shutdown complete
2024-09-09T13:59:55.883507+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
2024-09-09T13:59:55.883672+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Failed with result 'exit-code'.
2024-09-09T13:59:55.883804+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Unit process 4096 (nomad) remains running after unit stopped.
2024-09-09T13:59:55.883882+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Unit process 4174 (nomad) remains running after unit stopped.
2024-09-09T13:59:55.884372+00:00 nomad-agent-terraform-3 systemd[1]: Stopped nomad.service - Nomad.
2024-09-09T13:59:55.884488+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Consumed 12.428s CPU time, 48.3M memory peak, 0B memory swap peak.
2024-09-09T13:59:55.885765+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Found left-over process 4096 (nomad) in control group while starting unit. Ignoring.
2024-09-09T13:59:55.885858+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-09-09T13:59:55.886000+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: Found left-over process 4174 (nomad) in control group while starting unit. Ignoring.
2024-09-09T13:59:55.886079+00:00 nomad-agent-terraform-3 systemd[1]: nomad.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-09-09T13:59:55.891120+00:00 nomad-agent-terraform-3 systemd[1]: Starting nomad.service - Nomad...
2024-09-09T13:59:55.945034+00:00 nomad-agent-terraform-3 nomad[4243]: ==> Loaded configuration from /etc/nomad.d/nomad.hcl
2024-09-09T13:59:55.945199+00:00 nomad-agent-terraform-3 nomad[4243]: ==> Starting Nomad agent...
2024-09-09T13:59:57.052896+00:00 nomad-agent-terraform-3 nomad[4243]: 2024-09-09T13:59:57.052Z [TRACE] plugin.stdio: waiting for stdio data
2024-09-09T13:59:57.058520+00:00 nomad-agent-terraform-3 nomad[4243]: ==> Nomad agent configuration:
2024-09-09T13:59:57.058583+00:00 nomad-agent-terraform-3 nomad[4243]:        Advertise Addrs: HTTP: 10.224.6.145:4646
2024-09-09T13:59:57.058603+00:00 nomad-agent-terraform-3 nomad[4243]:             Bind Addrs: HTTP: [[::]:4646]
2024-09-09T13:59:57.058619+00:00 nomad-agent-terraform-3 nomad[4243]:                 Client: true
2024-09-09T13:59:57.058638+00:00 nomad-agent-terraform-3 nomad[4243]:              Log Level: INFO
2024-09-09T13:59:57.058666+00:00 nomad-agent-terraform-3 nomad[4243]:                 Region: global (DC: dc1)
2024-09-09T13:59:57.058691+00:00 nomad-agent-terraform-3 nomad[4243]:                 Server: false
2024-09-09T13:59:57.058706+00:00 nomad-agent-terraform-3 nomad[4243]:                Version: 1.8.3
2024-09-09T13:59:57.058724+00:00 nomad-agent-terraform-3 nomad[4243]: ==> Nomad agent started!
2024-09-09T13:59:57.063703+00:00 nomad-agent-terraform-3 nomad[4243]:     2024-09-09T13:59:57.041Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=fe1edbdd-d954-17cc-0bcd-9aec851fa87e task=default-task type=Received msg="Task received by client" failed=false
2024-09-09T14:00:06.886456+00:00 nomad-agent-terraform-3 nomad[4243]:     2024-09-09T14:00:06.886Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=fe1edbdd-d954-17cc-0bcd-9aec851fa87e task=default-task type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false

@MrSerth
Copy link
Member

MrSerth commented Sep 10, 2024

Since we have leave_on_interrupt, leave_on_terminate, and drain_on_shutdown configured, I would expect an actual shutdown of all Nomad-related processes (that's how I understood the issue description). However, that's not case, is it?

@mpass99
Copy link
Contributor Author

mpass99 commented Sep 11, 2024

I would expect an actual shutdown of all Nomad-related processes (that's how I understood the issue description

I agree. That is also what the docs state.

If this value is set to true on a client agent and the client is configured with drain_on_shutdown, the client will drain its workloads before shutting down.

However, that's not the case. A restart of Nomad on an Agent takes about 6 seconds. Some subprocesses keep running after Nomad stopped and the draining is handled after Nomad started again.

Systemctl seems to be not the cause for this as it would wait up to 90 seconds by default [1]. TimeoutStopSec is not overwritten in Nomad's service definition.

Maybe this question will resolve in the context of hashicorp/nomad#23937.

@MrSerth
Copy link
Member

MrSerth commented Sep 25, 2024

We are waiting for a response in the upstream issue. Until then, we are blocked and this issue is pending.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pending
Projects
None yet
Development

No branches or pull requests

3 participants
@MrSerth @mpass99 and others