Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: avoid stopping nginx-agent service on package upgrade #352

Merged
merged 7 commits into from
Nov 17, 2023

Conversation

defanator
Copy link
Contributor

@defanator defanator commented Jun 21, 2023

Proposed changes

This change introduces a number of improvements around packaging nginx-agent to ensure that the following behavior is maintained:

  1. The nginx-agent service is called with stop argument on package removal.
  2. The nginx-agent service is called with restart argument on package upgrade if a service is running at the moment of package upgrade (previously, a service has just been stopped without any notice).
  3. On fresh installations, or on upgrades when nginx-agent service from older package is not running, it will not be started automatically.

While here, added OpenRC init scripts to make service nginx-agent command fully functional on Alpine Linux.

Tested on:

  • Debian 11 (.deb)
  • RHEL 8 (.rpm)
  • Alpine 3.18 (.apk)

Unfortunately, FreeBSD is failing to follow the 2nd rule (restart on upgrade if a service was running) due to limitation of the pkg package manager (see freebsd/pkg#2128 for more details).

Fixes #303.

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING document
  • I have run make install-tools and have attached any dependency changes to this pull request
  • If applicable, I have added tests that prove my fix is effective or that my feature works
  • If applicable, I have checked that any relevant tests pass after adding my changes
  • If applicable, I have updated any relevant documentation (README.md)
  • If applicable, I have tested my cross-platform changes on Ubuntu 22, Redhat 8, SUSE 15 and FreeBSD 13

@netlify
Copy link

netlify bot commented Jun 21, 2023

Deploy Preview for agent-public-docs canceled.

Name Link
🔨 Latest commit 9d70b92
🔍 Latest deploy log https://app.netlify.com/sites/agent-public-docs/deploys/65549d7ae600b00007989eee

@github-actions github-actions bot added bug Something isn't working documentation Improvements or additions to documentation labels Jun 21, 2023
@defanator
Copy link
Contributor Author

Tagging @thresheek @fblr @oxpa for the sake of visibility.

Copy link
Collaborator

@dhurley dhurley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the addition of openrc for alpine looks good, but I'm not sure about changing the current behaviour of the agent during upgrade. We require the agent to be stopped by the user before upgrade and then started again after the upgrade by the user.

scripts/packages/nginx-agent.openrc Outdated Show resolved Hide resolved
scripts/packages/postinstall.sh Outdated Show resolved Hide resolved
@defanator
Copy link
Contributor Author

We require the agent to be stopped by the user before upgrade and then started again after the upgrade by the user.

@dhurley this is quite counterintuitive and stumbling behavior; what was the rationale behind that decision?

To elaborate on a context: we came across the situation when nginx-agent service silently stopped during the usual package upgrade process with the appropriate package manager (yum/dnf, apt, apk) on a large fleet of instances. There were no any errors from corresponding package scripts during the upgrade, and there were no any signs of broken agent's configuration etc - so from the administrator's point of view, upgrade went fine - except the fact that we ended up with stopped monitoring for all affected instances.

I'm wondering where that requirement of manual stopping and starting came from. Agent's process can be restarted gracefully, so why prevent it to do so?

@dhurley
Copy link
Collaborator

dhurley commented Jun 27, 2023

We require the agent to be stopped by the user before upgrade and then started again after the upgrade by the user.

@dhurley this is quite counterintuitive and stumbling behavior; what was the rationale behind that decision?

To elaborate on a context: we came across the situation when nginx-agent service silently stopped during the usual package upgrade process with the appropriate package manager (yum/dnf, apt, apk) on a large fleet of instances. There were no any errors from corresponding package scripts during the upgrade, and there were no any signs of broken agent's configuration etc - so from the administrator's point of view, upgrade went fine - except the fact that we ended up with stopped monitoring for all affected instances.

I'm wondering where that requirement of manual stopping and starting came from. Agent's process can be restarted gracefully, so why prevent it to do so?

Hi @defanator, actually it looks like our docs might be a bit out of date. Yes you actually shouldn't need to stop the agent before upgrading. When we upgrade the agent, the old version of the agent should continue to run and its not until the user restarts the agent that the new version of the agent is used. Sorry about the confusion.

Copy link
Collaborator

@dhurley dhurley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the issue where the agent was being stopped during upgrades. I just have questions related to upgrading on alpine.

scripts/packages/postremove.sh Outdated Show resolved Hide resolved
scripts/packages/postupgrade.sh Show resolved Hide resolved
@defanator
Copy link
Contributor Author

When we upgrade the agent, the old version of the agent should continue to run and its not until the user restarts the agent that the new version of the agent is used.

@dhurley such behavior will introduce an inconsistent setup when your running binary will no longer be available on disk, which is potentially just another delayed point of failure (aka time bomb): imagine a situation when users won't restart service after upgrade till system reboot or another component(s) upgrade that will involve restarting services affected by particular changed shared library - which is a common scenario in modern Ubuntu/Debian - in this situation, in case of any unexpected issues with agent, it will be impossible to just restart the service without doing rollback to previous package version.

Restarting service on upgrade (and failing immediately if a restart was unsuccessful) solves this inconsistency + allows a user to be alarmed on possible issues as soon as possible.

We spent a while building similar patterns for other nginx products (just noticing - given the fact agent's trying to follow established principles / support matrix of nginx/nginx-plus); I'm happy to elaborate more on this if required.

@defanator
Copy link
Contributor Author

@dhurley @oliveromahony any updates on this one?

@defanator
Copy link
Contributor Author

defanator commented Nov 15, 2023

TWIMC, rebased this against latest main and tested on these platforms:

  • Debian 11 "bullseye",
  • Alpine 3.18,
  • RHEL 9.3.

Testing included:

  1. Building local .deb/.apk/.rpm with VERSION=v2.30.3 (latest tagged) and VERSION=v2.30.4 (bumped for tests).
  2. Installing v2.30.3 package on target OS, without any previous version, verifying that the nginx-agent service wasn't started after initial installation.
  3. Upgrading v2.30.3 to v2.30.4 via standard package manager tool on target OS, verifying that the service was restarted after upgrade.
  4. Removing the package via standard package manager tool, verifying that the service was stopped before package removal/purge.

@oliveromahony oliveromahony merged commit 89fc334 into nginx:main Nov 17, 2023
28 checks passed
@dhurley dhurley mentioned this pull request Dec 6, 2023
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Uninstalling agent duplicates messages
3 participants