Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(deployment): simplify probes command to reduce false positives #135

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

radicand
Copy link
Contributor

@radicand radicand commented Sep 7, 2024

In some (rare - one documented in #134) instances, the deployment liveness/readiness probes may time out while running. Investigation into this seems to be that the supervisorctl status command returned the right information, but somewhere in the double grep chaining the command could hang.

The good news is there is no need for the double grepping. supervisorctl status will return an exit code of 0 when all services provided as arguments are in the RUNNING state, which is what we're intending to check anyway. If any services are not running, supervisorctl status returns a non-zero exit code which k8s will interpret as a probe failure.

This cleaner method is actually more reliable regardless, as it could be the case in the previous implementation that the probe would not fail if one of the key services was not running, but any one of the specified services was.

Finally, note that in this pull request I only included the baseline services that all installations use - there is an opportunity to conditionally include more services in the check based on what the user has chosen to enable in their values.yaml.

radicand and others added 3 commits September 4, 2024 19:08
The double-grep behavior led to the command unintentionally hanging in some cases, prompting k8s to continually restart the pod without valid reason.
note: there is optional work remaining to conditionally add more services to liveness probe based on what the user has chosen to enable (or things that may perhaps always be runnig).
@@ -172,7 +172,8 @@ spec:
command:
- /bin/bash
- -c
- supervisorctl status | grep -E "amavis|clamav|cron|dovecot|mailserver|opendkim|opendmarc|postfix|rsyslog" | grep RUNNING
- supervisorctl status cron mailserver postfix rsyslog
# TODO: conditionally? add 'amavis changedetector clamav cron dovecot fail2ban fetchmail mta-sts-daemon opendkim opendmarc postgrey postsrsd rspamd rspamd-redis saslauthd_ldap saslauthd_mysql saslauthd_pam saslauthd_rimap saslauthd_shadow'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you were wanting to do a broader status check like that, you might as well contribute it upstream so the call here could be a simple command still like dms-healthcheck?

That'd be a shell script that has a list of base features, followed by some logic to append services based on ENV (or probably the env file we write during startup that contains defaults/fallbacks). That way you're not querying status of services that are intentionally not running.

It'd also be simpler to maintain since it'd be in sync with the upstream DMS releases, and those not using helm can also benefit from the simplified healthcheck.

Regarding saslauthd, we only really support ldap and rimap AFAIK, the others were added by the contributor of saslauthd support but we have no test coverage or support offered for such. changedetector presently does not work with LDAP either.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so how would we like to resolve this? I'm find with the PR, but @polarathene sounds like a good suggestion too?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants