
Avoid upgrade being killed by failed liveness probes #344

Open
wants to merge 3 commits into base: main

Conversation

@remram44 commented Jan 27, 2023

Pull Request

Description of the change

This runs the upgrade in an init container, before the main container starts. It sets the command arguments to "true", so the container exits immediately after the upgrade, and sets the variable NEXTCLOUD_UPDATE to 1, without which the upgrade step is skipped because there are command arguments (see entrypoint).
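
For illustration, a minimal sketch of what the rendered Deployment fragment could look like (container names, image tag, and volume name are placeholders, not the chart's exact output):

spec:
  initContainers:
    - name: nextcloud-upgrade          # hypothetical name
      image: nextcloud:27.1.3-apache   # whatever image tag the chart resolves to
      args: ["true"]                   # exits right after the entrypoint has finished the upgrade
      env:
        - name: NEXTCLOUD_UPDATE       # needed so the entrypoint still upgrades despite custom args
          value: "1"
      volumeMounts:
        - name: nextcloud-main
          mountPath: /var/www/html
  containers:
    - name: nextcloud
      image: nextcloud:27.1.3-apache
      # liveness/readiness probes stay on this container; init containers are never probed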

Benefits

Avoid the container being stopped during the upgrade because of failed liveness probes, since init containers don't get probed.

Possible drawbacks

Runs an additional container, so it's a bit slower I guess.

Applicable issues

Additional information

This also removes the nextcloud.update value. I don't see a way people could possibly have used it though, since it only does something when you pass different arguments to the image, and there is no value in this chart that will allow you to do that.

Checklist

@jessebot (Collaborator)

Thanks for submitting a PR! :)

Will this always run an upgrade anytime the pod restarts or additional pods are spun up? Is there a way to disable upgrades until a user is ready?

@remram44 (Author)

It will update the volume to the version of the image, so it's usually triggered by the user doing helm upgrade (or kubectl set image I guess). If a Pod is restarted without changing version, nothing will happen.

@remram44 (Author) commented Feb 9, 2023

This doesn't solve the situation where you need to run the web installer. Probes should probably be changed not to fail if the web installer needs to be run (otherwise the pod never becomes ready because Nextcloud isn't installed, and you can't reach the web installer to install it)

@Jeroen0494 commented Mar 4, 2023

This doesn't solve the situation where you need to run the web installer. Probes should probably be changed not to fail if the web installer needs to be run (otherwise the pod never becomes ready because Nextcloud isn't installed, and you can't reach the web installer to install it)

Should you even be using the web based upgrader at all when using container images? The image has the updated files. If you use the web-based upgrader but don't update your container image version, you'll break stuff for sure.

@Jeroen0494

You can disable the web-based updater to prevent people from borking their installation:

'upgrade.disable-web' => true,

https://docs.nextcloud.com/server/stable/admin_manual/configuration_server/config_sample_php_parameters.html
Search for the upgrade.disable-web option.
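
With this chart, that option could for example be injected through the nextcloud.configs value (a hedged sketch; the values key and file name here are assumptions, adjust to however you add extra config.php settings):

nextcloud:
  configs:
    disable-web-updater.config.php: |-
      <?php
      $CONFIG = array(
        'upgrade.disable-web' => true,
      );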

@remram44 (Author) commented Mar 4, 2023

Yeah I also think it should auto-install but that is not the default behavior of running helm install

@4censord

Hey, what would be needed to get this merged?
Also, is it possible to backport this change to the last 3 versions, so 25, 26 and 27?

@provokateurin (Member)

@4censord I haven't looked at this PR closely, but the conflicts need to be solved and the changes need to be reviewed. With this chart there is no backporting, you just override nextcloud.tag with your desired version.
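
For example, in your values (a hedged sketch; recent chart versions spell this value image.tag rather than nextcloud.tag, and the tag shown is only a placeholder):

image:
  tag: 27.1.2-apache   # pin explicitly instead of relying on the chart default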

@4censord

I'll take a look at resolving the conflicts.

With this chart there is no backporting, you just override nextcloud.tag with your desired version.

Ok

@4censord

The conflicts are solvable by simply rebasing onto the main branch.
@remram44 can you rebase, please?
Or if you'd be fine with that, I could open a new PR with your changes, and we close this one.

@remram44 (Author)

If I rebase, will somebody review? I don't like to put in work for it to be ignored

@budimanjojo

@jessebot can you please take a look at this? This is a longstanding issue with this chart, and currently the only way to prevent the upgrade failing is to disable all probes, which is not really a solution.

I'm using the fluxcd GitOps tool to manage my cluster, and when the helm upgrade fails because it takes longer than the probes allow, fluxcd will roll back the pod to the last working state. But Nextcloud doesn't want to start in the last working state either, because of this error:

Can't start Nextcloud because the version of the data (27.1.3.2) is higher than the docker image version (27.1.2.1) and downgrading is not supported.
Are you sure you have pulled the newest image version?

And I will be left with the pod crashlooping forever until I manually fix it using occ commands.

@4censord

And I will be left with the pod crashlooping forever until I manually fix it using occ commands.

How exactly are you fixing this?
I am left on 25.0.1 currently, and have yet to actually complete an upgrade with the helm deployment.

@budimanjojo

@4censord I use helm rollback nextcloud <revision> to get it back to the helm revision where the upgrade failed and get rid of the nextcloud error. Then exec into the pod and do php occ upgrade then php occ maintenance:mode --off
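
Spelled out, that manual recovery looks roughly like this (release name, namespace, revision and running occ as the www-data user are placeholders/assumptions, adjust to your setup):

# roll back to the helm revision of the failed upgrade
helm rollback nextcloud <revision> --namespace nextcloud
# run the upgrade and leave maintenance mode from inside the pod
kubectl exec --namespace nextcloud deploy/nextcloud -- su -s /bin/sh -c "php occ upgrade" www-data
kubectl exec --namespace nextcloud deploy/nextcloud -- su -s /bin/sh -c "php occ maintenance:mode --off" www-data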

@4censord

@budimanjojo Then I seem to have a different issue; that does not work for me. I have to roll back to a backup after attempting an upgrade.

@4censord commented Nov 5, 2023

@provokateurin Would you mind taking a look at this once it's convenient?
IMO it's ready.

@jessebot (Collaborator) commented Jun 9, 2024

Sorry for the delay. Slowly making my rounds 🙏

@remram44 I did a quick look over and it seems ok, but don't we still want to be able to set upgrade in the values.yaml?

Also, there still needs to be a rebase, as there's conflicts.

@jessebot (Collaborator) commented Jun 9, 2024

Also, tagged @provokateurin to get an additional review to ask about keeping the upgrade parameter in the values.yaml.

@provokateurin (Member)

I'm not 100% sure if I read https://github.com/nextcloud/docker/blob/master/README.md correctly, but it sounds like the "normal" container will still try to do the upgrade as the default command is used there. Then we have a race condition between the two containers and depending on which one gets to work first the upgrade is killed by the probes or not.

@4censord commented Jun 9, 2024

IIRC the upgrade env var is used for triggering the Nextcloud image's upgrade mechanism. But because this PR completely supersedes the internal upgrade-on-startup behavior by running the upgrade explicitly beforehand, the upgrade env var does not do anything any more.

Also, there still needs to be a rebase, as there's conflicts.

Yeah, back in 2023 when we said it's ready, there weren't any.

@4censord commented Jun 9, 2024

Then we have a race condition between the two containers and depending on which one gets to work first the upgrade is killed by the probes or not.

No, the init container runs first, and the normal container only gets started once the upgrade completes.
If it had the upgrade var set, it just would not do anything because it's already on the latest version.

@provokateurin (Member)

You're right, I overlooked that it is an initContainer and not another sidecar. Then this makes sense to me, I'll have to give it some testing though to confirm it works as intended.

Comment on lines 320 to 341
{{- if .Values.nextcloud.securityContext}}
securityContext:
{{- with .Values.nextcloud.securityContext }}
Member

Could just be a single with instead of if+with

Author

The initContainer is copied from the container with only required changes, I am not cleaning up the existing container at this time. It would only make this patch harder to review.

charts/nextcloud/templates/deployment.yaml (outdated review thread, resolved)
@provokateurin (Member)

If you could do a rebase onto the latest main state then I'll give it a test.

@provokateurin (Member)

Ah yeah I didn't look who submitted the PR and thought it was you 🙈

@remram44 (Author) commented Jun 9, 2024

I rebased.

I am not sure how this interacts with #466/#525, I moved running the hook to the initContainer only. @wrenix should probably weigh in.

@wrenix (Collaborator) commented Jun 9, 2024

In my opinion, regarding the hooks: it is a better solution to move them to the initContainer (that should work).

@4censord commented Jun 14, 2024

I'd say we should have multiple init-containers run in order.
That would make them both easier, and would help with debugging.
I'd have them in this order:

  1. wait for DB to be up
  2. Upgrade Nextcloud version
  3. install and update apps
  4. disable maintenance mode, possibly more checks.

While that adds complexity in more containers, and will increase startup time per pod, IMHO it's then easier than writing complex scripts for doing everything within one container.
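
As a rough sketch of that ordering (names, images, commands and the DB_HOST variable are purely illustrative, not a tested configuration):

initContainers:
  - name: wait-for-db
    image: postgres:16-alpine        # any image with a DB client would do
    command: ["sh", "-c", "until pg_isready -h $DB_HOST; do sleep 2; done"]
  - name: upgrade-nextcloud
    image: nextcloud:27.1.3-apache
    args: ["true"]                   # exits once the entrypoint has finished the upgrade
    env:
      - name: NEXTCLOUD_UPDATE
        value: "1"
  - name: update-apps
    image: nextcloud:27.1.3-apache
    command: ["sh", "-c", "php occ app:update --all && php occ maintenance:mode --off"]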

@remram44 (Author)

I'm not entirely sure what the advantage is. Multiple init containers that run the exact same image... that also means they will have to release/acquire a lock multiple times.

I don't see what you mean by "decrease startup time per pod". At least as much work has to happen, so that doesn't seem true?

@4censord commented Jun 15, 2024

decrease startup time per pod

This was meant to say "increase", fixed now.

Every container that starts takes a few seconds until it's ready to run its scripts.
So if you run more containers sequentially, you have that delay multiple times until the container can check that it does not need to do anything.

Personally, I just would have done multiple containers, because that feels easier to do, rather than having to chain things in the same container.
But it's not a problem for me, and I won't argue against doing it another way.

@remram44 (Author)

Bump. Will this get merged any time soon? Will you ask me to rebase and ignore it again?

@jessebot @provokateurin please let me know if there is anything else you expect before merging, thanks

@budimanjojo

It's unreal how long this PR has been ignored, and this is a real issue with the chart.

@provokateurin (Member)

Please keep in mind this chart is only maintained by volunteers and nobody is paid for this.

@budimanjojo

@provokateurin I understand. But so is the PR author, who has been very cooperative with the rebase requests and then keeps getting ignored afterwards (multiple times). This change is very small and you should just take a quick look, and decide whether this is a good addition or not and proceed to test if you accept this change or reject and close the PR otherwise. This should take like one or two hours and not years.

@jessebot (Collaborator) commented Jul 22, 2024

This change is very small and you should just take a quick look, and decide whether this is a good addition or not and proceed to test if you accept this change or reject and close the PR otherwise. This should take like one or two hours and not years.

This changes the way updates work. It's unfortunately not as small as you may think, and because this is maintained by volunteers, it's when we have time to take a look at it.

Bump. Will this get merged any time soon? Will you ask me to rebase and ignore it again?
... please let me know if there is anything else you expect before merging, thanks

@remram44 Please also bump the helm chart version a major version, as this removes an option and defaults to allowing updates. In the future, please check the checks at the bottom of the PR to see if there are any default checks we'll come back and ask you to change; if a PR's checks are not passing, we cannot merge it, according to the greater Nextcloud org rules. You can also find the contributing guidelines here.
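
For reference, that bump is just the version field in charts/nextcloud/Chart.yaml (the numbers below are illustrative; the bump later made in this PR was 6.0.0):

apiVersion: v2
name: nextcloud
version: 6.0.0   # major bump: a values option is removed and the default update behavior changes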

[Screenshot of the checks at the bottom of the PR, showing that the chart-linting GitHub workflow has failed and that merging is blocked]

EDIT: I just tried to check this PR in an incognito window while logged out of GitHub and it didn't show the bottom checks section, but it does still show each commit, and if you see a little ❌ beside a commit, you should be able to click it and it will show you which check has failed. Sorry about that.

@provokateurin should we also add a note in the README that with this change, this chart will not auto-update anytime the tag is updated, so if you don't want that, you should manually specify the image.tag? 🤔 If not, I'm happy to merge this when the Chart.yaml has its version bumped. Also, I can submit that PR for the doc change. It doesn't have to be in this PR.

@jessebot (Collaborator) left a comment

Requesting changes comment: Needs Version bumped in Chart.yaml.

@provokateurin (Member)

this chart will not auto-update anytime the tag is updated, so if you don't want that, you should manually specify the image.tag?

I didn't dive deep into this PR, but to my understanding it doesn't change this behavior? It only changes how the update is performed, or am I missing something?

@budimanjojo

This PR only moves the update process to an initContainer instead of the main container that has probes. Those probes kill the container whenever the upgrade process takes too long, which causes GitOps tools like FluxCD and ArgoCD to roll back the chart because of the failed probes. And Nextcloud will refuse to start even after rolling back, because of the version mismatch left behind when the container is killed mid-upgrade.

@budimanjojo

This changes the way updates work. It's unfortunately not as small as you may think, and because this is maintained by volunteers, it's when we have time to take a look at it.

No, the change doesn't change the way updates work, it just moves that job to another container instead. And I apologize for being such a jerk in the complaint, but I have been dealing with this problem for too long.

@remram44 (Author)

Bumped to 6.0.0

@remram44 (Author)

@jessebot I can see the checks at the bottom, however they do not run until the workflow is approved:

1 workflow awaiting approval
This workflow requires approval from a maintainer.
3 expected and 1 successful checks

I am not going to check this page every day to see if maybe they got approved and there are results to see...

@jessebot (Collaborator) commented Jul 23, 2024

I didn't dive deep into this PR, but to my understanding it doesn't change this behavior? It only changes how the update is performed, or am I missing something?

It removes the option to set the update flag and always sets it going forward, but perhaps I've misunderstood. @provokateurin could you please keep helping here?

@jessebot dismissed their stale review July 23, 2024 07:09

no longer handling this

This should not have been exposed. There is no way to use it, since you
can't pass a custom command to the container.

Signed-off-by: Remi Rampin <[email protected]>

@remram44 (Author)

The NEXTCLOUD_UPDATE variable is actually a little tricky. It does not enable/disable updating. For that reason, in my view, it should never have been exposed by the chart.

The update always happens automatically, unless you run the container with a custom command (not the case of this chart), in which case it won't update unless NEXTCLOUD_UPDATE=1.
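
To illustrate the entrypoint behavior described here (image tag and flags are only examples):

# Default command: the entrypoint installs/upgrades automatically.
docker run -d nextcloud:27-apache

# Custom command: install/upgrade is skipped unless NEXTCLOUD_UPDATE=1 is set,
# which is exactly the situation of this chart's init container (args: ["true"]).
docker run -d -e NEXTCLOUD_UPDATE=1 nextcloud:27-apache true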

Successfully merging this pull request may close these issues.

Liveness probe kills pod while upgrading