Multiple/many parallel jobs lead to "random" failures #490

mapk-amazon · 2024-07-23T19:27:13Z

Setup

The setup is deployed on AWS on EKS:

Version k8s: 1.28
Version Helm Chart: v5.9.0

Issue

Galaxy "usually" deploys jobs just fine. We started importing with Batch files into Galaxy and experience random failures of pods.

Logs

galaxy.jobs.runners.kubernetes ERROR 2024-07-23 15:35:40,109 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] No Jobs are available under expected selector app=gxy-galaxy-g674v
galaxy.jobs.runners.kubernetes ERROR 2024-07-23 15:35:40,120 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] No Jobs are available under expected selector app=gxy-galaxy-zpgqx
galaxy.jobs.runners.kubernetes ERROR 2024-07-23 15:35:40,130 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] No Jobs are available under expected selector app=gxy-galaxy-7kl4g
galaxy.jobs.runners.kubernetes ERROR 2024-07-23 15:35:40,159 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] No Jobs are available under expected selector app=gxy-galaxy-g9ts6

and

requests.exceptions.HTTPError: 409 Client Error: Conflict for url: [https://172.20.0.1:443/apis/batch/v1/namespaces/galaxy/jobs/gxy-galaxy-f4b62](https://172.20.0.1/apis/batch/v1/namespaces/galaxy/jobs/gxy-galaxy-f4b62)
pykube.exceptions.HTTPError: Operation cannot be fulfilled on jobs.batch "gxy-galaxy-f4b62": the object has been modified; please apply your changes to the latest version and try again

In the k8s log we also see that the pods was launched around the time:

gxy-galaxy-f4b62-95mlz               0/1     ContainerCreating   0          1s
gxy-galaxy-f4b62-95mlz               1/1     Running             0          4s
gxy-galaxy-f4b62-95mlz               0/1     Completed           0          7s

Ideas/Hypothesis

Current ideas are that the hash (e.g. f4b62) has a collision and leads to resource conflicts for the pods and to failures of some jobs.

Does the team has any experience with it? Any fixes? Thank you :)

The text was updated successfully, but these errors were encountered:

ksuderman · 2024-07-24T02:15:12Z

We recently received a similar report and I originally thought it may be related to Kubernetes 1.30 and the pykube-ng version we use. However, you are using 1.28 and I have been unable to recreate the problem. The one common thread is EKS. I will investigate that next.

See galaxyproject/galaxy#18567

mapk-amazon · 2024-07-29T11:24:28Z

I can test with various EKS versions, however, I am not sure how to build a minimal example withpykube-ng. If you have a snippet to produce a similar effect to Galaxy job scheduling, I can test it and report back :)

nuwang · 2024-07-29T13:05:43Z

Is there a stack trace? Or can the verbosity level be increased to produce one? If not, I think we have a problem with the error being inadequately logged, and we need to figure out which line of code is generating the exception.

Most likely, this is caused by a race condition between k8s modifying the job status, and the runner attempting to read and modify the manifest itself. As mentioned earlier, the resultant hash collision of resourceVersion would cause this conflict. So if we re-queue the current task whenever this error is encountered, the runner thread should eventually fetch the latest version, and succeed I would expect.

mapk-amazon · 2024-07-29T13:49:50Z

This is "the most" detailed log I get:

galaxy-job-0 galaxy.jobs.runners.kubernetes ERROR 2024-07-29 13:46:58,387 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Could not clean up k8s batch job. Ignoring...                                 │
│ galaxy-job-0 Traceback (most recent call last):                                                                                                                                                                   │
│ galaxy-job-0   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 403, in raise_for_status                                                                                             │
│ galaxy-job-0     resp.raise_for_status()                                                                                                                                                                          │
│ galaxy-job-0   File "/galaxy/server/.venv/lib/python3.12/site-packages/requests/models.py", line 1021, in raise_for_status                                                                                        │
│ galaxy-job-0     raise HTTPError(http_error_msg, response=self)                                                                                                                                                   │
│ galaxy-job-0 requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://172.20.0.1:443/apis/batch/v1/namespaces/galaxy/jobs/gxy-galaxy-4db2n                                                      │
│ galaxy-job-0                                                                                                                                                                                                      │
│ galaxy-job-0 During handling of the above exception, another exception occurred:                                                                                                                                  │
│ galaxy-job-0                                                                                                                                                                                                      │
│ galaxy-job-0 Traceback (most recent call last):                                                                                                                                                                   │
│ galaxy-job-0   File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 872, in _handle_job_failure                                                                                                      │
│ galaxy-job-0     self.__cleanup_k8s_job(job)                                                                                                                                                                      │
│ galaxy-job-0   File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 879, in __cleanup_k8s_job                                                                                                        │
│ galaxy-job-0     delete_job(job, k8s_cleanup_job)                                                                                                                                                                 │
│ galaxy-job-0   File "/galaxy/server/lib/galaxy/jobs/runners/util/pykube_util.py", line 108, in delete_job                                                                                                         │
│ galaxy-job-0     job.scale(replicas=0)                                                                                                                                                                            │
│ galaxy-job-0   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/mixins.py", line 31, in scale                                                                                                       │
│ galaxy-job-0     self.update()                                                                                                                                                                                    │
│ galaxy-job-0   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 165, in update                                                                                                    │
│ galaxy-job-0     self.patch(self.obj, subresource=subresource)                                                                                                                                                    │
│ galaxy-job-0   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 157, in patch                                                                                                     │
│ galaxy-job-0     self.api.raise_for_status(r)                                                                                                                                                                     │
│ galaxy-job-0   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 410, in raise_for_status                                                                                             │
│ galaxy-job-0     raise HTTPError(resp.status_code, payload["message"])                                                                                                                                            │
│ galaxy-job-0 pykube.exceptions.HTTPError: Operation cannot be fulfilled on jobs.batch "gxy-galaxy-4db2n": the object has been modified; please apply your changes to the latest version and try again

nuwang · 2024-07-30T13:02:51Z

Thanks. That helps with narrowing things down.

ksuderman · 2024-07-30T16:25:24Z

To change/update the pykube-ng version requires building a new galaxy-min docker image. I have limited internet connectivity at the moment so it is not easy for me to build and push a new image right now, but I'll try to get that done in the next few days.

mapk-amazon · 2024-08-06T07:26:57Z

How do you build the galaxy-min docker image?

Is it building this as-is, or is there a "min" configuration somewhere?

nuwang · 2024-08-06T16:20:03Z

@mapk-amazon That's the right image. Building it as is will do the job. If you'd like to test the changes, please try this branch: galaxyproject/galaxy#18514
This has some fixes, including the pykube upgrade that may solve this issue.

almahmoud · 2024-08-06T16:23:51Z

Fwiw @mapk-amazon , you can also use ghcr.io/bioconductor/galaxy:dev which is the built image from that PR.

mapk-amazon · 2024-08-06T18:16:30Z

Thank you all. I used ghcr.io/bioconductor/galaxy:dev, otherwise the same setup as in the start. I uploaded 100x 1MB files with random content. It failed for 2 with the same error:

│ galaxy.jobs.runners.kubernetes ERROR 2024-08-06 18:06:32,493 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Could not clean up k8s batch job. Ignoring...                                                                                            │
│ Traceback (most recent call last):                                                                                                                                                                                                                              │
│   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 437, in raise_for_status                                                                                                                                                        │
│     resp.raise_for_status()                                                                                                                                                                                                                                     │
│   File "/galaxy/server/.venv/lib/python3.12/site-packages/requests/models.py", line 1024, in raise_for_status                                                                                                                                                   │
│     raise HTTPError(http_error_msg, response=self)                                                                                                                                                                                                              │
│ requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://172.20.0.1:443/apis/batch/v1/namespaces/galaxy/jobs/gxy-galaxy-vnjqk                                                                                                                 │
│                                                                                                                                                                                                                                                                 │
│ During handling of the above exception, another exception occurred:                                                                                                                                                                                             │
│                                                                                                                                                                                                                                                                 │
│ Traceback (most recent call last):                                                                                                                                                                                                                              │
│   File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 912, in _handle_job_failure                                                                                                                                                                 │
│     self.__cleanup_k8s_job(job)                                                                                                                                                                                                                                 │
│   File "/galaxy/server/lib/galaxy/jobs/runners/kubernetes.py", line 919, in __cleanup_k8s_job                                                                                                                                                                   │
│     delete_job(job, k8s_cleanup_job)                                                                                                                                                                                                                            │
│   File "/galaxy/server/lib/galaxy/jobs/runners/util/pykube_util.py", line 115, in delete_job                                                                                                                                                                    │
│     job.scale(replicas=0)                                                                                                                                                                                                                                       │
│   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/mixins.py", line 30, in scale                                                                                                                                                                  │
│     self.update()                                                                                                                                                                                                                                               │
│   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 165, in update                                                                                                                                                               │
│     self.patch(self.obj, subresource=subresource)                                                                                                                                                                                                               │
│   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/objects.py", line 157, in patch                                                                                                                                                                │
│     self.api.raise_for_status(r)                                                                                                                                                                                                                                │
│   File "/galaxy/server/.venv/lib/python3.12/site-packages/pykube/http.py", line 444, in raise_for_status                                                                                                                                                        │
│     raise HTTPError(resp.status_code, payload["message"])

ksuderman · 2024-08-16T15:47:00Z

Thanks @mapk-amazon, it sure looks like a race condition. How did you upload the 100 files? Through the UI, API, or other means (bioblend etc)?

pcm32 · 2024-08-16T16:15:19Z

While this is shown as an error in the logs, I think that the behaviour of the code is harmless. Do you actually see the failure in the UI? That is why we added that "ignoring" part there.

…

On Fri, 16 Aug 2024, 16:47 Keith Suderman, ***@***.***> wrote: Thanks @mapk-amazon <https://github.com/mapk-amazon>, it sure looks like a race condition. How did you upload the 100 files? Through the UI, API, or other means (bioblend etc)? — Reply to this email directly, view it on GitHub <#490 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACZ6XXXMY4HYX54256ARN3ZRYNIXAVCNFSM6AAAAABLLBZ3V2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOJTG42DMMJYHE> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

pcm32 · 2024-08-16T16:17:03Z

When running hundreds of jobs, you're are always bound to get some arbitrary errors, we mitigate that in our use of the setup with aggressive resubmission policies.

…

On Fri, 16 Aug 2024, 17:15 Pablo Moreno, ***@***.***> wrote: While this is shown as an error in the logs, I think that the behaviour of the code is harmless. Do you actually see the failure in the UI? That is why we added that "ignoring" part there. On Fri, 16 Aug 2024, 16:47 Keith Suderman, ***@***.***> wrote: > Thanks @mapk-amazon <https://github.com/mapk-amazon>, it sure looks like > a race condition. How did you upload the 100 files? Through the UI, API, or > other means (bioblend etc)? > > — > Reply to this email directly, view it on GitHub > <#490 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AACZ6XXXMY4HYX54256ARN3ZRYNIXAVCNFSM6AAAAABLLBZ3V2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOJTG42DMMJYHE> > . > You are receiving this because you are subscribed to this thread.Message > ID: ***@***.***> >

mapk-amazon · 2024-08-16T16:19:50Z

Thank you for your input!

@ksuderman I use the webinterface. I can try the API if you think it makes a difference.
@pcm32 Yes, the job fails. It looks like this on the UI then.

pcm32 · 2024-08-16T16:45:35Z

But yes, I do see this error every now and then in our logs, maybe I don't see it in the UI as an error due to the resubmissions.

ksuderman · 2024-08-16T17:35:11Z

When running hundreds of jobs, you're are always bound to get some arbitrary errors, we mitigate that in our use of the setup with aggressive resubmission policies.

True, but we are getting reports of the 409 Client Error errors from other users even with only a handful of jobs, but I've never been able to recreate the error myself. I do get occasional failures when running lots of jobs, but I don't recall them being a 409. I am hoping to find a common underlying cause

@mapk-amazon no need to try the API, I just want to make sure I am using the same procedure when I try to recreate the problem..

mapk-amazon · 2024-10-14T20:31:31Z

Update : I believe I know now what is happening. In my understanding the aggressive "retries" are the root cause of the issues.

The job pod (the one scheduling the pods) shows for failing pods, that "Galaxy" receives twice the information about the pod.

DEBUG 2024-10-14 20:12:36,480 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Job id: gxy-galaxy-dkpc5 with k8s id: gxy-galaxy-dkpc5 succeeded
DEBUG 2024-10-14 20:12:38,484 [pN:job_handler_0,p:8,tN:KubernetesRunner.monitor_thread] Job id: gxy-galaxy-dkpc5 with k8s id: gxy-galaxy-dkpc5 succeeded

Then it starts cleaning (twice) and one fails, as the other one already deleted/starting deletion. Finally, it shows tool_stdout and tool_stderr twice:

DEBUG 2024-10-14 20:12:54,185 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) tool_stdout: 
DEBUG 2024-10-14 20:12:54,186 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) job_stdout:

DEBUG 2024-10-14 20:12:54,186 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) tool_stderr: 
DEBUG 2024-10-14 20:12:54,186 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (3464/gxy-galaxy-dkpc5) job_stderr: Job output not returned from cluster

It seems the first job moved the data already and the second did no longer found the file.

The result is a technically successful job (as the container finished), the results were processed successfully once, and the second iteration (the later one) responds with an error and Galaxy believes the job fails.

mapk-amazon · 2024-10-15T15:50:06Z

Update 2: I believe I was wrong (yet again). Please take a look at the PR galaxyproject/galaxy#19001 :)

afgane mentioned this issue Aug 27, 2024

Add a retry when deleting a k8s job galaxyproject/galaxy#18744

Merged

2 tasks

mapk-amazon mentioned this issue Oct 15, 2024

Fixes random job failures in kubernetes galaxyproject/galaxy#19001

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple/many parallel jobs lead to "random" failures #490

Multiple/many parallel jobs lead to "random" failures #490

mapk-amazon commented Jul 23, 2024

ksuderman commented Jul 24, 2024

mapk-amazon commented Jul 29, 2024

nuwang commented Jul 29, 2024

mapk-amazon commented Jul 29, 2024

nuwang commented Jul 30, 2024

ksuderman commented Jul 30, 2024

mapk-amazon commented Aug 6, 2024

nuwang commented Aug 6, 2024

almahmoud commented Aug 6, 2024

mapk-amazon commented Aug 6, 2024

ksuderman commented Aug 16, 2024

pcm32 commented Aug 16, 2024 via email

pcm32 commented Aug 16, 2024 via email

mapk-amazon commented Aug 16, 2024

pcm32 commented Aug 16, 2024

ksuderman commented Aug 16, 2024

mapk-amazon commented Oct 14, 2024

mapk-amazon commented Oct 15, 2024

Multiple/many parallel jobs lead to "random" failures #490

Multiple/many parallel jobs lead to "random" failures #490

Comments

mapk-amazon commented Jul 23, 2024

Setup

Issue

Logs

Ideas/Hypothesis

ksuderman commented Jul 24, 2024

mapk-amazon commented Jul 29, 2024

nuwang commented Jul 29, 2024

mapk-amazon commented Jul 29, 2024

nuwang commented Jul 30, 2024

ksuderman commented Jul 30, 2024

mapk-amazon commented Aug 6, 2024

nuwang commented Aug 6, 2024

almahmoud commented Aug 6, 2024

mapk-amazon commented Aug 6, 2024

ksuderman commented Aug 16, 2024

pcm32 commented Aug 16, 2024 via email

pcm32 commented Aug 16, 2024 via email

mapk-amazon commented Aug 16, 2024

pcm32 commented Aug 16, 2024

ksuderman commented Aug 16, 2024

mapk-amazon commented Oct 14, 2024

mapk-amazon commented Oct 15, 2024