Startup time for galaxy web container can take very long #391
Comments
This is an unfortunate known issue, although I just now realized we never documented it in this repo. And while we are experiencing the same slow-start scenario, we've not figured out the root cause yet. One of the hypotheses is that s3fs or S3-CSI is doing a GET for each reference file (or cache validation) and hence the startup takes so long. We've tried mounting the ref data bucket with additional mount options to see if that might affect the Galaxy startup performance, but it has not.

Without doing a deeper dive into s3fs, I don't think we'll get to the bottom of this. An alternative to s3fs is to go back to CVMFS-CSI, but that means we'll have to update it to work with K8s 1.21+, which means we're basically taking on maintenance of that project, since it seems unmaintained upstream.

This could possibly be partially relieved within Galaxy startup, because it seems Galaxy is waiting on something from the reference folder (possibly inspecting all the files there). If you have any other ideas or make any discoveries, please share.
Do you see it stopping at the same places as I see it, or does it vary for you? Thanks for the insights @afgane.
You are using s3fs for tools and reference data, I'm guessing? Anything else?
The job and workflow pods normally make it after one or two restarts, but today web is not going through even after 4 or 5 restarts. Do you alter the current defaults for readiness / liveness to allow web to go through? I think it is killing it every 8 minutes or so.
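For reference, the knobs involved are the standard Kubernetes probe fields. A minimal sketch of a more forgiving configuration follows; the endpoint path, port, and thresholds are assumptions, and where this plugs into the chart's values depends on the chart version:

```yaml
# Sketch only: standard Kubernetes probe fields with a longer startup window.
# The /galaxy/api/version path and port 8080 are assumed, not taken from the chart.
startupProbe:
  httpGet:
    path: /galaxy/api/version
    port: 8080
  periodSeconds: 30
  failureThreshold: 40        # allow ~20 minutes before the kubelet restarts the container
readinessProbe:
  httpGet:
    path: /galaxy/api/version
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
livenessProbe:
  httpGet:
    path: /galaxy/api/version
    port: 8080
  periodSeconds: 30
  failureThreshold: 10
```

With a startupProbe in place, the liveness probe only starts counting once the container has come up, which avoids the kill-every-~8-minutes cycle during a slow first start.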
Seen from the inside, the web container seems completely idle, not even I/O waits.
There is always at least one web, job or workflow container that systematically doesn't satisfy its probes and eventually gets killed with exit code 143 (nothing in the Galaxy logs shows an error). If I keep doing helm upgrades to the running instance, then as the older pods go away, the intermediate one (which was trying to make it through) succeeds, but then one of the newer-revision pods starts to struggle. Could it be a race condition for some resource?
Have you tried changing the S3 endpoint to something geographically closer? I see it is set by default to Asia Pacific (galaxy-helm/galaxy/values.yaml, line 324 at e5d9830).
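For context, the endpoint and caching behaviour are controlled through s3fs-fuse mount options passed via the CSI storage class. A rough sketch of such an override follows; the s3csi.storageClass key path is an assumption about the chart layout, and pointing at eu-west-2 only makes sense if a copy of the data is actually hosted there:

```yaml
# Illustrative values override; the exact key path for the CSI S3 mount options
# varies between chart versions (see galaxy/values.yaml around line 324).
s3csi:
  storageClass:
    # s3fs-fuse options: read-only mount, a local stat/data cache, and an
    # endpoint in a closer region. Only useful if the reference data is
    # mirrored to a bucket in that region.
    mountOptions: "ro,allow_other,endpoint=eu-west-2,url=https://s3.eu-west-2.amazonaws.com,use_cache=/tmp/s3fs-cache,stat_cache_expire=900"
```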
Also, when bringing it down, the CSI S3 PV doesn't come down cleanly; it stays in Terminating for a very long time.
So with eu-west-2 on the mount options and secret, I get:
The bucket with reference data is available only in ap-southeast, so it won't work with any other endpoint. And it's only the ref data that is being fetched from the bucket; tool definitions come from a different (Google) bucket (via an Action defined here: https://github.com/anvilproject/cvmfs-cloud-clone).

Re. the data locality: we've launched instances in the same region as the data, but with no observable difference in startup time.

There is a reason the S3 CSI PV may not be coming down cleanly on delete as well.

And re. the places where the Galaxy pods pause: it's the same spot as you captured. What you describe is what we've experienced as well; after about the 3rd restart all the Galaxy pods come up.
Are you still seeing this? One or two restarts are expected (not ideal, but we see the same thing), but something else is going wrong if the web handler doesn't come up after a few restarts. Can you tell if the initContainers have completed? ~8 minutes sounds about right for the readiness probe to time out.
I left it all night and the web container kept restarting. Then I turned off s3csi (I don't really need reference data on that instance). I will give it another go and report back.
But yes, the init containers had finished. The only error I could find was the 134 code you could see in the container part of the kubectl describe, but I suspect that comes from the SIGTERM that the probe must be triggering. Thanks for taking the time to go through this, guys.
I'm seeing the need for 2 restarts on an installation (and upgrades) that has no S3 CSI usage... always stopping on the sqlite reads. So maybe this is being produced by something else?
Could it be something performance-related from opening sqlite databases on shared file systems? Are those used concurrently by all web, job and workflow handlers? Or could we chuck them onto local filesystems individually instead?
@pcm32 You're using NFS Ganesha, right? To test this hypothesis, could you try mapping in a local host mount for
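A minimal sketch of what such a test override might look like, assuming the chart's extraVolumes/extraVolumeMounts values; the directory holding the sqlite files isn't named in the thread, so both paths below are placeholders:

```yaml
# Hypothetical test: back the sqlite-holding directory with node-local storage
# instead of the shared NFS volume. Paths are placeholders.
extraVolumes:
  - name: local-sqlite
    hostPath:
      path: /tmp/galaxy-local-sqlite     # node-local scratch directory
      type: DirectoryOrCreate
extraVolumeMounts:
  - name: local-sqlite
    mountPath: /galaxy/server/database/cache   # placeholder for wherever the sqlite files live
```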
Sure, will give it a try and report back. Yes, using NFS Ganesha.
@pcm32 This PR: #396 should solve the container startup speed issue, but it's awaiting a merge of CloudVE/galaxy-cvmfs-csi-helm#16.
I notice now that I haven't tried this yet. This would require a different mount per container though (web, workflow and job at least). Can this be expressed in extraVolumeMounts? So far I have only seen shared file systems used here.
Yes, it should be possible to use an extraVolumeMounts entry for that.
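A sketch of how that could look with an emptyDir, which is created per pod, so the web, job and workflow handlers each get an independent copy without separate per-deployment settings; key names and the mount path are assumptions:

```yaml
# Sketch: per-pod scratch space via emptyDir; each handler pod gets its own copy.
extraVolumes:
  - name: sqlite-scratch
    emptyDir: {}
extraVolumeMounts:
  - name: sqlite-scratch
    mountPath: /galaxy/server/database/cache   # placeholder path
```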
First, let me say how great this setup has become; you have done a fantastic job, guys!
On a more or less unmodified setup, I often observe that the galaxy web, job and workflow containers can take quite a while (more than 8 minutes or so) to start. Looking at the logs, I often see them stuck in a couple of places:
to the point where the container is killed as it runs out of the time given by the probes, I presume.
Would this start be improved if fewer tools were installed? Or could this be an issue of a slow shared file system? Or should more RAM be given to the container? Do you see this as well? Thanks!
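On the RAM question, requests and limits are expressed with the standard Kubernetes resources fields; where they sit in the chart's values depends on the version, so treat this as a generic sketch with arbitrary example numbers rather than the chart's actual schema:

```yaml
# Generic Kubernetes resources block; the figures are illustrative only.
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi
```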