Startup time for galaxy web container can take very long #391
Comments
This is an unfortunate known issue, although I just now realized we never documented it in this repo. And while we are experiencing the same slow-start scenario, we've not figured out the root cause yet. One of the hypotheses is that s3fs or S3-CSI is doing a GET for each reference file (or cache validation) and hence the startup takes so long. We've tried mounting the ref data bucket with additional mount options to see if that might affect the Galaxy startup performance, but it has not.

Without doing a deeper dive into s3fs, I don't think we'll get to the bottom of this. An alternative to s3fs is to go back to CVMFS-CSI, but that means we'll have to update it to work with K8s 1.21+, which means we're basically taking on maintenance of that project, since it seems unmaintained upstream.

This could possibly be partially relieved within Galaxy startup, because it seems Galaxy is waiting on something from the reference folder (possibly inspecting all the files there). If you have any other ideas or make any discoveries, please share.
Do you see it stopping at the same places as I see it, or does it vary for you? Thanks for the insights @afgane.
You are using s3fs for tools and reference data, I'm guessing? Anything else?
The job and workflow pods normally make it after one or two restarts, but today web is not going through even after 4 or 5 restarts. Do you alter the current defaults for readiness / liveness to allow web to go through? I think it is killing it every 8 minutes or so.
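For reference, the knobs involved are the standard Kubernetes probe fields. A minimal sketch of a more forgiving configuration follows; the endpoint path, port, and thresholds are assumptions, and where this plugs into the chart's values depends on the chart version:

```yaml
# Sketch only: standard Kubernetes probe fields with a longer startup window.
# The /galaxy/api/version path and port 8080 are assumed, not taken from the chart.
startupProbe:
  httpGet:
    path: /galaxy/api/version
    port: 8080
  periodSeconds: 30
  failureThreshold: 40        # allow ~20 minutes before the kubelet restarts the container
readinessProbe:
  httpGet:
    path: /galaxy/api/version
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
livenessProbe:
  httpGet:
    path: /galaxy/api/version
    port: 8080
  periodSeconds: 30
  failureThreshold: 10
```

With a startupProbe in place, the liveness probe only starts counting once the container has come up, which avoids the kill-every-~8-minutes cycle during a slow first start.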
Seen from the inside, the web container seems completely idle, not even I/O waits.
There is always at least one web, job or workflow container that systematically doesn't satisfy its probes and eventually gets killed with exit code 143 (nothing in the Galaxy logs shows an error). If I keep doing helm upgrades to the running instance, then as the older pods go away, the intermediate one (which was trying to make it through) succeeds, but then one of the newer-revision pods starts to struggle. Could it be a race condition for some resource?
Have you tried changing the S3 endpoint to something geographically closer? I see it is set by default to Asia Pacific (galaxy-helm/galaxy/values.yaml, line 324 at e5d9830).
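For context, the endpoint and caching behaviour are controlled through s3fs-fuse mount options passed via the CSI storage class. A rough sketch of such an override follows; the s3csi.storageClass key path is an assumption about the chart layout, and pointing at eu-west-2 only makes sense if a copy of the data is actually hosted there:

```yaml
# Illustrative values override; the exact key path for the CSI S3 mount options
# varies between chart versions (see galaxy/values.yaml around line 324).
s3csi:
  storageClass:
    # s3fs-fuse options: read-only mount, a local stat/data cache, and an
    # endpoint in a closer region. Only useful if the reference data is
    # mirrored to a bucket in that region.
    mountOptions: "ro,allow_other,endpoint=eu-west-2,url=https://s3.eu-west-2.amazonaws.com,use_cache=/tmp/s3fs-cache,stat_cache_expire=900"
```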
Also, when bringing it down, the CSI S3 PV doesn't come down cleanly; it stays in Terminating for a very long time.
So with eu-west-2 on the mount options and secret, I get:
The bucket with reference data is available only in ap-southeast, so it won't work with any other endpoint. And it's only the ref data that is being fetched from the bucket; tool definitions come from a different (Google) bucket (via an Action defined here: https://github.com/anvilproject/cvmfs-cloud-clone).

Re. the data locality: we've launched instances in the same region as the data, but with no observable difference in startup time.

There is a reason the S3 CSI PV may not be coming down cleanly on delete as well.

And re. the places where the Galaxy pods pause: it's the same spot as you captured. What you describe is what we've experienced as well; after about the 3rd restart all the Galaxy pods come up.
Are you still seeing this? One or two restarts are expected (not ideal, but we see the same thing), but something else is going wrong if the web handler doesn't come up after a few restarts. Can you tell if the initContainers have completed? ~8 minutes sounds about right for the readiness probe to time out.
I left it all night and the web container kept restarting. Then I turned off s3csi (I don't really need reference data on that instance). I will give it another go and report back.
But yes, the init containers had finished. The only error I could find was the 134 code you could see in the container part of the kubectl describe, but I suspect that comes from the SIGTERM that the probe must be triggering. Thanks for taking the time to go through this, guys.
I'm seeing the need for 2 restarts on an installation (and upgrades) that has no S3 CSI usage... always stopping on the sqlite reads. So maybe this is being produced by something else?
Could it be something performance-related from opening sqlite databases on shared file systems? Are those used concurrently by all web, job and workflow handlers? Or could we chuck them onto local filesystems individually instead?
@pcm32 You're using NFS Ganesha, right? To test this hypothesis, could you try mapping in a local host mount for
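A minimal sketch of what such a test override might look like, assuming the chart's extraVolumes/extraVolumeMounts values; the directory holding the sqlite files isn't named in the thread, so both paths below are placeholders:

```yaml
# Hypothetical test: back the sqlite-holding directory with node-local storage
# instead of the shared NFS volume. Paths are placeholders.
extraVolumes:
  - name: local-sqlite
    hostPath:
      path: /tmp/galaxy-local-sqlite     # node-local scratch directory
      type: DirectoryOrCreate
extraVolumeMounts:
  - name: local-sqlite
    mountPath: /galaxy/server/database/cache   # placeholder for wherever the sqlite files live
```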
Sure, will give it a try and report back. Yes, using NFS Ganesha.
@pcm32 This PR: #396 should solve the container startup speed issue, but it's awaiting a merge of CloudVE/galaxy-cvmfs-csi-helm#16.
I notice now that I haven't tried this yet. This would require a different mount per container though (web, workflow and job at least). Can this be expressed in extraVolumeMounts? So far I have only seen shared file systems used here.
Yes, it should be possible to use an extraVolumeMounts entry for that.
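A sketch of how that could look with an emptyDir, which is created per pod, so the web, job and workflow handlers each get an independent copy without separate per-deployment settings; key names and the mount path are assumptions:

```yaml
# Sketch: per-pod scratch space via emptyDir; each handler pod gets its own copy.
extraVolumes:
  - name: sqlite-scratch
    emptyDir: {}
extraVolumeMounts:
  - name: sqlite-scratch
    mountPath: /galaxy/server/database/cache   # placeholder path
```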
First, let me say how great this setup has become; you have done a fantastic job, guys!
On a more or less unmodified setup, I often observe that the galaxy web, job and workflow containers can take quite a while (more than 8 minutes or so) to start. Looking at the logs, I often see them stuck in a couple of places:
to the point where the container is killed as it runs out of the time given by the probes, I presume.
Would this start be improved if fewer tools were installed? Or could this be an issue of a slow shared file system? Or should more RAM be given to the container? Do you see this as well? Thanks!
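On the RAM question, requests and limits are expressed with the standard Kubernetes resources fields; where they sit in the chart's values depends on the version, so treat this as a generic sketch with arbitrary example numbers rather than the chart's actual schema:

```yaml
# Generic Kubernetes resources block; the figures are illustrative only.
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi
```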