Run out of file locks after an hour of running (around ~100 jobs) #399

pcm32 · 2022-12-15T11:08:15Z

After around an hour of running I'm starting to get all sorts of file locking errors in the main job handler:

galaxy.tool_util.deps.containers ERROR 2022-12-15 10:34:46,234 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] Could not get container description for tool 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_compute_graph/scanpy_compute_graph/1.8.1+2+galaxy0'
Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/tool_util/deps/containers.py", line 320, in find_best_container_description
    resolved_container_description = self.resolve(enabled_container_types, tool_info, **kwds)
  File "/galaxy/server/lib/galaxy/tool_util/deps/containers.py", line 351, in resolve
    container_description = container_resolver.resolve(
  File "/galaxy/server/lib/galaxy/tool_util/deps/container_resolvers/mulled.py", line 557, in resolve
    name = targets_to_mulled_name(
  File "/galaxy/server/lib/galaxy/tool_util/deps/container_resolvers/mulled.py", line 361, in targets_to_mulled_name
    tags = mulled_tags_for(namespace, target.package_name, resolution_cache=resolution_cache, session=session)
  File "/galaxy/server/lib/galaxy/tool_util/deps/mulled/util.py", line 127, in mulled_tags_for
    if not _namespace_has_repo_name(namespace, image, resolution_cache):
  File "/galaxy/server/lib/galaxy/tool_util/deps/mulled/util.py", line 104, in _namespace_has_repo_name
    cached_namespace = preferred_resolution_cache.get(cache_key)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/cache.py", line 322, in get
    return self._get_value(key, **kw).get_value()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/container.py", line 332, in get_value
    self.namespace.acquire_read_lock()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/container.py", line 206, in acquire_read_lock
    self.access_lock.acquire_read_lock()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/synchronization.py", line 154, in acquire_read_lock
    x = self.do_acquire_read_lock(wait)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/beaker/synchronization.py", line 257, in do_acquire_read_lock
    fcntl.flock(filedescriptor, fcntl.LOCK_SH)
OSError: [Errno 37] No locks available

This happens mostly when finding the container description (I wonder if that could be cached, or if this is the cache), but also when it sets some metadata fields:

galaxy.jobs.runners ERROR 2022-12-15 10:33:50,925 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-1] (711/gxy-galaxy-dev-5wclv) Job wrapper finish method failed
Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/jobs/runners/__init__.py", line 628, in _finish_or_resubmit_job
    job_wrapper.finish(
  File "/galaxy/server/lib/galaxy/jobs/__init__.py", line 1850, in finish
    self._finish_dataset(output_name, dataset, job, context, final_job_state, remote_metadata_directory)
  File "/galaxy/server/lib/galaxy/jobs/__init__.py", line 1668, in _finish_dataset
    dataset.datatype.set_meta(dataset, overwrite=False)
  File "/galaxy/server/lib/galaxy/datatypes/binary.py", line 1347, in set_meta
    with h5py.File(dataset.file_name, "r") as anndata_file:
  File "/galaxy/server/.venv/lib/python3.10/site-packages/h5py/_hl/files.py", line 533, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/h5py/_hl/files.py", line 226, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: [Errno 37] Unable to open file (unable to lock file, errno = 37, error message = 'No locks available')

I wonder if some of the elements in the database directory could be moved out of the shared file system, and whether that would reduce the number of locks requested? Or would this be an OS-wide issue in the container?

galaxy@galaxy-dev-job-0-75bc974db-zplkm:/galaxy/server/database$ du -h --max-depth=1 .
14M	./tool_search_index
1.5M	./config
5.1M	./cache
1.7G	./cvmfsclone
4.0K	./tmp
4.0K	./deps
4.0K	./object_store_cache
655M	./jobs_directory
7.7G	./shed_tools
3.8M	./tools
21G	./objects
2.1M	./tool-data
31G	.

Which of these could live in RWO volumes, for instance? Thanks!

nuwang · 2022-12-15T12:51:22Z

@pcm32 What if HDF5_USE_FILE_LOCKING=FALSE is set? https://stackoverflow.com/questions/57310333/can-we-disable-h5py-file-locking-for-python-file-like-object
Since Galaxy doesn't really have multiple writers on the same file, this should be safe?
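A minimal sketch of what that suggestion could look like from Python, assuming the variable has to be in the environment before the HDF5 library is loaded (so, to be safe, before `h5py` is imported). The helper names `disable_hdf5_file_locking` and `read_top_level_keys` are hypothetical and are not part of Galaxy or h5py; in a deployment the variable would more likely be set in the container environment than in code.

```python
import os


def disable_hdf5_file_locking(env=os.environ):
    """Set HDF5_USE_FILE_LOCKING=FALSE in the given environment mapping.

    Hypothetical helper for illustration only. Call it before h5py is
    imported so the HDF5 library sees the value when it starts up.
    """
    env["HDF5_USE_FILE_LOCKING"] = "FALSE"
    return env["HDF5_USE_FILE_LOCKING"]


def read_top_level_keys(path):
    """Open an HDF5 file read-only and list its top-level keys.

    h5py is imported lazily here so that the environment variable above
    can be set first. Requires h5py to be installed.
    """
    import h5py

    with h5py.File(path, "r") as hdf5_file:
        return list(hdf5_file.keys())
```

Usage would be to call `disable_hdf5_file_locking()` once at process start and only then open files, for example `read_top_level_keys(dataset_path)`. This only sidesteps the HDF5 lock request; it does nothing for other lock users such as beaker.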

pcm32 · 2022-12-15T14:38:13Z

Well, the HDF5 operations have been happening before on that same container runtime, so I suspect it is a general lack of locks rather than something specific to HDF5 (as you can see, beaker is also complaining). I'm now trying mounting as NFS 4.2 instead of NFS 3 to see if the locking mechanism, which is improved there, behaves better.
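One way to test the "general lack of locks" theory independently of HDF5 and beaker is to request the same kind of lock that the beaker traceback shows (`fcntl.flock` with `LOCK_SH`) on a scratch file in the directory of interest. The sketch below is only a probe under that assumption; `can_take_shared_lock` is a hypothetical name, and `ENOLCK` corresponds to errno 37 on Linux.

```python
import errno
import fcntl
import os
import tempfile


def can_take_shared_lock(directory):
    """Return True if a shared flock succeeds on a scratch file in `directory`.

    Returns False when the kernel reports ENOLCK ("No locks available");
    any other OSError is re-raised. The scratch file is always removed.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            # Non-blocking so the probe never hangs on a contended lock.
            fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno == errno.ENOLCK:
                return False
            raise
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
        os.unlink(path)
```

Running it against a directory on the shared file system and against a container-local directory such as `/tmp` would show whether the failure is tied to the NFS mount or to the container as a whole.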

pcm32 · 2022-12-15T14:39:18Z

I'm also putting the mulled lock_dir in a local directory of the container (I presume it won't use too much space, correct?) as suggested by Marius on Gitter.

pcm32 · 2022-12-15T14:55:42Z

Having mulled lock dir in a local directory also seems to speed up initial provisioning of jobs. I used to sometimes see the job handler log stuck in that operation of resolving the container for some seconds when it was on the shared fs.
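To confirm that a candidate lock directory really sits on pod-local storage rather than on the NFS share, one could look it up in the mount table. The sketch below assumes the Linux `/proc/mounts` format (device, mount point, file system type, ...) and picks the longest mount point that contains the path; `filesystem_type` is a hypothetical helper, and symlinks are not resolved.

```python
import os


def filesystem_type(path, mounts_text):
    """Return the file system type of the mount that contains `path`.

    `mounts_text` is expected to look like the contents of /proc/mounts.
    The longest matching mount point wins; on a tie, the later entry wins.
    Returns None if no mount point matches.
    """
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        # Compare with trailing slashes so /a/b does not match /a/bc.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type
```

On a running pod this could be called as `filesystem_type(lock_dir, open("/proc/mounts").read())`; a result such as `nfs` or `nfs4` would mean the directory is still on the shared file system.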

nuwang · 2022-12-16T03:14:09Z

@pcm32 It sounds sensible to change the defaults here. Can you make a PR with the changes you made?

pcm32 · 2022-12-16T11:53:56Z

So in my current setup I left it here:

mulled_resolution_cache_lock_dir: /galaxy/server/local_mulled_cache

Are you happy with that location? Should we perhaps have a /galaxy/server/local directory where we move stuff that we explicitly want per pod? Otherwise, please let me know what the preferred path would be.

nuwang · 2022-12-16T13:18:42Z

I think the second option of having a /galaxy/server/local to house all per pod stuff sounds good. That can be the complementary dir to /galaxy/server/database.

pcm32 · 2022-12-19T09:14:30Z

As an orthogonal fix, I also moved from NFS v3 to v4.2. So far it seems to help as well.

mvdbeek mentioned this issue Dec 15, 2022

Support sqla / database backend for beaker galaxyproject/galaxy#15216

Closed

pcm32 mentioned this issue Dec 16, 2022

Sets the mulled resolution cache lock dir to local #402

Merged

claudiofr mentioned this issue Feb 22, 2023

Expose additional beaker caching backends galaxyproject/galaxy#15349

Merged
