Run out of file locks after an hour of running (around ~100 jobs) #399
Comments
@pcm32 What if
Well, the HDF5 operations were already happening before on that same container runtime, so I suspect it is a general lack of locks rather than something specific to HDF5 (as you can see, beaker is complaining as well). I'm now trying to mount as NFS v4.2 instead of NFS v3, to see whether the locking mechanism, which is improved there, behaves better.
I'm also putting the mulled lock_dir in a local directory of the container (I presume it won't use too much space, correct?) as suggested by Marius on Gitter.
Having the mulled lock dir in a local directory also seems to speed up the initial provisioning of jobs. I used to sometimes see the job handler log stuck for some seconds on resolving the container when the lock dir was on the shared fs.
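To illustrate why the location of the lock directory matters, here is a minimal sketch of taking an advisory lock on a file in a local directory. This is not Galaxy's or beaker's actual code: `local_file_lock` and the `mulled` lock name are hypothetical, and the temporary directory just stands in for "a local path inside the container". With a lock file on local disk, the lock is handled by the local kernel and does not depend on the NFS lock service.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def local_file_lock(lock_dir, name):
    """Hold an exclusive advisory lock on <lock_dir>/<name>.lock (hypothetical helper)."""
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, name + ".lock")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # Blocks until the lock is granted; raises OSError if the
        # underlying filesystem cannot provide a lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield path
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


if __name__ == "__main__":
    # Assumption: a directory on the container's local filesystem,
    # not on the shared NFS mount.
    lock_dir = tempfile.mkdtemp(prefix="local_locks_")
    with local_file_lock(lock_dir, "mulled") as lock_path:
        print("holding lock on", lock_path)
```

The lock files themselves are empty, so a directory like this should need next to no space.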
@pcm32 It sounds sensible to change the defaults here. Can you make a PR with the changes you made?
So in my current setup I left it here:
Are you happy with that location? Should we perhaps have a
I think the second option of having a
On an orthogonal note, I also moved from NFS v3 to v4.2. So far that seems to help as well.
After around an hour of running I'm starting to get all sorts of file locking errors in the main job handler:
These occur mostly when finding the container descriptor (I wonder if that could be cached, or if this is the cache), but also when it sets some metadata fields:
I wonder if some of the elements in the database directory could be moved out of the shared file system, and whether this would reduce the number of locks requested? Or would this be an OS-wide issue in the container? Which of these could live in RWO volumes, for instance? Thanks!
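One way to check whether the handler itself is accumulating locks is to look at `/proc/locks` on Linux. The sketch below is only a diagnostic idea, not something from the Galaxy code base: `count_locks` is a hypothetical helper that tallies the entries per PID and lock type (POSIX, FLOCK, ...). It assumes the usual `/proc/locks` layout (id, type, mode, access, pid, device:inode, start, end), and what is visible from inside a container may depend on the runtime.

```python
import os
from collections import Counter


def count_locks(proc_locks_text):
    """Count lock entries per (pid, lock type) from /proc/locks-style text."""
    counts = Counter()
    for line in proc_locks_text.splitlines():
        fields = line.split()
        # Blocked waiters are listed as "<id>: -> POSIX ..."; drop the arrow
        # so the remaining columns line up with regular entries.
        if "->" in fields:
            fields.remove("->")
        if len(fields) < 5:
            continue
        lock_type, pid = fields[1], fields[4]
        counts[(pid, lock_type)] += 1
    return counts


if __name__ == "__main__":
    if os.path.exists("/proc/locks"):
        with open("/proc/locks") as handle:
            counts = count_locks(handle.read())
        # Print the processes holding the most locks first.
        for (pid, lock_type), n in counts.most_common(10):
            print(f"pid={pid} type={lock_type} locks={n}")
```

Running this periodically in the job handler container while jobs are submitted would show whether the count keeps growing over the hour or stays flat.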