RELION race condition when opening GUI #1198
Comments
This is very interesting. It is possible that a file system change on one host does not immediately become visible on other hosts, but I think the POSIX specification guarantees that operations are ordered within a process. What is the backend of your NFS storage? Can't you fix this by changing server or client options (e.g. caching parameters)? Because nobody has reported this error before, I suspect it is very specific to your system, and we are reluctant to incorporate changes only to address a sub-optimally configured system.
Thank you for taking the time to look at this issue.
In this case, I was just opening the RELION GUI, so only a single host was involved. I agree it is very interesting and strange, but as I understand it, NFS is not POSIX compliant.
Our NFS storage system is a Dell PowerScale cluster (formerly known as Isilon) that has been working extremely well for several years, and no changes were made recently to any of the systems involved. I'm working on ways to test this more thoroughly, but because the issue is intermittent it is difficult to reproduce. I don't think our systems are sub-optimally configured, but there may be something related to caching causing the odd behavior.
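One way to test this outside RELION might be a small stress loop that repeats the same sequence seen at the end of the strace output (create a lock directory, create a file in it, unlink the file, then rmdir the directory) and counts how often the rmdir fails. This is only a sketch: the function name count_rmdir_failures and the names .lock_test and lock_test.star are made up for illustration and are not RELION code.

```cpp
// Hypothetical stress test, not part of RELION: repeat the
// mkdir -> create file -> unlink -> rmdir sequence and count rmdir failures.
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// Returns the number of iterations in which rmdir() failed even though the
// preceding unlink() succeeded. Returns -1 if the test itself could not run.
int count_rmdir_failures(const std::string &base_dir, int iterations)
{
    const std::string lock_dir = base_dir + "/.lock_test";       // assumed name
    const std::string lock_file = lock_dir + "/lock_test.star";  // assumed name
    int failures = 0;

    for (int i = 0; i < iterations; ++i) {
        if (mkdir(lock_dir.c_str(), 0755) != 0) {
            std::fprintf(stderr, "mkdir: %s\n", std::strerror(errno));
            return -1;
        }
        int fd = open(lock_file.c_str(), O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            std::fprintf(stderr, "open: %s\n", std::strerror(errno));
            rmdir(lock_dir.c_str());
            return -1;
        }
        close(fd);
        if (unlink(lock_file.c_str()) != 0) {
            std::fprintf(stderr, "unlink: %s\n", std::strerror(errno));
            return -1;
        }
        // The step under test: remove the directory right after the unlink.
        if (rmdir(lock_dir.c_str()) != 0) {
            ++failures;
            std::fprintf(stderr, "iteration %d: rmdir failed: %s\n",
                         i, std::strerror(errno));
            // Clean up so the next iteration can start; give up after ~1 s.
            int tries = 0;
            while (rmdir(lock_dir.c_str()) != 0 && errno != ENOENT) {
                if (++tries >= 100) return -1;
                usleep(10000);  // 10 ms
            }
        }
    }
    return failures;
}
```

Calling count_rmdir_failures with a directory on the NFS mount and a few thousand iterations, and then again with a directory on local disk, would show whether the failure depends on the storage. A non-zero count on NFS only would be something concrete to hand to the storage vendor.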
@clil16 Can you contact Dell support about the issue?
I just wanted to chip in and mention that we have seen very similar behavior starting in the last few months. We have several instances where the lock removal error is triggered by users running jobs. This is generally with the v5 beta, but across several different commits. We have seen it intermittently when users run jobs as usual via the GUI, and frequently when jobs are triggered via the RELION Schemer. We see this on multiple client machines, and with several different storage servers running both NFS v3 and v4.
@biochem-fan I can reach out to them regarding this issue, but it seems like this may be an issue for others as well, given the note by @sdrawson.
@clil16 Please ask Dell. I believe it is the file system's responsibility to keep the ordering of operations within a process. If they say it is not, please ask them what is guaranteed and what is not.
Just to make sure, aren't you using the
I will reach out to Dell and inquire. We are using the For reference, here are the mount options we specify on the client (the hostnames, paths, and IP addresses have been sanitized):
Just for the record, we just tried the 'sync' option and it is a lot worse than 'async'. RELION fails with the "error in removing directory .relion_lock" almost instantly. P.S. I am a coworker of @sdrawson.
We use
Recently, users with established RELION project directories have been losing the ability to open the RELION GUI in those directories. We have seen this with the following versions of RELION: 4.0.1-commit-e5c483 and 5.0-beta-4-commit-33b2b0. The behavior is the same each time, although occasionally the problem does not occur.
We are using NFS v4.2 for our storage.
Environment:
Dataset:
Job options (note.txt in the job directory):
Error message:
After tracing RELION with strace, I saw the following at the end of the output:
Once the program errors and exits, the file .relion_lock/lock_default_pipeline.star is in fact removed (as the return code of 0 from unlink indicates), but it seems the rmdir on the directory is occurring too fast, which is causing the error.
I was able to fix this issue with the following patch for pipeliner.cpp on the 4.0.2 git tag:
I could create a pull request as well if you'd like.