Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rmtree for nfs #2434

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

zulissimeta
Copy link
Contributor

@zulissimeta zulissimeta commented Aug 19, 2024

Summary of Changes

Some filesystems like NFS don't delete files right away, especially if another node might be holding access to them. This manifests as files like .nfs.... in the directory being deleted. rmtree will complain after deleting files and trying to delete the folder.

This PR adapts a utility from another open sources (BSD-3-clause) license repo that deals with this problem for clusters with NFS. Basically, if we hit one of the common errors (resource or device is busy), we wait a few seconds and try again.

I don't know how to reproduce this test without an NFS filesystem, but the existing tests should cover this working as expected for file deletion.

Requirements

Note: If you are an external contributor, you will see a comment from @buildbot-princeton. This is solely for the maintainers.

@buildbot-princeton
Copy link
Collaborator

Can one of the admins verify this patch?

Copy link

codecov bot commented Aug 19, 2024

Codecov Report

Attention: Patch coverage is 55.55556% with 8 lines in your changes missing coverage. Please review.

Project coverage is 98.16%. Comparing base (bd45301) to head (49f3756).
Report is 14 commits behind head on main.

Files with missing lines Patch % Lines
src/quacc/runners/prep.py 55.55% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2434      +/-   ##
==========================================
- Coverage   98.38%   98.16%   -0.23%     
==========================================
  Files          85       85              
  Lines        3477     3495      +18     
==========================================
+ Hits         3421     3431      +10     
- Misses         56       64       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Andrew-S-Rosen
Copy link
Member

@zulissimeta: Happy to accept something along these lines. First, a quick question though: is it unavoidable that the .nfs files will be present? For instance, this comment suggests that I/O on the node could cause this (e.g. from the logging module). I want to confirm that killing the logger before calling shutil.rmtree isn't sufficient here in terms of a fix.

Copy link
Member

@Andrew-S-Rosen Andrew-S-Rosen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor things (in addition to my question above).

Comment on lines +182 to +186
def rmtree(*args, max_retries=3, retry_wait=10, **kwargs):
"""
rmtree that will retry if we get common NFS errors (files still being deleted, etc)
Adapted from https://github.com/teojgo/reframe/blob/master/reframe/utility/osext.py
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can have some docstrings and type hints here, that would be greatly appreciated. For instance, it's not immediately clear what **kwargs is until you read the source code.

Comment on lines +196 to +197
if i == max_retries:
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we raise a specific exception and have a useful message?

Comment on lines +200 to +201
else:
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here with the exception to raise.

@@ -174,3 +177,25 @@ def terminate(tmpdir: Path | str, exception: Exception) -> None:
symlink_path.symlink_to(job_failed_dir, target_is_directory=True)

raise JobFailure(job_failed_dir, message=msg, parent_error=exception) from exception


def rmtree(*args, max_retries=3, retry_wait=10, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're good with a 10 s default? Not too long for you?

@zulissimeta
Copy link
Contributor Author

@zulissimeta: Happy to accept something along these lines. First, a quick question though: is it unavoidable that the .nfs files will be present? For instance, this comment suggests that I/O on the node could cause this (e.g. from the logging module). I want to confirm that killing the logger before calling shutil.rmtree isn't sufficient here in terms of a fix.

I haven't been able to reproduce it consistently. Is the logger writing local files?

@Andrew-S-Rosen
Copy link
Member

Andrew-S-Rosen commented Aug 20, 2024

@zulissimeta: Happy to accept something along these lines. First, a quick question though: is it unavoidable that the .nfs files will be present? For instance, this comment suggests that I/O on the node could cause this (e.g. from the logging module). I want to confirm that killing the logger before calling shutil.rmtree isn't sufficient here in terms of a fix.

I haven't been able to reproduce it consistently. Is the logger writing local files?

I think by default it writes to stderr, if relevant. So, depends on if you are writing stderr to disk. I suppose that's likely not the problem since your stderr would likely not be in the directory being purged. In any case, I just made this toggleable in #2436.

@Andrew-S-Rosen
Copy link
Member

@zulissimeta: Out of curiosity, are you running ASE relaxations? I think there might be a problem with the trajectory not closing properly. I have vague recollections of this...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants