fix(lfs): make HSM interactions more robust
== Use a timeout always with lfs

It seems pretty clear now that every `lfs` invocation should have a timeout to guard against Lustre locking up all our workers. So, `run_lfs` now uses a 1-minute timeout by default.

Then, handle both `lfs quota` and `lfs hsm_state` not returning successfully. In `StorageNode.update_avail_gb`, passing `None` now means don't update `avail_gb` (but do still update the last check time). If the current value of `avail_gb` is not null/None, a warning is issued. For `hsm_state` failing, the various places in `LustreHSM` where it was used now handle not being able to determine the state.

== Use hsm_action to track in-progress restores

I was reading through the Lustre source code last night to see if I could figure out what's up with `hsm_restore` timing out when files are in the process of being restored. I didn't figure that out because I discovered, instead, the `hsm_action` command, which tells you what HSM is currently doing to a file (e.g. is it working on restoring it?).

So, there's another `HSMState`: `RESTORING`, for files which HSM is in the process of restoring, and alpenhorn will now use that to track progress during restores. This removes the need for `_restore_retry` (because alpenhorn no longer needs to guess whether a restore is happening or not).

== Removing stats from LustreHSM

The idea here is: using `stat()` to check for file existence, while it avoids making an external `lfs` call, causes trouble when /nearline is acting up. The `stat()` will get a worker stuck in IO-wait, whereas the `lfs` call can be abandoned (by subprocess timeout), meaning the workers don't get stuck.

So, I've removed the stat from the top of `lfs.hsm_state`. I originally had it there because I thought that a cheap `stat` would save us from an expensive `lfs` call. I've modified `run_lfs` to detect and report missing files instead.

I've also re-implemented `LustreHSMNodeIO.exists` to use `lfs` instead of stat-ting the filesystem.
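
For illustration only, here is a minimal sketch of the two ideas from the first section: running an external command under a timeout so it can be abandoned, and treating a `None` result as "don't update `avail_gb`, but do record the check time". This is not the code from this commit. The names `run_command`, `NodeRecord` and `avail_gb_last_checked` are hypothetical stand-ins, and the real `run_lfs` and `StorageNode.update_avail_gb` may well differ.

```python
import logging
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import List, Optional

log = logging.getLogger(__name__)


def run_command(args: List[str], timeout: float = 60.0) -> Optional[str]:
    """Run an external command; return its stdout, or None on any failure.

    Hypothetical helper: a timeout, a missing executable and a non-zero
    exit status are all reported to the caller as None.
    """
    try:
        proc = subprocess.run(
            args, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # The child has been killed; the caller is free to carry on.
        log.warning("command timed out after %s s: %s", timeout, args)
        return None
    except OSError as exc:
        # For example, the executable could not be found.
        log.warning("could not run %s: %s", args, exc)
        return None

    if proc.returncode != 0:
        return None
    return proc.stdout


@dataclass
class NodeRecord:
    """Hypothetical stand-in for a storage node database record."""

    avail_gb: Optional[float] = None
    avail_gb_last_checked: Optional[float] = None


def update_avail_gb(node: NodeRecord, new_avail: Optional[float]) -> None:
    """Record a free-space check; None means the value is unknown."""
    if new_avail is None:
        # Keep the old value, but warn if that leaves a stale number behind.
        if node.avail_gb is not None:
            log.warning("unable to determine available space; keeping old value")
    else:
        node.avail_gb = new_avail
    # The check time is updated whether or not a value was obtained.
    node.avail_gb_last_checked = time.time()


# Usage: a command which sleeps for too long is abandoned instead of blocking.
result = run_command(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5
)
node = NodeRecord(avail_gb=12.5)
update_avail_gb(node, None if result is None else float(result))
```

The point of the design is that the caller always gets control back: a hung command costs at most the timeout, and the `None` result flows through to the bookkeeping without overwriting the last known value.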
1 parent 790ec51, commit 742ae05
Showing 11 changed files with 392 additions and 157 deletions.