fix(lfs): make HSM interactions more robust (#210)
## All `lfs` calls need a timeout

It seems pretty clear now that every `lfs` invocation should have a timeout to guard against Lustre locking up all our workers. So, `run_lfs` now uses a 1-minute timeout by default.

Then, handle both `lfs quota` and `lfs hsm_state` not returning successfully. In `StorageNode.update_avail_gb`, passing `None` now means don't update `avail_gb` (but do still update the last check time). If the current value of `avail_gb` is not null/None, a warning is issued.

For `hsm_state` failing, the various places in `LustreHSM` where it was used now handle not being able to determine the state.

Also removed an unnecessary `lfs quota` invocation from the idle update `LustreHSMNodeIO.release_files()`.

## Using `hsm_action` to probe HSM restores in-progress

I was reading through the Lustre source code last night to see if I could figure out what's up with `hsm_restore` timing out when files are in the process of being restored. I didn't figure that out because I discovered, instead, the `hsm_action` command, which tells you what HSM is currently doing to a file (e.g. is it working on restoring it?).

So, there's another `HSMState`: `RESTORING`, for files which HSM is in the process of restoring, and alpenhorn will now use that to track progress during restores. This removes the need for `_restore_retry` (because alpenhorn no longer needs to guess whether a restore is happening or not).

Closes #198 (because those constants are no longer present).

## Removing stats from LustreHSM

The idea here is: using `stat()` to check for file existence, as a way of avoiding an external `lfs` call, causes trouble when `/nearline` is acting up. The `stat()` will get a worker stuck in IO-wait, whereas the `lfs` call can be abandoned (by subprocess timeout), so the workers don't get stuck.

So, I've removed the `stat()` from the top of `lfs.hsm_state`. I originally had it there because I thought that a cheap `stat` would save us from an expensive `lfs` call. I've modified `run_lfs` to detect and report missing files instead. I've also re-implemented `LustreHSMNodeIO.exists` to use `lfs` instead of stat-ting the filesystem.
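
As an illustration of the timeout guard described in the first section, here is a minimal sketch of running an external command with a timeout and reporting failure as `None`. This is not alpenhorn's actual `run_lfs`: the function name `run_with_timeout`, its signature, and the choice to fold "timed out" and "non-zero exit" into a single `None` result are all assumptions made for the example.

```python
import subprocess
import sys


def run_with_timeout(args, timeout=60.0):
    """Hypothetical helper: run a command and return its stdout as text.

    Returns None if the command times out or exits with a non-zero status,
    so callers can treat the result as "could not be determined".
    """
    try:
        proc = subprocess.run(
            args,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            timeout=timeout,  # 1 minute by default, as in the commit message
            check=False,
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run() before this is raised.
        return None

    if proc.returncode != 0:
        return None

    return proc.stdout


if __name__ == "__main__":
    # Stand-in commands; a real caller would pass an `lfs` command line.
    print(run_with_timeout([sys.executable, "-c", "print('ok')"]))
    print(run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5
    ))
```

A caller in this style would check for `None` and skip the update (or report the state as unknown) rather than block waiting for the filesystem to respond.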
1 parent 790ec51, commit 546a464
Showing 12 changed files with 412 additions and 157 deletions.