
fix(lfs): make HSM interactions more robust #210

Merged 2 commits into master on Nov 8, 2024
Conversation

@ketiltrout (Member) commented Nov 6, 2024

All `lfs` calls need a timeout

It seems pretty clear now that `lfs` invocations should always have a timeout, to guard against Lustre locking up all our workers. So, `run_lfs` now uses a one-minute timeout by default.
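For illustration, here's a minimal sketch of that pattern (not the actual `run_lfs` code; the logging and return conventions are assumptions):

```python
import logging
import subprocess

log = logging.getLogger(__name__)


def run_lfs(*args: str, timeout: float = 60.0) -> str | None:
    """Run lfs with a timeout; return stdout, or None on any failure."""
    try:
        result = subprocess.run(
            ["lfs", *args],
            capture_output=True,
            text=True,
            timeout=timeout,  # one minute by default
        )
    except subprocess.TimeoutExpired:
        # Lustre is hung; abandon the call instead of wedging the worker.
        log.warning(f"lfs {' '.join(args)} timed out after {timeout}s")
        return None
    if result.returncode != 0:
        log.warning(f"lfs {' '.join(args)} failed: {result.stderr.strip()}")
        return None
    return result.stdout
```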

Then, handle both `lfs quota` and `lfs hsm_state` not returning successfully. In `StorageNode.update_avail_gb`, passing `None` now means "don't update `avail_gb`" (but do still update the last check time). If the current value of `avail_gb` is not null/`None`, a warning is issued.
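A rough sketch of those semantics (field and method names follow the description above; the surrounding model code is invented for illustration):

```python
import logging
from datetime import datetime, timezone

log = logging.getLogger(__name__)


class StorageNode:
    """Illustrative stand-in for the real model class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.avail_gb: float | None = None
        self.avail_gb_last_checked: datetime | None = None

    def update_avail_gb(self, new_avail_gb: float | None) -> None:
        if new_avail_gb is None:
            # Measurement failed: leave avail_gb alone, but warn if we
            # previously had a good value.
            if self.avail_gb is not None:
                log.warning(f"Unable to update available space for {self.name}")
        else:
            self.avail_gb = new_avail_gb
        # The check time is refreshed whether or not we got a value.
        self.avail_gb_last_checked = datetime.now(timezone.utc)
```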

For `hsm_state` failing, the various places in `LustreHSM` that use it now handle being unable to determine the state.

Also removed an unnecessary `lfs quota` invocation from the idle-update method `LustreHSMNodeIO.release_files()`.

Using `hsm_action` to probe in-progress HSM restores

I was reading through the Lustre source code last night to see if I could figure out what's up with `hsm_restore` timing out when files are in the process of being restored.

I didn't figure that out, because I instead discovered the `hsm_action` command, which tells you what HSM is currently doing to a file (e.g. is it working on restoring it?).

So, there's a new `HSMState`: `RESTORING`, for files which HSM is in the process of restoring, and alpenhorn will now use that to track progress during restores. This removes the need for `_restore_retry` (because alpenhorn no longer needs to guess whether a restore is happening or not).
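Roughly, the probe order this implies (the `HSMState` members besides `RESTORING` and the output parsing are illustrative guesses, reusing the `run_lfs` sketch from above):

```python
from enum import Enum, auto


class HSMState(Enum):
    UNARCHIVED = auto()  # no copy on tape yet
    RESTORED = auto()    # on tape, data also present on disk
    RELEASED = auto()    # on tape, disk copy released
    RESTORING = auto()   # new in this PR: a restore is in progress


def hsm_state(path: str) -> HSMState | None:
    """Probe path's HSM state; None means it couldn't be determined."""
    # First ask what HSM is currently *doing* to the file: while a file
    # is being staged back from tape, hsm_action reports the restore.
    action = run_lfs("hsm_action", path)
    if action is None:
        return None
    if "RESTORE" in action:
        return HSMState.RESTORING

    # Otherwise fall back to the static hsm_state flags.
    state = run_lfs("hsm_state", path)
    if state is None:
        return None
    if "archived" not in state:
        return HSMState.UNARCHIVED
    if "released" in state:
        return HSMState.RELEASED
    return HSMState.RESTORED
```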

Closes #198 (because those constants are no longer present).

Removing `stat()`s from `LustreHSM`

The idea here is: using `stat()` to check for file existence, while avoiding an external `lfs` call, causes trouble when /nearline is acting up, because the `stat()` will get a worker stuck in IO-wait, while the `lfs` call can be abandoned (by subprocess timeout), meaning the workers don't get stuck.

So, I've removed the `stat()` from the top of `lfs.hsm_state`. I originally had it there because I thought that a cheap `stat()` would save us from an expensive `lfs` call. I've modified `run_lfs` to detect and report missing files instead.

I've also re-implemented `LustreHSMNodeIO.exists` to use `lfs` instead of stat-ting the filesystem.
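Sketching how that might hang together (the sentinel, the stderr matching, and the `exists` wiring are all assumptions extending the earlier sketch, not the actual alpenhorn code):

```python
import subprocess

MISSING = object()  # sentinel: lfs says the file doesn't exist


def run_lfs(*args: str, timeout: float = 60.0):
    """Like the sketch above, but missing files get their own result."""
    try:
        result = subprocess.run(
            ["lfs", *args], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None  # Lustre hung; give up rather than block in IO-wait
    if result.returncode != 0:
        # Guessed stderr text; the real check may differ.
        if "No such file or directory" in result.stderr:
            return MISSING
        return None
    return result.stdout


def exists(path: str) -> bool:
    """LustreHSMNodeIO.exists-style check, via lfs instead of stat()."""
    # Note: a timeout (None) also reads as "not there" in this sketch;
    # the real code presumably distinguishes the two cases.
    return run_lfs("hsm_state", path) not in (None, MISSING)
```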

@ketiltrout marked this pull request as draft on November 6, 2024 at 18:47
@ketiltrout (Member, Author)

Draft while I do some testing on cedar...

@ketiltrout (Member, Author) commented Nov 6, 2024

I'm going to leave this as a draft until /nearline gets fixed and I can test the hsm_action call. I've started running this branch on the robot node.

@ketiltrout force-pushed the restoring branch 2 times, most recently from f292d56 to 9378aa8, on November 6, 2024 at 21:47
@ketiltrout changed the title from "fix(lfs): use hsm_action to track in-progress restores" to "fix(lfs): make HSM interactions more robust" on Nov 6, 2024
ketiltrout added a commit that referenced this pull request Nov 7, 2024
Something I found while testing #210:

If we want db._base.threadsafe to be imported into `__init__`, it can't
be a scalar (because it's going to change after the import).

Instead, make it a function.

Bonus: no need to try to import `subprocess` every time a command is
run. Just import it once at module import.
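A minimal sketch of the scalar-vs-function distinction that commit describes (illustrative, not the actual `db._base` code):

```python
# db/_base.py, before (broken): importers copy the value once, at
# import time, so later reassignment here is invisible to them.
#
#     threadsafe = False
#
# After: expose a function so callers always read the live value.
_threadsafe = False


def threadsafe() -> bool:
    """Return whether the database connection is currently threadsafe."""
    return _threadsafe
```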
@ketiltrout force-pushed the restoring branch 5 times, most recently from 27dbd35 to 956f763, on November 7, 2024 at 22:57
@ketiltrout (Member, Author)

The nearline HSM on cedar has unstuck itself sufficiently that I'm fairly confident this is now working as intended.

alpenhorn/io/lfs.py: review thread (outdated, resolved)
@ketiltrout merged commit 546a464 into master on Nov 8, 2024
3 checks passed
@ketiltrout deleted the restoring branch on November 8, 2024 at 20:25
Successfully merging this pull request may close these issues: Add hsm_restore times to config (#198)