
fix(lfs): make HSM interactions more robust #210

Merged 2 commits into master on Nov 8, 2024
Conversation

@ketiltrout (Member) commented Nov 6, 2024

All `lfs` calls need a timeout

It seems pretty clear now that `lfs` invocations should always have a timeout, to guard against Lustre locking up all our workers. So, `run_lfs` now uses a one-minute timeout by default.
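For illustration, here's a minimal sketch of that pattern (not the actual `run_lfs` code; the logging and return conventions are assumptions):

```python
import logging
import subprocess

log = logging.getLogger(__name__)


def run_lfs(*args: str, timeout: float = 60.0) -> str | None:
    """Run lfs with a timeout; return stdout, or None on any failure."""
    try:
        result = subprocess.run(
            ["lfs", *args],
            capture_output=True,
            text=True,
            timeout=timeout,  # one minute by default
        )
    except subprocess.TimeoutExpired:
        # Lustre is hung; abandon the call instead of wedging the worker.
        log.warning(f"lfs {' '.join(args)} timed out after {timeout}s")
        return None
    if result.returncode != 0:
        log.warning(f"lfs {' '.join(args)} failed: {result.stderr.strip()}")
        return None
    return result.stdout
```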

Then, handle both `lfs quota` and `lfs hsm_state` not returning successfully. In `StorageNode.update_avail_gb`, passing `None` now means "don't update `avail_gb`" (but do still update the last check time). If the current value of `avail_gb` is not null/`None`, a warning is issued.
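A rough sketch of those semantics (field and method names follow the description above; the surrounding model code is invented for illustration):

```python
import logging
from datetime import datetime, timezone

log = logging.getLogger(__name__)


class StorageNode:
    """Illustrative stand-in for the real model class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.avail_gb: float | None = None
        self.avail_gb_last_checked: datetime | None = None

    def update_avail_gb(self, new_avail_gb: float | None) -> None:
        if new_avail_gb is None:
            # Measurement failed: leave avail_gb alone, but warn if we
            # previously had a good value.
            if self.avail_gb is not None:
                log.warning(f"Unable to update available space for {self.name}")
        else:
            self.avail_gb = new_avail_gb
        # The check time is refreshed whether or not we got a value.
        self.avail_gb_last_checked = datetime.now(timezone.utc)
```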

For `hsm_state` failing, the various places in `LustreHSM` that use it now handle being unable to determine the state.

Also removed an unnecessary `lfs quota` invocation from the idle-update method `LustreHSMNodeIO.release_files()`.

Using `hsm_action` to probe in-progress HSM restores

I was reading through the Lustre source code last night to see if I could figure out what's up with `hsm_restore` timing out when files are in the process of being restored.

I didn't figure that out, because I instead discovered the `hsm_action` command, which tells you what HSM is currently doing to a file (e.g. is it working on restoring it?).

So, there's a new `HSMState`: `RESTORING`, for files which HSM is in the process of restoring, and alpenhorn will now use that to track progress during restores. This removes the need for `_restore_retry` (because alpenhorn no longer needs to guess whether a restore is happening or not).
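Roughly, the probe order this implies (the `HSMState` members besides `RESTORING` and the output parsing are illustrative guesses, reusing the `run_lfs` sketch from above):

```python
from enum import Enum, auto


class HSMState(Enum):
    UNARCHIVED = auto()  # no copy on tape yet
    RESTORED = auto()    # on tape, data also present on disk
    RELEASED = auto()    # on tape, disk copy released
    RESTORING = auto()   # new in this PR: a restore is in progress


def hsm_state(path: str) -> HSMState | None:
    """Probe path's HSM state; None means it couldn't be determined."""
    # First ask what HSM is currently *doing* to the file: while a file
    # is being staged back from tape, hsm_action reports the restore.
    action = run_lfs("hsm_action", path)
    if action is None:
        return None
    if "RESTORE" in action:
        return HSMState.RESTORING

    # Otherwise fall back to the static hsm_state flags.
    state = run_lfs("hsm_state", path)
    if state is None:
        return None
    if "archived" not in state:
        return HSMState.UNARCHIVED
    if "released" in state:
        return HSMState.RELEASED
    return HSMState.RESTORED
```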

Closes #198 (because those constants are no longer present).

Removing `stat()`s from `LustreHSM`

The idea here is: using `stat()` to check for file existence, while avoiding an external `lfs` call, causes trouble when /nearline is acting up, because the `stat()` will get a worker stuck in IO-wait, while the `lfs` call can be abandoned (by subprocess timeout), meaning the workers don't get stuck.

So, I've removed the `stat()` from the top of `lfs.hsm_state`. I originally had it there because I thought that a cheap `stat()` would save us from an expensive `lfs` call. I've modified `run_lfs` to detect and report missing files instead.

I've also re-implemented `LustreHSMNodeIO.exists` to use `lfs` instead of stat-ting the filesystem.
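Sketching how that might hang together (the sentinel, the stderr matching, and the `exists` wiring are all assumptions extending the earlier sketch, not the actual alpenhorn code):

```python
import subprocess

MISSING = object()  # sentinel: lfs says the file doesn't exist


def run_lfs(*args: str, timeout: float = 60.0):
    """Like the sketch above, but missing files get their own result."""
    try:
        result = subprocess.run(
            ["lfs", *args], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None  # Lustre hung; give up rather than block in IO-wait
    if result.returncode != 0:
        # Guessed stderr text; the real check may differ.
        if "No such file or directory" in result.stderr:
            return MISSING
        return None
    return result.stdout


def exists(path: str) -> bool:
    """LustreHSMNodeIO.exists-style check, via lfs instead of stat()."""
    # Note: a timeout (None) also reads as "not there" in this sketch;
    # the real code presumably distinguishes the two cases.
    return run_lfs("hsm_state", path) not in (None, MISSING)
```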

@ketiltrout marked this pull request as draft on November 6, 2024 at 18:47
@ketiltrout (Member, Author)

Draft while I do some testing on cedar...

@ketiltrout (Member, Author) commented Nov 6, 2024

I'm going to leave this as a draft until /nearline gets fixed and I can test the hsm_action call. I've started running this branch on the robot node.

@ketiltrout force-pushed the restoring branch 2 times, most recently from f292d56 to 9378aa8, on November 6, 2024 at 21:47
@ketiltrout changed the title from "fix(lfs): use hsm_action to track in-progress restores" to "fix(lfs): make HSM interactions more robust" on Nov 6, 2024
ketiltrout added a commit that referenced this pull request Nov 7, 2024
Something I found while testing #210:

If we want db._base.threadsafe to be imported into `__init__`, it can't
be a scalar (because it's going to change after the import).

Instead, make it a function.

Bonus: no need to try to import `subprocess` every time a command is
run. Just import it once at module import.
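A minimal sketch of the scalar-vs-function distinction that commit describes (illustrative, not the actual `db._base` code):

```python
# db/_base.py, before (broken): importers copy the value once, at
# import time, so later reassignment here is invisible to them.
#
#     threadsafe = False
#
# After: expose a function so callers always read the live value.
_threadsafe = False


def threadsafe() -> bool:
    """Return whether the database connection is currently threadsafe."""
    return _threadsafe
```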
@ketiltrout force-pushed the restoring branch 5 times, most recently from 27dbd35 to 956f763, on November 7, 2024 at 22:57
@ketiltrout (Member, Author)

The nearline HSM on cedar has unstuck itself sufficiently that I'm fairly confident this is now working as intended.

alpenhorn/io/lfs.py: review thread (outdated, resolved)
@ketiltrout merged commit 546a464 into master on Nov 8, 2024
3 checks passed
@ketiltrout deleted the restoring branch on November 8, 2024 at 20:25
Successfully merging this pull request may close these issues: Add hsm_restore times to config (#198)